OCR Scanned PDF for Search Indexing

pjaromin
Champ on-the-rise
I'm relatively new to Alfresco and have recently set up an environment where scanned bitmaps are run through a text/plain transformer that uses Tesseract OCR. This works brilliantly for single-page documents scanned into PNG, JPEG, TIFF, etc.

For multi-page documents, my scanner creates a PDF. However, the standard text/plain transformer for PDFs obviously doesn't do OCR. For documents in this "scanner" space I'd always want to run them through OCR (probably with a custom transformer, which I have no trouble coding). However, I don't wish to remove or override the PDFBox transformer for the majority of PDFs, which already contain extractable text.

So what's the best solution here? I'm thinking I could extend the PDFBox one to include OCR and merge the results, but this seems a bit messy. Is there a way to chain multiple transformers together for a given mime-type? Or is there a way to specify a specific transformer based on the space?

Or should I create a rule that runs on PDFs in this space to OCR them and place the text in a specific property that's set to searchable? Or perhaps something else I'm completely oblivious to.

Suggestions?

Thanks!

-Patrick
2 REPLIES

zaizi
Champ in-the-making
Write a custom transformer that uses PDFBox to extract the text first. If the extracted character count is 0, or below some small threshold, run the document through OCR instead.

When we looked into this before, the recommended way to tell whether a PDF is an image PDF or a text PDF was to check whether it contains any embedded fonts.
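
The threshold check described above could be sketched roughly as follows. This is only an illustration, not Alfresco or PDFBox API: the class name `OcrFallback`, the method `needsOcr`, and the cutoff `MIN_CHARS_PER_PAGE` are all hypothetical, and the sketch assumes the text and page count have already been obtained from PDFBox by the calling transformer. The embedded-font check is not covered here.

```java
public final class OcrFallback {

    // Hypothetical cutoff: a page with fewer non-whitespace characters
    // than this (on average) is treated as a scanned image. Tune per corpus.
    static final int MIN_CHARS_PER_PAGE = 20;

    // Counts characters that are not whitespace, so that a PDF whose text
    // layer contains only spaces and line breaks still counts as empty.
    static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Returns true when the extracted text is too short for the number of
    // pages, i.e. the document should be sent through OCR instead.
    static boolean needsOcr(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        if (extractedText == null) {
            return true;
        }
        long required = (long) MIN_CHARS_PER_PAGE * pageCount;
        return countNonWhitespace(extractedText) < required;
    }

    public static void main(String[] args) {
        // A three-page scan with no text layer.
        System.out.println(needsOcr("", 3));
        // A one-page PDF with a normal text layer.
        System.out.println(needsOcr("Invoice number 2041, total due 150.00 EUR", 1));
    }
}
```

Scaling the cutoff by page count, rather than using one fixed number, avoids treating a long scan with a single stamped line of text as a text PDF. A custom transformer would call something like `needsOcr` after the PDFBox extraction and hand the document to the OCR step only when it returns true.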

Ainga

wmay
Champ in-the-making
Hi,

We have implemented an OCR server integrated with Alfresco, which can be used as a transformer or called from JavaScript and Java. It runs on a separate OCR server and supports ABBYY and Google OCR. For more information, see https://forums.alfresco.com/en/viewtopic.php?f=33&t=44739