Hyland Connect

pjaromin · ‎05-27-2012

I'm relatively new to Alfresco and have recently set up an environment where scanned bitmaps are converted to text/plain by a transformer that runs them through Tesseract OCR. This works brilliantly for single-page documents scanned into PNG, JPEG, TIFF, etc.

For multi-page documents, my scanner will create a PDF. However, the standard transformer for text/plain obviously doesn't do OCR. For documents in this "scanner" space I'd always want to run them through OCR (probably with a custom transformer, which I'd have no trouble coding). However, I don't wish to remove or override the PDFBox transformer for the majority of PDFs that already contain extractable text.

So what's the best solution here? I'm thinking I could extend the PDFBox one to include OCR and merge the results, but this seems a bit messy. Is there a way to chain multiple transformers together for a given mime-type? Or is there a way to specify a specific transformer based on the space?

Or should I create a rule that runs on PDFs in this space to OCR them and place the text in a specific property that's set to searchable? Or perhaps something else I'm completely oblivious to.

Suggestions?

Thanks!

-Patrick

zaizi · ‎05-29-2012

Write a custom transformer that uses PDFBox to extract the text. If the extracted character count is 0 or below some small threshold, run the PDF through OCR instead.

When we looked into this before, the recommendation to determine if a PDF is an image or a text PDF was to see if there were any embedded fonts.
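The text-count check described above could be sketched roughly as follows. This is only an illustration of the decision logic, not Alfresco or PDFBox API: the class `OcrFallback`, its methods, and the threshold of 20 characters per page are all assumptions made up for the example. The PDFBox extraction and the OCR call are passed in as plain suppliers so the snippet stays self-contained.

```java
import java.util.function.Supplier;

/** Sketch only: all names and the threshold here are hypothetical. */
final class OcrFallback {

    /** Assumed cutoff: fewer visible characters per page than this suggests a scanned PDF. */
    static final int MIN_CHARS_PER_PAGE = 20;

    /** Returns true when the extracted text is too sparse to be a real text layer. */
    static boolean needsOcr(String extractedText, int pageCount) {
        if (extractedText == null) {
            return true;
        }
        // Count non-whitespace characters only, so blank lines and form feeds do not count as text.
        long visibleChars = extractedText.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        int pages = Math.max(pageCount, 1);
        return visibleChars < (long) MIN_CHARS_PER_PAGE * pages;
    }

    /**
     * Tries the normal text extraction first and only invokes OCR when the result looks empty.
     * Both suppliers stand in for whatever the real transformer would call.
     */
    static String extractText(Supplier<String> pdfTextExtractor,
                              Supplier<String> ocrExtractor,
                              int pageCount) {
        String text = pdfTextExtractor.get();
        return needsOcr(text, pageCount) ? ocrExtractor.get() : text;
    }
}
```

Inside a real transformer, the first supplier would wrap the PDFBox extraction and the second the OCR step; the threshold would likely need tuning, since a scanned PDF can still carry a few characters of metadata or stamp text.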

Ainga

wmay · ‎08-01-2012

Hi,

We have implemented an OCR server integrated with Alfresco, which can be used as a transformer or via JavaScript and Java. It runs on a separate OCR server and supports ABBYY and Google OCR. For more information, see https://forums.alfresco.com/en/viewtopic.php?f=33&t=44739


OCR Scanned PDF for Search Indexing