topic Re: OCR Scanned PDF for Search Indexing in Alfresco Archive

OCR Scanned PDF for Search Indexing

pjaromin — Sun, 27 May 2012 18:46:24 GMT

I'm relatively new to Alfresco and have recently set up an environment where scanned bitmaps are run through a transformer to text/plain using Tesseract OCR. This works brilliantly for single-page documents scanned into PNG, JPEG, TIFF, etc. For multi-page documents my scanner will create a PDF. Ho

Re: OCR Scanned PDF for Search Indexing

zaizi — Tue, 29 May 2012 20:34:24 GMT

Write a custom transformer that uses PDFBox to extract the text. If the extracted character count is 0 or below a small threshold, run the document through OCR.

When we looked into this before, the recommended way to tell an image-only PDF from a text PDF was to check whether it contains any embedded fonts.
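
To make the two heuristics concrete, here is a minimal sketch of the decision step only, in plain Java. It assumes the text has already been extracted (for example with PDFBox's PDFTextStripper) and that the font check has been done separately. The class name OcrDecision, the method names, the hasFonts flag and the minChars threshold are all made up for illustration; they are not part of Alfresco or PDFBox.

```java
// Hypothetical helper: decides whether a PDF should be sent to OCR.
// Inputs are assumed to come from a prior extraction step (not shown).
final class OcrDecision {
    private OcrDecision() {}

    /** Counts non-whitespace characters in the extracted text (null counts as 0). */
    static int visibleCharCount(String extractedText) {
        if (extractedText == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < extractedText.length(); i++) {
            if (!Character.isWhitespace(extractedText.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true when the PDF looks like a scanned image:
     * either it has no fonts at all, or the extracted text is
     * shorter than minChars (an assumed, tunable threshold).
     */
    static boolean needsOcr(String extractedText, boolean hasFonts, int minChars) {
        if (!hasFonts) {
            return true;
        }
        return visibleCharCount(extractedText) < minChars;
    }
}
```

For example, OcrDecision.needsOcr("", false, 20) would route the file to OCR, while a PDF with fonts and a full sentence of extracted text would be indexed directly. The threshold probably needs tuning against real scans, since a scanner may embed a few characters of metadata text even in an image-only PDF.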

Ainga

Re: OCR Scanned PDF for Search Indexing

wmay — Wed, 01 Aug 2012 14:39:47 GMT

Hi,

We have implemented an OCR integration for Alfresco, which can be used as a transformer or called from JavaScript and Java. It runs on a separate OCR server and supports Abbyy and Google OCR. For more information, see https://forums.alfresco.com/en/viewtopic.php?f=33&t=44739