OCR pdf Scans search Alfresco Community 5

r_grandits — Tue, 22 Sep 2015 06:14:50 GMT

Hi, I'm currently evaluating Alfresco Community. My key requirement is to use Alfresco as a document management system, with the goal of paperless work. So I integrated Tesseract to OCR my uploaded TIFF, JPG and PNG files. It works fine: all the text ends up in the search index. But what I need, woul

Re: OCR pdf Scans search Alfresco Community 5

villdre — Tue, 22 Sep 2015 06:54:10 GMT

Hi - we faced the same problem. Solved it by first converting the PDF to JPEG or PNG files and then running tesseract on the JPEG or PNG files.

I use the following command to burst a multi-page PDF into individual pages:

pdftk test.pdf burst

Then convert each PDF page into JPEG:

convert -density 175 page1.pdf temp_1.jpg

Then run Tesseract on each JPEG, using the PDF output option:

tesseract temp_1.jpg target_page_1 pdf

Then use pdfunite to merge all the per-page PDF files into one:

pdfunite $tempfolder_tess/*.pdf final.pdf
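
As a rough sketch, the four steps above could be chained into one shell function. The function name ocr_pdf, the temporary working directory and the pg_%04d page naming are my own choices for illustration, not part of the setup described above. It assumes pdftk, ImageMagick's convert, tesseract and pdfunite are all on the PATH, and it keeps the 175 dpi density from the command above.

```shell
#!/bin/sh
# Sketch only: OCR a scanned PDF page by page and merge the results.
# Usage (hypothetical): ocr_pdf scan.pdf searchable.pdf
set -eu

# The body runs in a subshell "( ... )" so its variables stay local.
ocr_pdf() (
  input="$1"
  output="$2"
  workdir="$(mktemp -d)"

  # 1. Burst the PDF into one file per page: pg_0001.pdf, pg_0002.pdf, ...
  #    Zero-padded numbers keep the shell glob below in page order.
  pdftk "$input" burst output "$workdir/pg_%04d.pdf"

  for page in "$workdir"/pg_*.pdf; do
    base="${page%.pdf}"
    # 2. Rasterize the page to JPEG at 175 dpi.
    convert -density 175 "$page" "$base.jpg"
    # 3. OCR the image; tesseract appends ".pdf" to the output base name.
    tesseract "$base.jpg" "${base}_ocr" pdf
  done

  # 4. Merge the searchable per-page PDFs into the final document.
  pdfunite "$workdir"/pg_*_ocr.pdf "$output"

  rm -rf "$workdir"
)
```

Calling ocr_pdf test.pdf final.pdf should then leave a searchable final.pdf next to the original scan. If a step fails, the temporary directory is left behind, which can help with debugging.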

Re: OCR pdf Scans search Alfresco Community 5

aadamnz — Mon, 09 May 2016 23:34:53 GMT

Hi Rene,
I have Alfresco 5.0.d installed on Ubuntu 14.04, set up with the Alfresco install wizard.
When I try to place an XML bean in shared/classes/alfresco/extension, I can't log in to Alfresco Share and
I get Solr errors. Can you tell me how you integrated Tesseract with Alfresco?
Thanks,
Aadam