Hyland Connect

r_grandits · 09-22-2015

Hi,

I'm currently evaluating Alfresco Community.
My key goal is to integrate Alfresco as a document management system for paperless work.
So I integrated Tesseract to OCR my uploaded TIFF, JPG and PNG files. It works fine; all the text ends up in the search index.

But what I also need is for Tesseract to process my uploaded scanned PDF files; we have a lot of them.
How can I do this with Tesseract?

Thank you,
Rene

villdre · 09-22-2015

Hi - we faced the same problem. Solved it by first converting the PDF to JPEG or PNG files and then running tesseract on the JPEG or PNG files.

I use the following command to burst a multi-page PDF into individual pages:

pdftk test.pdf burst

Then convert each PDF page into JPEG:

convert -density 175 page1.pdf temp_1.jpg

Then run Tesseract on each JPEG, using the PDF output option:

tesseract temp_1.jpg target_page_1 pdf

Then use pdfunite to merge all the per-page PDF files into one:

pdfunite $tempfolder_tess/*.pdf final.pdf
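
For anyone who wants the four steps in one go, here is a rough sketch as a shell function. It assumes pdftk, ImageMagick (convert, which typically needs Ghostscript to read PDFs), tesseract and pdfunite are all on the PATH. The function names, the `_ocr` suffix and the temporary directory layout are made up for illustration; only the tool invocations follow the commands above.

```shell
#!/bin/sh
# Sketch: turn a scanned multi-page PDF into a searchable PDF.
# Assumes pdftk, convert (ImageMagick), tesseract and pdfunite are installed.

# Hypothetical helper: default output name for an input PDF,
# e.g. scans/test.pdf -> scans/test_ocr.pdf
ocr_output_name() {
  printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical wrapper around the four steps: ocr_pdf INPUT.pdf OUTPUT.pdf
ocr_pdf() {
  input="$1"
  output="$2"
  # Scratch directory; it is left in place if a step fails, for inspection.
  workdir="$(mktemp -d)" || return 1

  # 1. Burst into one PDF per page: pg_0001.pdf, pg_0002.pdf, ...
  pdftk "$input" burst output "$workdir/pg_%04d.pdf" || return 1

  for page in "$workdir"/pg_*.pdf; do
    stem="${page%.pdf}"
    # 2. Rasterize the page at 175 dpi.
    convert -density 175 "$page" "$stem.jpg" || return 1
    # 3. OCR the image; tesseract appends ".pdf" to the output base name.
    tesseract "$stem.jpg" "${stem}_ocr" pdf || return 1
  done

  # 4. Merge the OCRed pages; zero-padded names keep the glob in page order.
  pdfunite "$workdir"/pg_*_ocr.pdf "$output" || return 1
  rm -rf "$workdir"
}
```

After sourcing the file, a call would look like `ocr_pdf test.pdf "$(ocr_output_name test.pdf)"`. Bursting into a scratch directory with zero-padded page names is just one way to keep the pages in order for the final merge.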

aadamnz · 05-09-2016

Hi Rene,
I have Alfresco 5.0.d installed on Ubuntu 14.04 using the Alfresco install wizard.
When I try to place an XML bean file in shared/classes/alfresco/extension, I can't log in to Alfresco Share and
I get Solr errors. Can you tell me how you integrated Tesseract with Alfresco?
Thanks,
Aadam


OCR pdf Scans search Alfresco Community 5