OCR of PDF scans for search in Alfresco Community 5

r_grandits
Champ in-the-making
Hi,

I'm currently evaluating Alfresco Community.
My key requirement is to use Alfresco as a document management system, with the goal of paperless work.
I integrated Tesseract to OCR my uploaded TIFF, JPG and PNG files. It works fine: all the text ends up in the search index.

What I still need is for Tesseract to also process my uploaded scanned PDF files; we have a lot of them.
How can I do this with Tesseract?

Thank you,
Rene
2 REPLIES

villdre
Confirmed Champ
Hi, we faced the same problem. We solved it by first converting the PDF to JPEG or PNG files and then running Tesseract on those image files.

I use the following command to burst a multi-page PDF into individual pages:
pdftk test.pdf burst


Then convert each single-page PDF into a JPEG with ImageMagick:
convert -density 175 page1.pdf temp_1.jpg


Then run Tesseract on each JPEG, using the PDF output option:
tesseract temp_1.jpg target_page_1 pdf


Then use pdfunite to merge all the per-page PDF files into one:
pdfunite $tempfolder_tess/*.pdf final.pdf
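
The four steps above could be chained in one shell function, roughly like the sketch below. The function name ocr_pdf, its two arguments and the pg_%04d file naming are assumptions made for illustration, not something from our actual script. It also assumes pdftk, ImageMagick's convert, tesseract and pdfunite are all on the PATH, and it says nothing about how to hook the result into Alfresco.

```shell
#!/usr/bin/env bash
# Sketch only: burst, rasterize, OCR and merge a scanned PDF.
# ocr_pdf is a hypothetical helper name, not part of any of the tools used.
ocr_pdf() {
    local input="$1" output="$2"
    local work page base
    work="$(mktemp -d)" || return 1

    # 1. Burst the PDF into one file per page: pg_0001.pdf, pg_0002.pdf, ...
    pdftk "$input" burst output "$work/pg_%04d.pdf" || return 1

    # 2 + 3. Rasterize each page, then have Tesseract write a searchable PDF.
    # The glob is expanded once, so the *_ocr.pdf files created here are not revisited.
    for page in "$work"/pg_*.pdf; do
        base="${page%.pdf}"
        convert -density 175 "$page" "$base.jpg" || return 1
        tesseract "$base.jpg" "${base}_ocr" pdf || return 1
    done

    # 4. Merge the per-page results; zero-padded names keep the page order.
    pdfunite "$work"/pg_*_ocr.pdf "$output" || return 1

    rm -rf "$work"
}
```

It would be called as: ocr_pdf scan.pdf scan_searchable.pdf. The zero-padded page numbers matter, because a plain *.pdf glob would otherwise sort page 10 before page 2 when merging.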


aadamnz
Champ in-the-making
Hi Rene,
I have Alfresco 5.0.d installed on Ubuntu 14.04 using the Alfresco install wizard.
When I try to place an XML bean in shared/classes/alfresco/extension, I can't log in to Alfresco Share and
I get Solr errors. Can you tell me how you integrated Tesseract with Alfresco?
Thanks,
Aadam