OCR pdf Scans search Alfresco Community 5

09-22-2015 02:14 AM
Hi,
I'm currently evaluating Alfresco Community.
My main goal is to use Alfresco as a document management system to support paperless work.
I have integrated Tesseract to OCR my uploaded TIFF, JPG and PNG files. It works fine, and all the text ends up in the search index.
What I still need is for Tesseract to process my uploaded scanned PDF files as well; we have a lot of them.
How can I achieve this with Tesseract?
Thank you,
Rene
2 REPLIES

09-22-2015 02:54 AM
Hi - we faced the same problem. Solved it by first converting the PDF to JPEG or PNG files and then running tesseract on the JPEG or PNG files.
I use the following command to burst a multi-page PDF into individual pages:
pdftk test.pdf burst
Then convert each PDF page into a JPEG:
convert -density 175 page1.pdf temp_1.jpg
Then run Tesseract on each JPEG, using the PDF output option:
tesseract temp_1.jpg target_page_1 pdf
Then use pdfunite to merge all the resulting PDF files:
pdfunite $tempfolder_tess/*.pdf final.pdf
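For anyone who wants to chain the four steps above, here is a rough sketch of how they might be wrapped in one script. It is not the script from this reply: the function name `ocr_pdf`, the `run` helper, the `DRY_RUN` switch and the `page_`/`ocr_` file names are all made up for illustration, and it assumes pdftk, ImageMagick (`convert`), tesseract and pdfunite are on the PATH. pdftk names burst pages pg_0001.pdf, pg_0002.pdf and so on by default, so the sketch passes an explicit zero-padded output pattern to keep the pages in order when they are merged.

```shell
#!/usr/bin/env bash
# Sketch only: burst a scanned PDF, OCR each page, merge the results.
# Assumes pdftk, ImageMagick, tesseract and pdfunite are installed.
set -eu

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: ocr_pdf input.pdf output.pdf [working-folder]
ocr_pdf() {
  local input="$1"
  local output="$2"
  local work="${3:-$(mktemp -d)}"
  local page name

  # 1. One PDF per page, with zero-padded numbers so globs sort in page order.
  run pdftk "$input" burst output "$work/page_%04d.pdf"

  for page in "$work"/page_*.pdf; do
    [ -e "$page" ] || continue          # no pages found: skip the loop body
    name="$(basename "$page" .pdf)"     # e.g. page_0001
    # 2. Render the page as a JPEG (density taken from the reply above).
    run convert -density 175 "$page" "$work/$name.jpg"
    # 3. OCR the JPEG; tesseract appends .pdf to the output base itself.
    run tesseract "$work/$name.jpg" "$work/ocr_${name#page_}" pdf
  done

  # 4. Merge the OCRed single-page PDFs into one file.
  run pdfunite "$work"/ocr_*.pdf "$output"
}

# Run directly when at least an input and an output file are given.
if [ "$#" -ge 2 ]; then
  ocr_pdf "$@"
fi
```

Saved as, say, ocr_pdf.sh, it could be called as `./ocr_pdf.sh test.pdf final.pdf`. Setting `DRY_RUN=1` first only prints the commands, which is a cheap way to check the file names before running OCR on a large scan.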
05-09-2016 07:34 PM
Hi Rene,
I have Alfresco 5.0.d installed on Ubuntu 14.04 using the Alfresco install wizard.
When I try to place an XML bean in shared/classes/alfresco/extension, I can't log in to Alfresco Share and I get Solr errors.
Can you tell me how you integrated Tesseract with Alfresco?
Thanks,
Aadam
