OCR for images, PDFs, etc.

boneill
Star Contributor
Hi Guys,

Does anyone have any advice on how to integrate an OCR service into Alfresco? I understand that OCR is normally done by apps like Kofax, but our client would like to be able to upload an image or scanned PDF and let Alfresco handle the OCR step so that the docs can be found during search.

Any advice, suggestions or experience in this would be greatly appreciated.

Regards

Brian
10 REPLIES

jpotts
World-Class Innovator
Alfresco doesn't provide OCR capabilities out-of-the-box. You might take a look at http://www.ephesoft.com/ and see if that can be of assistance.

The Add-Ons directory also has a number of OCR solutions: http://addons.alfresco.com/search/node/ocr

If you want to roll up your sleeves and do your own integration without relying on an integration that's already been built, you can find various OCR libraries out there. Here's one: http://code.google.com/p/tesseract-ocr/.
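As a rough illustration of the do-it-yourself route, here is a minimal Python sketch that shells out to the tesseract command-line tool and returns the recognized text. It assumes the `tesseract` binary is installed and on the PATH; the helper names `build_tesseract_command` and `ocr_image` are made up for this example and are not part of Alfresco or of tesseract itself.

```python
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for the tesseract command-line tool.

    The CLI form is `tesseract <image> <outputbase> -l <lang>`; tesseract
    appends ".txt" to the output base itself.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text.

    Assumption: the `tesseract` binary is installed and on PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "ocr_result"
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang),
            check=True,           # raise if tesseract exits with an error
            capture_output=True,  # keep tesseract's console output quiet
        )
        # tesseract wrote its result to <output_base>.txt
        return output_base.with_suffix(".txt").read_text(encoding="utf-8")
```

Calling `ocr_image("scan.png")` (a hypothetical file name) would return whatever text tesseract recognizes in that image. Getting that text into Alfresco's index is a separate integration step that this sketch does not cover.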

Jeff

boneill
Star Contributor
Hi Jeff,

Thanks for the response.  This is exactly the information I needed.

Brian

djnemo2
Champ in-the-making
Hi,

Have you tried any of those solutions?

What is the best (even non-free) solution for scanning documents and saving them in Alfresco?
(I think it must be compatible with Alfresco so that it can add metadata/tags to every document added to Alfresco, for search and …)

Thanks

jpotts
World-Class Innovator
Metadata extraction is available out-of-the-box. But if you are uploading an image of the document there is no metadata to extract. You need something to convert the image to machine readable text. That's OCR and is not available out-of-the-box.

Jeff

djnemo2
Champ in-the-making
Is there any third-party software that someone has already used for this?
Something that scans the document, saves it in the right directory on the server based on its contents, and reports which document is where?

Thank you

susannamoore
Champ in-the-making
There are many OCR software packages that can do the work, for example: http://www.rasteredge.com/dotnet-imaging/addon-ocr-sdk/

I generally use RE.OCR.SDK.

The page you linked is for .NET.
Is there a Java integration as well or was this just a spambot?

susannamoore
Champ in-the-making
Hi, Souil

I just tried this, but I'm not sure whether this site provides one for Java integration.
Maybe it is just an image processing library for Java.

jpotts
World-Class Innovator
There was recently some discussion on IRC about this project:
https://code.google.com/p/alfresco-tesseract-search/

Out-of-the-box it was not working with 4.2 but one of our community members did some quick repackaging and got it working on 4.2 in about 30 minutes.

After doing that, he was able to take scanned images, check them in to Alfresco, and then do a full-text search against them. The tesseract OCR piece was responsible for extracting the text from the scanned images and making it available to the indexer.
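To make that flow concrete, here is a toy Python sketch of the idea: text extracted from a scanned image is fed to an index, after which the image can be found by full-text search. This is not how Alfresco's indexer or the alfresco-tesseract-search project is implemented; `ToyFullTextIndex`, `fake_ocr`, and the sample file names and text are all invented for illustration, with `fake_ocr` standing in for the OCR step.

```python
import re
from collections import defaultdict


class ToyFullTextIndex:
    """A tiny inverted index: word -> set of document names."""

    def __init__(self, extract_text):
        # extract_text is a hypothetical callable (document name -> text)
        # standing in for the OCR step.
        self._extract_text = extract_text
        self._postings = defaultdict(set)

    def add(self, name):
        """Extract the document's text and index every word in it."""
        text = self._extract_text(name)
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word].add(name)

    def search(self, query):
        """Return the names of documents containing every word in the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        results = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            results &= self._postings.get(word, set())
        return results


# Stand-in OCR output for two scanned files (made-up sample data).
fake_ocr = {
    "invoice.png": "Invoice 2013 for ACME Ltd",
    "letter.png": "Dear Brian, thank you for your letter",
}.get

index = ToyFullTextIndex(fake_ocr)
index.add("invoice.png")
index.add("letter.png")
```

With this sample data, `index.search("acme invoice")` finds only `invoice.png`. Without the text extraction step there would be nothing to index, so the image would be invisible to search.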

Jeff