topic Re: OCR for images, pdfs etc in Alfresco Archive

OCR for images, pdfs etc

boneill — Fri, 31 Jan 2014 06:51:47 GMT

Hi guys, does anyone have any advice on how to integrate an OCR service into Alfresco? I understand that OCR is normally done by apps like Kofax, but our client would like to be able to upload an image or scanned PDF and let Alfresco handle the OCR step so that the docs can be found during search.

Re: OCR for images, pdfs etc

jpotts — Fri, 31 Jan 2014 16:43:17 GMT

Alfresco doesn't provide OCR capabilities out-of-the-box. You might take a look at http://www.ephesoft.com/ and see if that can be of assistance.

The Add-Ons directory also has a number of OCR solutions: http://addons.alfresco.com/search/node/ocr

If you want to roll up your sleeves and do your own integration without relying on an integration that's already been built, you can find various OCR libraries out there. Here's one: http://code.google.com/p/tesseract-ocr/.
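For anyone weighing the do-it-yourself route, here is a minimal sketch of what calling Tesseract from your own code might look like. It assumes the `tesseract` command-line tool is installed and on the PATH. The function names (`build_tesseract_command`, `ocr_image`) and the file name in the usage note are made up for illustration and are not part of Alfresco or Tesseract.

```python
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argv list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Assumes the `tesseract` binary is installed; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "ocr_result"
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang),
            check=True,
            capture_output=True,
        )
        # Tesseract appends ".txt" to the output base name.
        return (Path(tmp) / "ocr_result.txt").read_text(encoding="utf-8")
```

Something like `text = ocr_image("scan.png")` would then give you plain text to hand to whatever indexes your content. Getting that text into Alfresco's own search index is a separate integration step that this sketch does not cover.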

Jeff

Re: OCR for images, pdfs etc

boneill — Mon, 03 Feb 2014 04:48:31 GMT

Hi Jeff,

Thanks for the response. This is exactly the information I needed.

Brian

Re: OCR for images, pdfs etc

djnemo2 — Wed, 05 Feb 2014 07:41:54 GMT

Hi,

Have you tried any of those solutions?

What is the best (even non-free) solution for scanning documents and saving them in Alfresco?
(I think it must be compatible with Alfresco so that it can add metadata/tags to every document added to Alfresco, for search and …)

Thanks

Re: OCR for images, pdfs etc

jpotts — Wed, 05 Feb 2014 19:34:27 GMT

Metadata extraction is available out-of-the-box. But if you are uploading an image of the document there is no metadata to extract. You need something to convert the image to machine readable text. That's OCR and is not available out-of-the-box.

Jeff

Re: OCR for images, pdfs etc

djnemo2 — Thu, 06 Feb 2014 08:32:55 GMT

Is there any third-party software that someone has already used for this?
Something that scans the document, saves it in the right directory on the server based on its contents, and reports which document is where?

Thank you

Re: OCR for images, pdfs etc

susannamoore — Tue, 18 Feb 2014 09:02:00 GMT

There are many OCR software packages that can do the work, for example: http://www.rasteredge.com/dotnet-imaging/addon-ocr-sdk/

I generally use RE.OCR.SDK.

Re: OCR for images, pdfs etc

scouil — Tue, 18 Feb 2014 09:18:41 GMT

The page you linked is for .NET.
Is there a Java integration as well or was this just a spambot?

Re: OCR for images, pdfs etc

susannamoore — Wed, 19 Feb 2014 02:04:30 GMT

Hi scouil,

I just tried this, but I'm not sure whether the site provides a version for Java integration.
It may only offer an image processing library for Java.

Re: OCR for images, pdfs etc

jpotts — Wed, 19 Feb 2014 22:37:10 GMT

There was recently some discussion on IRC about this project:
https://code.google.com/p/alfresco-tesseract-search/

Out-of-the-box it was not working with 4.2 but one of our community members did some quick repackaging and got it working on 4.2 in about 30 minutes.

After doing that, he was able to take scanned images, check them in to Alfresco, and then do a full-text search against them. The tesseract OCR piece was responsible for extracting the text from the scanned images and making it available to the indexer.
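To illustrate the flow described above (OCR produces text, and the indexer makes that text searchable), here is a toy sketch. It is not how Alfresco or the alfresco-tesseract-search project is implemented. `ToyIndex`, `tokenize`, and the stubbed `fake_ocr` lookup are hypothetical names used only to show why a scanned image becomes searchable once text has been extracted from it.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """A tiny in-memory inverted index: word -> set of document ids."""

    def __init__(self, extract_text):
        # extract_text stands in for the OCR step (image -> plain text).
        self.extract_text = extract_text
        self.postings = defaultdict(set)

    def add(self, doc_id, content):
        """Extract text from the content and index every word in it."""
        for word in tokenize(self.extract_text(content)):
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        results = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            results &= self.postings.get(word, set())
        return results


# Stubbed OCR results keyed by file name (made-up data for illustration).
fake_ocr = {
    "scan-001.png": "Invoice 2014 ACME Corp",
    "scan-002.png": "Meeting notes, February",
}.get

index = ToyIndex(extract_text=fake_ocr)
index.add("doc1", "scan-001.png")
index.add("doc2", "scan-002.png")
print(index.search("acme invoice"))  # {'doc1'}
```

Without the extraction step there are no words to index, which is why a scanned image on its own does not show up in full-text search.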

Jeff

Re: OCR for images, pdfs etc

krutik_jayswal — Sun, 31 Jul 2016 18:10:36 GMT

Adding more information on OCR in Alfresco.

http://www.krutikjayswal.com/2016/07/ocr-on-pdf-file-in-alfresco.html