integrate tesseract ocr into alfresco

tyshan
Champ in-the-making
Hi,

I integrated Tesseract OCR into Alfresco 5.0.d. It works very well and supports the TIFF, PNG, and JPEG image formats.

Now I would like to save the detected text content in Alfresco so that it can be indexed by Solr and searched.

Is there a good solution for this use case?

Thanks in advance.

Tyshan
4 REPLIES

noferdito
Champ in-the-making
Hi!
I'm evaluating Alfresco 5, and I haven't found any tutorial about Tesseract integration for the 5.0.x versions. How did you do the integration?
Thanks in advance,

J.

[Not the Op]
Did you try this tutorial by board member dougalscrp:
http://www.seedim.com.au/content/alfresco-search-pdf-images-using-transformations-and-tesseract-ocr

I haven't integrated OCR into Alfresco myself yet, but it's on my to-do list.

boneill
Star Contributor
Hi Tyshan,

The seedim.com.au tutorial should tell you how it works. Basically, you configure a transformation from each image mimetype (e.g. PNG, TIFF) to text (I assume using the Tesseract transform you have already configured). Then, when an image is uploaded, Solr will try to call the image-to-text transform you have configured to get the word list. The word list is automatically added to the Solr index and points to the image content. Searching will therefore find the image based on the text in the image.
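
For illustration, here is a minimal sketch of the image-to-text step as a wrapper script that such a transformation could call. It is not taken from the tutorial. The script name `ocr_wrapper.py` and the function names are hypothetical, and it assumes the `tesseract` command-line tool is on the PATH. The one real detail it handles is that `tesseract <image> <outputbase>` appends `.txt` to the output base on its own, so the result has to be moved to the exact target file the caller expects.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def build_tesseract_command(source, output_base, lang="eng"):
    """Return the argv list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(source), str(output_base), "-l", lang]


def image_to_text(source, target, lang="eng"):
    """OCR `source` and write the recognized text to exactly `target`."""
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "ocr"
        # Raises CalledProcessError if tesseract exits with a non-zero status.
        subprocess.run(build_tesseract_command(source, output_base, lang), check=True)
        # Tesseract wrote "<output_base>.txt"; move it to the requested target path.
        shutil.move(str(output_base) + ".txt", str(target))


if __name__ == "__main__":
    # Hypothetical usage: python ocr_wrapper.py <source image> <target text file>
    image_to_text(sys.argv[1], sys.argv[2])
```

How the script is registered as a transformation (command, mimetypes, timeouts) depends on your Alfresco version and is covered by the tutorial linked above.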

Hope this helps.

Brian

krutik_jayswal
Elite Collaborator