
OCR Integration with Alfresco

ashpal19
Champ in-the-making
Alfresco currently provides full-text search for textual content. We would also like to search the text inside images, and inside PDFs whose text exists only as images.

We have an existing web service that provides OCR functionality (i.e., it reads a document and returns the extracted text). How do I integrate this existing service with Alfresco?

I need specific pointers on where to start. (I am aware that Alfresco uses Solr to index documents that are marked with the "Is Indexable" aspect.)

Please help.





6 REPLIES

jpotts
World-Class Innovator
One idea would be to write a custom transformer that converts your source mimetype to plain text by calling your existing OCR web service. The full-text indexing mechanism uses transformers when it looks for text it can ingest into the index, so once such a transformer is in place, your content will be indexed.
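As a rough illustration of this idea, here is a minimal sketch of the glue between a transformer and an OCR service. `OcrClient` and `OcrTextBridge` are hypothetical names invented for this example, not Alfresco classes, and the thread does not describe the real service's protocol, so the client is left as an interface. The assumption is that `transformInternal()` would pass in `reader.getContentInputStream()`, `reader.getMimetype()`, and `writer.getContentOutputStream()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical abstraction over the existing OCR web service (assumption:
// it accepts the raw document bytes and returns the recognized text).
interface OcrClient {
    String extractText(byte[] document, String sourceMimetype);
}

// Illustrative helper, not part of Alfresco: copies the source document to
// the OCR service and writes the returned text to the target stream.
final class OcrTextBridge {
    private final OcrClient client;

    OcrTextBridge(OcrClient client) {
        this.client = client;
    }

    void transform(InputStream source, String sourceMimetype, OutputStream target) {
        try {
            byte[] bytes = source.readAllBytes(); // requires Java 9 or later
            String text = client.extractText(bytes, sourceMimetype);
            // Write an empty result rather than failing when nothing was recognized.
            String safeText = (text == null) ? "" : text;
            target.write(safeText.getBytes(StandardCharsets.UTF_8));
            target.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the service call behind an interface means the bridge can be exercised with a fake client before it is wired into a running repository.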

Jeff

ashpal19
Champ in-the-making
Hi Jeff,
Thank you for your suggestion. I agree with the custom content transformer approach, because the content we transform using our third-party service will then be indexed automatically by Alfresco.

I have written a custom transformer class:

package org.alfresco.repo.content.transform;

import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

import org.alfresco.service.cmr.repository.ContentReader;
import org.alfresco.service.cmr.repository.ContentWriter;
import org.alfresco.service.cmr.repository.TransformationOptions;

public class OCRContentTransformer extends AbstractContentTransformer2 {

    @Override
    protected void transformInternal(ContentReader reader, ContentWriter writer,
            TransformationOptions options) throws Exception {
        // Placeholder: only logs that the method was reached; no text is written yet.
        System.out.println("inside the transform internal method and now the index would be updated with the latest content");
        //transformText(reader, writer, options);
    }

    @Override
    public boolean isTransformableMimetype(String sourceMimetype,
            String targetMimetype, TransformationOptions options) {
        // Defers to the superclass; no mimetype restriction is applied here.
        return super.isTransformableMimetype(sourceMimetype, targetMimetype, options);
    }

}

I also reference this class in a custom context file, my-transformers-context.xml (attached), placed under C:\Alfresco\tomcat\shared\classes\alfresco\extension.

I am getting the attached error.

I also referred to the wiki - https://wiki.alfresco.com/wiki/Content_Transformations#Developing_New_Transformations

Note: I have tried this class both with and without overriding isTransformableMimetype().

Your help is highly appreciated.

ashpal19
Champ in-the-making
I have got this to work; please ignore my previous response.

ashpal19
Champ in-the-making
Please look at the attached context file and Java class for the custom content transformer. This transformer is invoked almost every time, even when I click on a file just to view it. I know this because I added System.out.println statements to the isTransformableMimetype() method.

However, the transformInternal() method is never called, and I am not sure why. What am I doing wrong? Can you please tell me whether I am missing some additional configuration, or whether I need to add something more to the custom Java class?

Thank you for your help, I really appreciate it.

jpotts
World-Class Innovator
What are the specific steps you are following to test this?

Jeff

ashpal19
Champ in-the-making
I was testing this through the Alfresco Share UI by uploading PDFs and image files. After struggling for a long time, I worked out the cause: "If two transformers perform the same transformation, the most reliable one will always be chosen." (This is also mentioned in the wiki, which I noticed later.) My custom transformer's transformInternal() method was never used because more reliable transformers already existed for the same transformation.

I then defined a new transformation that does not exist in the current Alfresco configuration (image/jpeg -> text/plain) and tested with that. Now the transformInternal() method is invoked properly. However, I still have another question: why is this transformation called even when I click on a document of that type just to view it? Will the transformation run every time I view the document?
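The restriction to a single source/target pair described above could look something like the following sketch. `OcrMimetypeGate` and `isOcrTransformable` are hypothetical names used only for illustration; in a real transformer the same comparison would presumably sit inside the `isTransformableMimetype()` override, and the mimetype strings are written out literally here so the example stands alone.

```java
// Illustrative stand-in for the check a custom transformer could perform in
// isTransformableMimetype(); not an Alfresco class.
final class OcrMimetypeGate {
    static final String IMAGE_JPEG = "image/jpeg";
    static final String TEXT_PLAIN = "text/plain";

    private OcrMimetypeGate() {
    }

    // Accept only image/jpeg -> text/plain, the pair tested in this thread.
    // Comparing from the constant side keeps the check null-safe.
    static boolean isOcrTransformable(String sourceMimetype, String targetMimetype) {
        return IMAGE_JPEG.equals(sourceMimetype) && TEXT_PLAIN.equals(targetMimetype);
    }
}
```

Returning false for every other pair keeps the custom transformer from being offered for conversions that existing transformers already handle.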

Thank you for your help Jeff and thank you for responding to my query.