Server Side OCR with ABBYY Recognition Server

abruzzi
Champ on-the-rise
Just thought I'd post an FYI for anyone looking to integrate server-side OCR into Alfresco: we've found a pretty good solution in ABBYY Recognition Server.  It doesn't integrate out of the box, but since it provides a SOAP interface, it's pretty easy to whip up a script and a transformer.  Since I know PHP, and PHP has some easy SOAP functionality, I wrote my script in PHP (excuse the ugly code):

<?php

class files_object{
   public $FileName;
   public $FileContents;
}

class ProcessFile{

   public $location = "ogre.co.dac.int";
   public $workflowName;
   public $file;
};



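// Exit codes: 0 = success or help shown, 2 = bad arguments or unreadable input.
// exit() is used because a top-level return does not set the process exit status.
if (isset($argv[1]) && ($argv[1] == "--help" || $argv[1] == "-h")) {
   print "Usage: php ocr.php <source file> <destination file> <ocr workflow name>\n";
   exit(0);
} else {

   // Missing arguments become null so the check below can print the usage line
   $input_file_name = isset($argv[1]) ? $argv[1] : null;
   $output_file_name = isset($argv[2]) ? $argv[2] : null;
   $workflow = isset($argv[3]) ? $argv[3] : null;
   
   if (is_null($input_file_name) || is_null($output_file_name) || is_null($workflow) ) {
      print "Usage: php ocr.php <source file> <destination file> <ocr workflow name>\n";
      exit(2);
   }

   if(!file_exists($input_file_name) || !is_readable($input_file_name)) {
      print "Input file cannot be read or does not exist. Exiting.\n";
      exit(2);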
   } else {

      $input_filehandle = fopen($input_file_name, "r");
      $input_file_content = fread($input_filehandle, filesize($input_file_name));
      fclose($input_filehandle);

      $file = new files_object;
      $file->FileName = basename($input_file_name);
      $file->FileContents = $input_file_content;


      $soap_process = new ProcessFile;

      $soap_process->workflowName = $workflow;
      $soap_process->file = $file;

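      // SoapClient throws SoapFault on WSDL or call failures; report it and exit non-zero
      try {
         $client = new SoapClient("http://ogre.co.dac.int/RecognitionWS/RSSoapService.asmx?wsdl");
         $results = $client->ProcessFile($soap_process);
      } catch (SoapFault $e) {
         print "OCR request failed: " . $e->getMessage() . "\n";
         exit(1);
      }

      // Assumes the workflow returns exactly one output document containing one file
      $container = $results->ProcessFileResult->InputFiles->InputFile->OutputDocuments->OutputDocument->Files->FileContainer;
      $content = $container->FileContents;
      $name = $container->FileName; // name chosen by the server; unused, output goes to $output_file_name

      if (file_put_contents($output_file_name, $content) === false) {
         print "Could not write output file. Exiting.\n";
         exit(1);
      }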
   }
}
?>

This script takes the command line:

php ocr.php {source file} {target file} {workflow name}

(Note: Recognition Server can define multiple workflows.  Currently I have OCRtoPDF, which returns a PDF document, and OCRtoTXT, which returns plain text.)
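
One thing to watch if a workflow ever returns more than one output document or file: PHP's SoapClient hands back a single object for a lone element but an array when the element repeats, so the long `->` chain in the script would stop working. The following is only a rough sketch of a more defensive lookup. The function names `children` and `first_file_container` are made up for illustration, and the element names are simply the ones the script above already reads from the response.

```php
<?php
// Hypothetical helper: return $node->$wrapper->$item as an array,
// whether SoapClient decoded it as one object, an array, or nothing.
function children($node, $wrapper, $item) {
    if (!isset($node->$wrapper->$item)) {
        return array();
    }
    $value = $node->$wrapper->$item;
    return is_array($value) ? $value : array($value);
}

// Hypothetical helper: walk the ProcessFile response and return the
// first FileContainer found, or null if the response contains none.
function first_file_container($results) {
    if (!isset($results->ProcessFileResult)) {
        return null;
    }
    foreach (children($results->ProcessFileResult, 'InputFiles', 'InputFile') as $input) {
        foreach (children($input, 'OutputDocuments', 'OutputDocument') as $doc) {
            foreach (children($doc, 'Files', 'FileContainer') as $container) {
                return $container;
            }
        }
    }
    return null;
}
```

In the script, `$container = first_file_container($results);` followed by a null check could then replace the two long property chains, with `$container->FileContents` written to the target file as before.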

Then we simply created a new context file (this example is for tiff->pdf and tiff->txt):


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>

   <bean id="transformer.TIFF.OCR" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key="Linux">
                        <value>php /srv/alfresco/bin/ocr.php ${source} ${target} OCRtoPDF</value>
                    </entry>
                </map>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
                <property name="sourceMimetype"><value>image/tiff</value></property>
                <property name="targetMimetype"><value>application/pdf</value></property>
            </bean>
         </list>
      </property>
   </bean>

   <bean id="transformer.TIFF.TXT" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key="Linux">
                        <value>php /srv/alfresco/bin/ocr.php ${source} ${target} OCRtoTXT</value>
                    </entry>
                </map>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
                <property name="sourceMimetype"><value>image/tiff</value></property>
                <property name="targetMimetype"><value>text/plain</value></property>
            </bean>
         </list>
      </property>
   </bean>

</beans>


We obviously have other MIME conversions, but this is the core of it.  The nice thing is that the image-to-text transforms mean that any time a JPEG, TIFF, or PNG is uploaded, the image->txt conversion fires off and indexes any text found in the image while leaving the document intact in its original version.
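
For anyone wondering what the other image conversions might look like, here is a rough sketch of a JPEG/PNG-to-text transformer that follows the same pattern as the TIFF beans. It is an illustration only: the bean id `transformer.IMAGE.TXT` is made up, and it assumes the OCRtoTXT workflow on your Recognition Server accepts JPEG and PNG input.

```xml
<!-- Hypothetical bean id; place this inside the <beans> element of the context file -->
<bean id="transformer.IMAGE.TXT" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
   <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec">
         <property name="commandMap">
            <map>
               <entry key="Linux">
                  <value>php /srv/alfresco/bin/ocr.php ${source} ${target} OCRtoTXT</value>
               </entry>
            </map>
         </property>
      </bean>
   </property>
   <property name="explicitTransformations">
      <list>
         <!-- One entry per source MIME type; the class name is spelled exactly as in the TIFF beans -->
         <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
            <property name="sourceMimetype"><value>image/jpeg</value></property>
            <property name="targetMimetype"><value>text/plain</value></property>
         </bean>
         <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
            <property name="sourceMimetype"><value>image/png</value></property>
            <property name="targetMimetype"><value>text/plain</value></property>
         </bean>
      </list>
   </property>
</bean>
```

Listing several source types in one bean keeps the command in a single place; separate beans per type, as in the TIFF example, would work the same way.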

Recognition Server seems to have pretty good accuracy and can be spread over multiple systems to speed things up.  The other benefit is that it is surprisingly inexpensive.  Since Kofax client-side OCR is the only Alfresco "supported" OCR, hopefully ABBYY RS will be a good alternative if you need or prefer server-side OCR.

Geof
2 REPLIES

dranakan
Champ on-the-rise
Hello,

I have read some posts about ABBYY OCR and I have a few questions.

Have you tried this on 64-bit Linux (Red Hat)? Does it work?

I would like to create a rule that transforms a PDF (or TIFF) into a readable PDF without creating a second document. Is it possible? How can I customize the Web UI to do that?

Thank you

wmay
Champ in-the-making
Hi,

We have implemented an OCR server integrated with Alfresco, which can be used as a transformer or via JavaScript and Java. It runs on a separate OCR server and supports ABBYY and Google OCR. For more information, see here: https://forums.alfresco.com/en/viewtopic.php?f=33&t=44739