cancel
Showing results for 
Search instead for 
Did you mean: 

OCR In Alfresco

ananthu
Champ in-the-making
Champ in-the-making
Hi,

How we can search Scanned files in Alfresco. is there any way to configure the Alfresco.
4 REPLIES 4

sujaypillai
Confirmed Champ
Confirmed Champ
Your scanned files should be machine readable text which can be achieved using any OCR client.
Refer these blogs -
https://tpeelen.wordpress.com/2010/12/17/alfresco-using-tesseract-ocr-on-ubuntu-linux
https://www.surevine.com/a-little-alfresco-tesseract-ocr-integration

maotsu
Champ in-the-making
Champ in-the-making
the tutorial shown at https://www.surevine.com/a-little-alfresco-tesseract-ocr-integration/ seems obsolete. so i want to know if there is an updated tutorial for windows users??

samudaya
Champ on-the-rise
Champ on-the-rise
Hi All,

What is your experience and opinion for simple OCR automate tool for Alfresco Community Edition 5.0.d? I want to upload just scan file and I want to search by content? 

urielefren
Champ in-the-making
Champ in-the-making
Hi, I wonder if anyone know the correct configuration for alfresco community 5.0.d for OCR integration (tesseract). I searched on web and I make my xml files for make transformations and indexing in alfresco search engine. It works for tiff files but other formats it doesn't work like (png, jpg, bmp, pdf , etc.)

I saw in the alfresco.log, shared.log and solr4.log and there's not error displayed so I can think my configuration is correct, I guess. I attached my files to this post.


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>

  <bean id="transformer.OCRToText" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
      <property name="worker">
         <ref bean="transformer.worker.OCRToText" />
      </property>
  </bean>

  <bean id="transformer.worker.OCRToText" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
      <property name="mimetypeService">
         <ref bean="MimetypeService"/>
      </property>

      <property name="checkCommand">
        <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
          <property name="commandsAndArguments">
            <map>
              <entry key="Mac OS X">
                <list>
                  <value>/usr/bin/python</value>
                  <value>–version</value>
                </list>
              </entry>
            </map>
          </property>
          <property name="errorCodes">
            <value>1</value>
          </property>
        </bean>
      </property>

      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandsAndArguments">
                <map>
                    <entry key="Mac OS X">
                        <list>
                            <value>/usr/bin/python</value>
                            <value>/Applications/alfresco-5.0.d/bin/ocr-simple.py</value>
                            <value>${source}</value>
                            <value>${target}</value>
                        </list>
                    </entry>
                </map>
            </property>
         </bean>
      </property>

    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>image/tiff</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>image/png</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>image/jpg</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>image/bmp</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>application/pdf</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>

      </list>
    </property>

  </bean>
</beans>

and here's my script:


#!/usr/bin/python
#author: http://blyx.com/2010/11/30/integracion-de-ocr-en-alfresco/
#modified by : Uriel Efren Carballido Acosta
from os import popen
from string import split,join
from pprint import *
import re
import sys

def uniq(seq):
    # Not order preserving
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()


# Tamano maximo
maxstr  = 3

print(sys.argv[1])

commadArgs = ""
b = None
newFile = ""
if ".tif" in sys.argv[1]:
   b = True
   newFile = sys.argv[1]
elif ".png" or ".jpg" or ".jpeg" or ".bmp" or ".pdf" in sys.argv[1]:
   path = sys.argv[1].rsplit('/', 1)
   fileN = ""
   direction = ""
   if len(path) > 1:
      fileN = path[1];
      direction = path[0]+"/"
   else:
      fileN = sys.argv[1];
      direction = ""
   partF = fileN.rsplit('.', 1)
   newFile = "/tmp/"+partF[0]+'.tif'#direction+partF[0]+'.tif'
   print(newFile)


   command = popen('/usr/local/bin/convert -resize 400% -type Grayscale '+sys.argv[1]+' '+newFile) #usando convert para convertir imagenes a tif
   b = True

if b:
   command = popen('/usr/local/bin/tesseract -l spa '+newFile+' /tmp/tesser-$$ 2> /dev/null; cat /tmp/tesser-$$.txt')
   #command = popen('/usr/local/bin/tesseract -l spa+eng '+newFile+' /tmp/tesser-$$ 2> /dev/null; cat /tmp/tesser-$$.txt') #usando tesseract
   
   lines   = command.readlines()
   zz = open("/tmp/ocr.log","w")
   zz.write(sys.argv[1]+"\n")  
   zz.write(join(uniq(lines), " "))

   # Palabras unicas
   outputf = open(sys.argv[2],"w")
   outputf.write(join(uniq(lines), " "))
   return sys.argv[2]
else:
   print("archivo no soportado")
   zz = open("/tmp/ocr.log","w")
   zz.write("File not compatible for OCR transformation:::"+sys.argv[1]+"\n")


The script works fine, I tested in my terminal so the files are generated correctly and I can see the txt file, but in alfresco something happens when the reindexing try to trigger the script for the others file (the only type files which works fine are tiff and tif).

I know this configuration is for Alfresco versions which have the indexing Lucene engine and in Solr4 doesn't work but I cannot find information for this purpose for the alfresco community 5.x and solr4.

I anyone knows something please shared It will be very appreciated.

I apologize for my bad english Smiley Tongue.

Regards…