Hyland Connect

ananthu · ‎07-16-2015

Hi,

How can we search scanned files in Alfresco? Is there any way to configure Alfresco to do this?

sujaypillai · ‎07-17-2015

Your scanned files need to contain machine-readable text, which can be produced with any OCR client.
Refer to these blogs:
https://tpeelen.wordpress.com/2010/12/17/alfresco-using-tesseract-ocr-on-ubuntu-linux
https://www.surevine.com/a-little-alfresco-tesseract-ocr-integration
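
As a rough illustration of what "making a scan machine readable" means in practice, here is a minimal sketch that calls the Tesseract command-line tool on an image and reads back the text it produces. It assumes `tesseract` is installed and on the PATH; the helper names `tesseract_command` and `ocr_to_text` are hypothetical and are not part of Alfresco or of the blogs linked above.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the CLI call; tesseract writes its result to output_base + '.txt'."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_to_text(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognised text."""
    # Assumption: the tesseract binary is installed and reachable via PATH.
    subprocess.check_call(tesseract_command(image_path, output_base, lang))
    with open(output_base + ".txt") as f:
        return f.read()
```

Calling `ocr_to_text("scan.tif", "/tmp/scan", lang="spa")` would, under those assumptions, return the text found in `scan.tif`; that text is what the search index needs in order to find the document.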

maotsu · ‎12-27-2015

The tutorial at https://www.surevine.com/a-little-alfresco-tesseract-ocr-integration/ seems obsolete. Is there an updated tutorial for Windows users?

samudaya · ‎07-01-2016

Hi All,

What is your experience with, and opinion of, simple automated OCR tools for Alfresco Community Edition 5.0.d? I just want to upload a scanned file and then search it by content.

urielefren · ‎01-08-2016

Hi, I wonder if anyone knows the correct configuration for OCR integration (Tesseract) in Alfresco Community 5.0.d. I searched the web and wrote my XML files to run the transformations and feed the Alfresco search index. It works for TIFF files, but not for other formats (PNG, JPG, BMP, PDF, etc.).

I looked in alfresco.log, shared.log and solr4.log and no errors are displayed, so I assume my configuration is correct. I attached my files to this post.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>

<bean id="transformer.OCRToText" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
      <property name="worker">
         <ref bean="transformer.worker.OCRToText" />
      </property>
</bean>

<bean id="transformer.worker.OCRToText" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
      <property name="mimetypeService">
         <ref bean="MimetypeService"/>
      </property>

      <property name="checkCommand">
        <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
          <property name="commandsAndArguments">
            <map>
              <entry key="Mac OS X">
                <list>
                  <value>/usr/bin/python</value>
                  <value>--version</value>
                </list>
              </entry>
            </map>
          </property>
          <property name="errorCodes">
            <value>1</value>
          </property>
        </bean>
      </property>

      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandsAndArguments">
                <map>
                    <entry key="Mac OS X">
                        <list>
                            <value>/usr/bin/python</value>
                            <value>/Applications/alfresco-5.0.d/bin/ocr-simple.py</value>
                            <value>${source}</value>
                            <value>${target}</value>
                        </list>
                    </entry>
                </map>
            </property>
         </bean>
      </property>

    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>image/tiff</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>image/png</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>image/jpeg</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>image/bmp</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>application/pdf</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>

      </list>
    </property>

</bean>
</beans>
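
One way to sanity-check a transformer context file like the one above is to list every declared source mimetype and compare it with the types the repository is expected to know. The sketch below does this for a small inline sample; `KNOWN_MIMETYPES` and `SAMPLE` are illustrative assumptions, not Alfresco's actual mimetype registry or the file attached to this post. Note that `image/jpeg` is the registered type for JPEG files, so a value such as `image/jpg` would be flagged.

```python
import xml.etree.ElementTree as ET

# Assumption: a small illustrative set, not Alfresco's real mimetype registry.
KNOWN_MIMETYPES = {"image/tiff", "image/png", "image/jpeg",
                   "image/bmp", "application/pdf"}

# Hypothetical, trimmed-down context file used only for this example.
SAMPLE = """
<beans>
  <bean id="transformer.worker.OCRToText">
    <property name="explicitTransformations">
      <list>
        <bean>
          <property name="sourceMimetype"><value>image/tiff</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
        <bean>
          <property name="sourceMimetype"><value>image/jpg</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
      </list>
    </property>
  </bean>
</beans>
"""


def source_mimetypes(xml_text):
    """Return every sourceMimetype value declared in the context XML."""
    root = ET.fromstring(xml_text)
    return [prop.findtext("value").strip()
            for prop in root.iter("property")
            if prop.get("name") == "sourceMimetype"]


def unknown_mimetypes(xml_text, known=KNOWN_MIMETYPES):
    """Return declared source mimetypes that are not in the known set."""
    return [m for m in source_mimetypes(xml_text) if m not in known]
```

Running `unknown_mimetypes` on a real context file (with the DOCTYPE line removed if the parser complains) gives a quick list of entries worth double-checking before looking at the indexing side.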

and here's my script:

#!/usr/bin/python
# author: http://blyx.com/2010/11/30/integracion-de-ocr-en-alfresco/
# modified by: Uriel Efren Carballido Acosta
import os
import subprocess
import sys


def uniq(seq):
    # Not order preserving
    return list(set(seq))


source, target = sys.argv[1], sys.argv[2]
print(source)

name = source.lower()
new_file = None
if name.endswith((".tif", ".tiff")):
    # tesseract can read TIFF files directly
    new_file = source
elif name.endswith((".png", ".jpg", ".jpeg", ".bmp", ".pdf")):
    # Use convert (ImageMagick) to turn other formats into TIFF first
    base = os.path.splitext(os.path.basename(source))[0]
    new_file = os.path.join("/tmp", base + ".tif")
    print(new_file)
    subprocess.call(["/usr/local/bin/convert", "-resize", "400%",
                     "-type", "Grayscale", source, new_file])

log = open("/tmp/ocr.log", "w")
if new_file:
    # tesseract appends ".txt" to the output base name
    out_base = "/tmp/tesser-%d" % os.getpid()
    with open(os.devnull, "w") as devnull:
        # Use "spa+eng" instead of "spa" for Spanish plus English
        subprocess.call(["/usr/local/bin/tesseract", new_file, out_base,
                         "-l", "spa"], stderr=devnull)
    lines = []
    if os.path.exists(out_base + ".txt"):
        with open(out_base + ".txt") as f:
            lines = f.readlines()

    # Unique lines, joined into a single string
    text = " ".join(uniq(lines))
    log.write(source + "\n")
    log.write(text)
    with open(target, "w") as outputf:
        outputf.write(text)
else:
    print("unsupported file")
    log.write("File not compatible for OCR transformation:::" + source + "\n")
log.close()

The script works fine when I test it in my terminal: the files are generated correctly and I can see the txt file. In Alfresco, however, something goes wrong when reindexing tries to trigger the script for the other file types (the only file types that work are tiff and tif).
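
To reproduce in a terminal something closer to how the repository would invoke the script, the `${source}` and `${target}` placeholders from the `transformCommand` above can be expanded by hand. The sketch below only approximates that substitution with Python's `string.Template`; it is not Alfresco's `RuntimeExec` implementation, and `build_command` and the example file paths are hypothetical.

```python
from string import Template

# Command template copied from the transformCommand in the configuration.
COMMAND_TEMPLATE = [
    "/usr/bin/python",
    "/Applications/alfresco-5.0.d/bin/ocr-simple.py",
    "${source}",
    "${target}",
]


def build_command(source, target, template=COMMAND_TEMPLATE):
    """Expand ${source} and ${target} in each part of the command template."""
    values = {"source": source, "target": target}
    # safe_substitute leaves parts without placeholders unchanged.
    return [Template(part).safe_substitute(values) for part in template]
```

The resulting list can be passed to `subprocess.call`, ideally as the same operating-system user that runs Alfresco, since that user's PATH and file permissions may differ from those of an interactive terminal session.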

I know this configuration is meant for Alfresco versions that use the Lucene indexing engine and that it does not work with Solr4, but I cannot find information on how to do this for Alfresco Community 5.x with Solr4.

If anyone knows something, please share it. It would be much appreciated.

I apologize for my bad English.

Regards…

OCR In Alfresco