What is your experience with, and opinion of, a simple automated OCR tool for Alfresco Community Edition 5.0.d? I just want to upload scanned files and search them by content.
Hi, I wonder if anyone knows the correct configuration for OCR integration (Tesseract) in Alfresco Community 5.0.d. I searched the web and wrote my XML files to set up the transformations and indexing in the Alfresco search engine. It works for TIFF files, but not for other formats (PNG, JPG, BMP, PDF, etc.).
I looked at alfresco.log, shared.log and solr4.log and no errors are displayed, so I assume my configuration is correct. I attached my files to this post.
#!/usr/bin/python
# author: http://blyx.com/2010/11/30/integracion-de-ocr-en-alfresco/
# modified by: Uriel Efren Carballido Acosta
from os import popen
from string import split, join
from pprint import *
import re
import sys


def uniq(seq):
    # Not order preserving
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()


# Maximum size
maxstr = 3

print(sys.argv[1])

commadArgs = ""
b = None
newFile = ""
if ".tif" in sys.argv[1]:
    b = True
    newFile = sys.argv[1]
# Test each extension separately: '".png" or ".jpg" ... in x' is always true
elif any(ext in sys.argv[1] for ext in (".png", ".jpg", ".jpeg", ".bmp", ".pdf")):
    path = sys.argv[1].rsplit('/', 1)
    fileN = ""
    direction = ""
    if len(path) > 1:
        fileN = path[1]
        direction = path[0] + "/"
    else:
        fileN = sys.argv[1]
        direction = ""
    partF = fileN.rsplit('.', 1)
    newFile = "/tmp/" + partF[0] + '.tif'  # direction+partF[0]+'.tif'
    print(newFile)
    # Use convert to turn the image into a TIFF
    command = popen('/usr/local/bin/convert -resize 400% -type Grayscale ' + sys.argv[1] + ' ' + newFile)
    command.close()  # wait for convert to finish before the TIFF is read
    b = True

# [...]

    # Unique words
    outputf = open(sys.argv[2], "w")
    outputf.write(join(uniq(lines), " "))
    return sys.argv[2]
else:
    print("unsupported file")
    zz = open("/tmp/ocr.log", "w")
    zz.write("File not compatible for OCR transformation:::" + sys.argv[1] + "\n")
    zz.close()

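
One detail worth checking in a script like this is how the file type is detected. In Python, a chain such as `".png" or ".jpg" in name` is always true, because the non-empty string `".png"` is evaluated on its own, and a substring test like `".tif" in name` also matches directory names. Below is a minimal sketch of an explicit extension check using `os.path.splitext`. The helper names `classify` and `tiff_target` are hypothetical and not part of the original script; the extension lists simply mirror the ones mentioned in the post.

```python
import os

# Hypothetical helpers; the extension sets mirror the ones named in the post.
TIFF_EXTS = {".tif", ".tiff"}
CONVERTIBLE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".pdf"}


def classify(path):
    """Return "tiff", "convert", or "unsupported" based on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in TIFF_EXTS:
        return "tiff"
    if ext in CONVERTIBLE_EXTS:
        return "convert"
    return "unsupported"


def tiff_target(path, tmp_dir="/tmp"):
    """Build the path of the temporary TIFF that the conversion step would write."""
    base = os.path.splitext(os.path.basename(path))[0]
    return os.path.join(tmp_dir, base + ".tif")
```

With this, the script could branch on `classify(sys.argv[1])` and reach the "unsupported" branch only when the extension really is unknown.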
The script works fine: I tested it in my terminal, the files are generated correctly, and I can see the txt file. In Alfresco, however, something goes wrong when reindexing tries to trigger the script for the other file types (the only ones that work are tiff and tif).
I know this configuration is meant for Alfresco versions that use the Lucene indexing engine and that it doesn't work with Solr4, but I cannot find information on how to do this for Alfresco Community 5.x and Solr4.
If anyone knows something, please share it. It will be very much appreciated.
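
Because the script behaves differently under Alfresco than in a terminal, it may help to record exactly how it is invoked in each case. The following is only a sketch: the function name `log_invocation` and the log path `/tmp/ocr-debug.log` are assumptions for illustration, not part of the original setup. It appends one line per call with the working directory, the arguments received, and whether the source file exists at that moment.

```python
import os
import sys
import time


def log_invocation(argv, log_path="/tmp/ocr-debug.log"):
    """Append one line describing how the script was invoked and return it."""
    source = argv[1] if len(argv) > 1 else ""
    entry = "%s cwd=%s args=%r source_exists=%s\n" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        os.getcwd(),
        argv[1:],
        os.path.exists(source),
    )
    # Append mode keeps the entries written by earlier invocations.
    with open(log_path, "a") as log:
        log.write(entry)
    return entry
```

Calling `log_invocation(sys.argv)` at the top of the OCR script and comparing the entries from a terminal run with those written during reindexing would show whether the script is called at all for PNG, JPG, or PDF sources, and with which file names.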