How can I index scanned document image files based on content?

kishore
Champ in-the-making
Hi all,

My understanding is that Alfresco indexes PDF files differently from plain text files: it converts the PDF to text and indexes that generated text. My documents, however, are scanned images in TIFF format.

How can I get Alfresco to index these image documents? Specifically, how can I link the text files generated by OCR to the images so that Alfresco indexes them the same way it does for PDF files?

        Does anybody have comments and suggestions?

Thanks
Kishore
6 REPLIES

pavan_kumar
Champ in-the-making
Hi,

I have a similar problem. I have a bunch of TIFFs that will be stored in the repository, but I can't run a content search on them because no search index is built for their content. As far as I can tell, right now we can only search on the document name.

You are right that documents are indexed by converting them to text, and that the index entries hold pointers back to the documents, which a search query then uses.
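
To make that idea concrete, here is a minimal sketch of a text index that points back to the source documents. It is not how Alfresco or its search engine is implemented; the file names, the sample text, and the `build_index` and `search` helpers are all made up for illustration.

```python
import re
from collections import defaultdict


def build_index(extracted_text):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for doc_name, text in extracted_text.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_name)
    return index


def search(index, word):
    """Return the sorted names of the documents whose text contains the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical OCR output, keyed by the scanned file it was extracted from.
ocr_text = {
    "claim_001.tif": "Patient claim for dental treatment",
    "claim_002.tif": "Patient claim for eye examination",
}
index = build_index(ocr_text)
print(search(index, "dental"))  # prints ['claim_001.tif']
```

The point is that the search only ever looks at the extracted text; the image itself is just the thing the index entry points to.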

We probably need to find a repository service that provides indexing and can link the indexed text to a document.

I will update the thread if I find a solution.

Any help from others is appreciated.

Cheers,
Pavan Kumar

sam69
Champ in-the-making
Hi !

Several OCR packages can be linked to Alfresco.
For example: http://wiki.alfresco.com/wiki/Tiger_OCR_integration
There are others, but I can't find the web page.

Cheers,

Sam

kishore
Champ in-the-making
Hi

What I mean is: I am already using other OCR software, so how can I link it to Alfresco?

Thanks
kishore

dschmalz
Champ in-the-making
I see at least two ways of achieving your goal:

1) Build an Alfresco Content Package (ACP), which is in fact a zip file containing your PDF files plus one file that defines the metadata. You will have to write a converter between your "custom" index file format and Alfresco XML. Try exporting a space and looking at the output to get an idea.

2) Create a space and define a custom action that parses the index files (.txt in your case?), reads all the referenced PDF files (which you will have put in the same space beforehand), and attaches the metadata on the fly.

Solution 1 is more elegant and gives you more control in case of errors. Solution 2 is easier; take a look at the "Custom Action" project in the Alfresco SDK to start writing custom actions.

David

johnwehall
Champ in-the-making
I just got approach 1 working for OCR with AnyDoc 3.2 on Alfresco 2.0. I use it to import medical claims images. Here's how it works:

1. A script watches the AnyDoc output directory. When it finds a TXT output file, it reads the file, counts the records, and compares them to the number and names of the images in the image output directory.
2. If these match, I read in an XML template (based on the ACP XML schema for my custom claims content type) and replace its values with the values from the AnyDoc output (once for each image).
3. Then I write out the XML, copy over the image files, zip the whole thing up into an ACP, and move the ACP into a CIFS directory that has an action to import an ACP into my claims space.

Right now the whole thing is just a proof of concept. I'd like to move the whole process into the Alfresco environment and then figure out a robust way to handle and report errors.
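
For anyone who wants to try the same thing, the three steps above could be sketched roughly like this in Python. This is only an illustration under several assumptions, not the actual script: the `image_name|claim_number` record format, the `claims.xml` file name, and the XML element names are all placeholders. The real metadata XML has to follow what you see when you export a space from your own Alfresco as an ACP.

```python
import zipfile
from pathlib import Path
from string import Template
from xml.sax.saxutils import escape

# Hypothetical per-image XML fragment. The real element names must be copied
# from an ACP exported from your own Alfresco space; these are placeholders.
ENTRY_TEMPLATE = Template(
    "<claim><file>$image</file><claimNumber>$claim_number</claimNumber></claim>"
)


def read_records(txt_path):
    """Parse OCR output, assumed to hold one 'image_name|claim_number' per line."""
    records = []
    for line in Path(txt_path).read_text().splitlines():
        if line.strip():
            image, claim_number = line.strip().split("|")
            records.append({"image": image, "claim_number": claim_number})
    return records


def build_acp(txt_path, image_dir, acp_path):
    """Zip metadata XML and images together, but only if records match images."""
    records = read_records(txt_path)
    expected = {r["image"] for r in records}
    found = {p.name for p in Path(image_dir).iterdir() if p.is_file()}
    if expected != found:
        # Step 1: refuse to build when record names and image names disagree.
        raise ValueError(f"records and images differ: {sorted(expected ^ found)}")
    # Step 2: fill the template once per image, escaping XML special characters.
    entries = "".join(
        ENTRY_TEMPLATE.substitute({k: escape(v) for k, v in r.items()})
        for r in records
    )
    # Step 3: write the XML and the images into a single zip archive.
    with zipfile.ZipFile(acp_path, "w", zipfile.ZIP_DEFLATED) as acp:
        acp.writestr("claims.xml", f"<claims>{entries}</claims>")
        for name in sorted(expected):
            acp.write(Path(image_dir) / name, arcname=name)
    return len(records)
```

Moving the finished archive into the watched CIFS folder is then an ordinary file move (for example with `shutil.move`). Raising an error on a mismatch, rather than importing a partial batch, keeps a bad OCR run from ever reaching the repository.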

syedmeeran
Champ in-the-making
Hi,
     I have installed Alfresco 4.2 on my system. I have added some documents and they are listed. But how can I index all those documents? If anybody has an idea, please reply. Also, how do I integrate OCR with version 4.2?

Thanks

Syed