Auto population from OCR document in to Content Model

chaitanya — Tue, 24 Apr 2012 09:56:17 GMT

Hi, we have built a small application using Alfresco to abstract legal documents. To capture data after abstraction (on a page that contains text boxes, text areas, etc.), we have written our own content model and services for the business logic. The model and the services are incorporated in to Alfresc

Re: Auto population from OCR document in to Content Model

cnerger — Tue, 24 Apr 2012 12:23:55 GMT

Hi,

We have done something similar, but when we designed the client's tools we decided to do it externally, during pre-processing. For your need, you would have to add Tesseract-like software to work with Alfresco (community users have shared various pieces of code for that) and then use the result to populate your content model.

So we decided to do it in pre-processing: we extract the data and embed it as metadata in the PDF file, then a custom metadata extractor populates an aspect that is assigned automatically by a folder rule.

Hope it helps.
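To illustrate the extraction part of that pre-processing step, here is a minimal sketch in Python. It assumes the OCR text is already available as a string; the field names (`legal:caseNumber`, `legal:effectiveDate`) and the regular expressions are hypothetical and would need to be tuned to your own documents and content model. Writing the resulting values into the PDF metadata and mapping them in an Alfresco metadata extractor are not shown.

```python
import re

# Hypothetical field patterns: the property names and the regular expressions
# are illustrative assumptions, not part of any Alfresco content model.
FIELD_PATTERNS = {
    "legal:caseNumber": re.compile(
        r"Case\s+No\.?\s*[:\-]?\s*([A-Z0-9\-/]+)", re.IGNORECASE
    ),
    "legal:effectiveDate": re.compile(
        r"Effective\s+Date\s*[:\-]?\s*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE
    ),
}


def extract_fields(ocr_text):
    """Return a dict of field name -> first matching value found in the OCR text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            fields[name] = match.group(1).strip()
    return fields


# Example OCR output (made up for illustration).
sample = "LEASE AGREEMENT\nCase No: AB-2012/044\nEffective Date: 24/04/2012\n"
metadata = extract_fields(sample)
print(metadata)
```

Fields that do not match are simply left out of the result, so the later steps can decide whether a missing value should block the import or be filled in by hand.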

Cédric

Re: Auto population from OCR document in to Content Model

cnerger — Tue, 24 Apr 2012 13:44:45 GMT

Re,

I hadn't read your last question. It will depend on:

   - the accuracy/training of your OCR system

   - the way you parse your search and the results (you will need a large number of files to test it, at least 1000, to get significant results).
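One way to check the two points above against a test set is to compare the extracted values with hand-labeled values, field by field. This is only a sketch in Python: the function name `field_accuracy` and the data layout (one dict per document) are assumptions made for illustration.

```python
def field_accuracy(predictions, ground_truth):
    """Per-field accuracy: share of labeled documents where the extracted
    value equals the hand-labeled value exactly."""
    totals = {}
    correct = {}
    for predicted, expected in zip(predictions, ground_truth):
        for field, value in expected.items():
            totals[field] = totals.get(field, 0) + 1
            if predicted.get(field) == value:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / totals[field] for field in totals}


# Tiny made-up example; a real test would use a much larger set of files.
predicted = [{"caseNumber": "AB-1", "date": "01/02/2012"}, {"caseNumber": "AB-2"}]
labeled = [{"caseNumber": "AB-1", "date": "01/02/2013"}, {"caseNumber": "AB-2", "date": "05/06/2012"}]
scores = field_accuracy(predicted, labeled)
print(scores)
```

A missing field counts as an error here, which is usually what you want when the goal is to populate the content model without manual correction.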

Honestly, I've been working for a year on open-source software for archive storage (in Alfresco), and you really need a very good scanner or capture device!
Getting data out of an unstructured document is quite hard because of the amount of noise.

I think you should consider an external process, which is "trivial" compared with implementing this kind of thing inside Alfresco.

my 2 cents

Cédric

Re: Auto population from OCR document in to Content Model

samudaya — Mon, 09 Jul 2012 05:08:36 GMT

Hi all,

We can use a good OCR-enabled scanner or system to convert our image-based documents to text. However, the problem is that most of the time we have hundreds of distributed users. In that case it is not feasible, because not every user can have an OCR system. So in such a case a centralized OCR technique available within Alfresco would be very useful.

Thanks
SAMU

Re: Auto population from OCR document in to Content Model

wmay — Wed, 01 Aug 2012 14:26:30 GMT

Hi,

We have implemented an OCR server integrated with Alfresco, which can be used as a transformer or via JavaScript and Java. It runs on a separate OCR server and supports Abbyy and Google OCR. For more information see here - https://forums.alfresco.com/en/viewtopic.php?f=33&t=44739