Hyland Connect

chaitanya · ‎04-24-2012

Hi,

We have built a small application using Alfresco to abstract legal documents. To capture data after abstraction (on a page that contains text boxes, text areas, etc.) we have written our own content model and services for the business logic. The model and the services are incorporated into Alfresco core. No workflow is created, but logic is built around changing properties of documents. All properties are defined in the content model.

Now we want to integrate this with an OCR tool. We want Alfresco to pick up the OCR documents and batch them based on certain input criteria (similar to a query), and also auto-populate some of the contents from the OCR document (unstructured) into the pages created using the content model.

I want to understand if this is possible (batching, auto-population) in Alfresco. If someone has achieved this, please share your experience of the accuracy of the auto-populated data and how successful the implementation has been, especially when reading unstructured documents (OCRed PDF, TIFF).

Thanks
Chaitanya

cnerger · ‎04-24-2012

Hi,

We have done the same kind of thing, but when we designed the client's tools we decided to do it externally, during pre-processing. For your need you would have to add Tesseract-like software to work with Alfresco (community users have shared various pieces of code for that) and then use the result to populate your content model.

So we decided to do that in pre-processing: we extract the data and embed it as metadata in the PDF file, then use a custom metadata extractor to populate an aspect that is assigned automatically by a folder rule.
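
A minimal sketch of the extraction part of such a pre-processing step, assuming the OCR tool has already produced plain text. The property names (`legal:agreementDate` and so on) and the regular expressions are hypothetical, made up for illustration; a real content model and real documents would need their own patterns. Writing the values into the PDF metadata and the Alfresco metadata extractor itself are not shown.

```python
import re

# Hypothetical mapping from content-model property names to patterns.
# Neither the names nor the patterns come from a real Alfresco model.
FIELD_PATTERNS = {
    "legal:agreementDate": re.compile(
        r"dated\s+(\d{1,2}\s+\w+\s+\d{4})", re.IGNORECASE),
    "legal:partyName": re.compile(
        r"between\s+(.+?)\s+and\b", re.IGNORECASE | re.DOTALL),
    "legal:referenceNumber": re.compile(
        r"Ref(?:erence)?\s*(?:No\.?)?\s*[:#]\s*([A-Z0-9/-]+)", re.IGNORECASE),
}


def extract_properties(ocr_text):
    """Return a dict of property name -> value for every pattern that matches.

    Fields that cannot be found are left out, so a reviewer can fill
    them in by hand instead of trusting a wrong guess.
    """
    props = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            # Collapse line breaks and repeated spaces left over from OCR.
            props[name] = " ".join(match.group(1).split())
    return props


if __name__ == "__main__":
    sample = (
        "Reference No: AB/2012-17\n"
        "This agreement dated 24 April 2012 is made between Acme Holdings\n"
        "Ltd and Beta LLP."
    )
    for key, value in extract_properties(sample).items():
        print(key, "=", value)
```

Leaving unmatched fields empty rather than guessing keeps the auto-populated values easy to review before they reach the content model.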

Hope it helps.

Cédric

cnerger · ‎04-24-2012

Re,

I hadn't read your last question. The accuracy will depend on:

   - the accuracy/training of your OCR system

   - the way you parse your search and its results (you will need a large number of files to test with, at least 1000, to get a significant result)

Honestly, I have worked for a year on open-source software for archive storage (in Alfresco), and you really need a very good scanner or capture device!
Getting data out of an unstructured document is quite hard because of the amount of junk in the OCR output.
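
One way to put a number on the accuracy question is to compare the auto-populated values against hand-checked values over a test set. The sketch below assumes each test document yields a pair of dictionaries (extracted values, expected values); the field names are hypothetical and the exact-match comparison is a deliberately simple choice.

```python
from collections import defaultdict


def field_accuracy(samples):
    """Compute the per-field fraction of exact matches.

    samples: iterable of (extracted, expected) dict pairs, one per document.
    A field missing from `extracted` counts as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for extracted, expected in samples:
        for field, truth in expected.items():
            total[field] += 1
            found = extracted.get(field) or ""
            # Ignore case and surrounding whitespace, nothing more.
            if found.strip().lower() == truth.strip().lower():
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


if __name__ == "__main__":
    test_set = [
        ({"legal:partyName": "Acme Ltd"}, {"legal:partyName": "ACME Ltd"}),
        ({"legal:partyName": "Acrne Ltd"}, {"legal:partyName": "Acme Ltd"}),
    ]
    print(field_accuracy(test_set))
```

With only a handful of documents the fractions say very little, which is why a test set of the size mentioned above matters.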

I think you should look at an external process, which is "trivial" compared with implementing this kind of thing inside Alfresco.

My 2 cents.

Cédric

samudaya · ‎07-09-2012

Hi all,

We can use a good OCR-enabled scanner or system to convert our image-based documents to text. However, the problem is that most of the time we have hundreds of distributed users. In that case it is not feasible, because not every user can have an OCR system. So in such a case a centralized OCR technique available within Alfresco would be very useful.

Thanks
SAMU

wmay · ‎08-01-2012

Hi,

We have implemented an OCR server integrated with Alfresco, which can be used as a transformer or via JavaScript and Java. It runs on a separate OCR server and supports Abbyy and Google OCR. For more information, see here - https://forums.alfresco.com/en/viewtopic.php?f=33&t=44739


Auto population from OCR document in to Content Model