How to automate OCR prior to uploading to Alfresco

kellerclark
Champ in-the-making

We are new to Alfresco Community. 

When we scan paper documents they are automatically OCR'ed and we are given the option to change the filename to our chosen naming convention.  We then drag and drop the file into Alfresco.  This works very well!

When our users have digital files they want to drag and drop into Alfresco, we would like to automate the following process:

1) Determining if the file needs to be OCR'ed.

2) If it needs to be OCR'ed, do it.

3) Allow us to verify that the filename matches our naming convention.  If not, give us the option to change it.

Is there an Alfresco plugin that will do this for us?  If not, what software do you suggest we use to do this?

We have tried OCR software and have found it to work well.  The problem is that it takes several steps (and some computer savvy) to do this manually before dropping a file into Alfresco.  If users forget to OCR a file first, the document is not searchable.  We would like this process to be as simple (and foolproof) as possible.

What do you suggest?

4 REPLIES

calvo
Star Collaborator

Hi,

In my case, which was similar, documents were scanned, OCR'ed and saved to a particular folder on the filesystem (a shared folder). An application watched that folder to extract metadata, rename the files and then upload them to Alfresco using CMIS.

In Alfresco, these documents were classified using content rules and scripts, depending on filename and metadata.

So maybe you can develop an external application that does everything you need before uploading the file to Alfresco using CMIS.
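A minimal Python sketch of such a watch-folder application, just to illustrate the idea. Everything here is an assumption rather than what I actually ran: `process_watch_folder`, `is_valid_name` and `upload` are hypothetical names, and `upload` is only a placeholder where the real CMIS call to Alfresco would go.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def process_watch_folder(
    folder: Path,
    is_valid_name: Callable[[str], bool],
    upload: Callable[[Path], None],
) -> Tuple[List[str], List[str]]:
    """Upload correctly named files from `folder`; report the rest.

    `is_valid_name` encodes your naming convention (hypothetical hook).
    `upload` is a placeholder for the CMIS upload to Alfresco.
    Returns (uploaded file names, rejected file names).
    """
    uploaded: List[str] = []
    rejected: List[str] = []
    # Sort so the processing order is predictable from run to run.
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue  # ignore sub-folders
        if is_valid_name(path.name):
            upload(path)
            uploaded.append(path.name)
        else:
            # Left in place so someone can rename the file and retry.
            rejected.append(path.name)
    return uploaded, rejected
```

Run on a schedule (or from a folder watcher), something like this keeps the renaming decision with a person while the upload itself needs no manual steps.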

Regards,

clv

mehe
Elite Collaborator

...there are so many possibilities.. 🙂 

Do you use only one scanner or a bunch of them?

Do you upload to a specific "inBox" or everywhere in Alfresco?

What OCR software/scanner are you using? Maybe it has some kind of scripting facility or an API for adding your own code?

The simplest approach for me is:

- use a scanner with an OCR facility
- upload the scanned documents to an "inBox" folder (using a "post-scan" script)

- in Alfresco: check the naming convention and "is there text to extract" via a "created" rule on the "inBox"

- in Alfresco: move the document to a folder depending on the naming convention (or raise an exception/move it to an error folder in the rule if the name isn't valid or no text could be extracted)
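The decision the rule has to make can be sketched like this. It is written in Python only to show the logic (an Alfresco rule script would be JavaScript), and the naming convention `<TYPE>_<YYYY-MM-DD>_<description>.pdf`, the folder paths and the function name are all made up for the example.

```python
import re
from typing import Optional

# Hypothetical convention: <TYPE>_<YYYY-MM-DD>_<description>.pdf
NAME_PATTERN = re.compile(
    r"^(?P<doc_type>[A-Z]+)_(?P<year>\d{4})-\d{2}-\d{2}_[A-Za-z0-9-]+\.pdf$"
)
ERROR_FOLDER = "/inBox/errors"  # hypothetical path


def choose_target_folder(filename: str, extracted_text: Optional[str]) -> str:
    """Return the folder a new document should be moved to.

    Documents with an invalid name or without extractable text go to the
    error folder; everything else is filed by document type and year.
    """
    match = NAME_PATTERN.match(filename)
    if match is None:
        return ERROR_FOLDER  # name does not follow the convention
    if not extracted_text or not extracted_text.strip():
        return ERROR_FOLDER  # nothing to index, probably not OCR'ed
    return f"/documents/{match['doc_type'].lower()}/{match['year']}"
```

Keeping both checks in one place means a document only leaves the "inBox" when it is both correctly named and searchable.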

kellerclark
Champ in-the-making

Thank you!  That is very helpful.  I will look into CMIS, scripts and creating rules.  That sounds like it may just be the ticket.

Paper that we scan goes smoothly into Alfresco. 

It is the files on users' computers that they drop into the system that are causing problems.  They assume that since it is a PDF it has been OCR'ed.  This may or may not be the case.  If we get a bunch of unsearchable documents into our system, users will not be able to find them later and the value of the EDMS breaks down.

Short of threatening them, how can we set it up so that only OCR'ed documents go into the system?

jpotts
World-Class Innovator

If your OCR step can also set a property, that's probably easiest. Then you can have a rule check for the presence of that property.

Alternatively, the rule could do a transform to text. If the result is empty you know it wasn't OCR'd so you move the document to an exception folder or send an email or something.
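The same "empty text means not OCR'd" test can also be run outside Alfresco, before upload. Here is a rough Python sketch that swaps the Alfresco transform for the `pdftotext` command-line tool from Poppler, which is assumed to be installed and on the PATH. The function names and the 20-character threshold are assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path

MIN_TEXT_CHARS = 20  # assumed threshold; tune it for your documents


def has_meaningful_text(text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """True if `text` contains at least `min_chars` non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def pdf_is_searchable(pdf_path: Path) -> bool:
    """Extract text with pdftotext and check that the result is not empty.

    `pdftotext <file> -` writes the extracted text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext fails, e.g. on a corrupt file
    )
    return has_meaningful_text(result.stdout)
```

A small threshold rather than a strict "empty" test is used because extraction from an image-only PDF can still return a few stray whitespace or page-break characters.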