09-19-2019 05:54 AM
Hello,
I have thousands of invoices to scan and OCR (with near 100% recognition), and I need to retrieve metadata from them (Partner, Invoice Number, Amount, Units, Currency, ...). All of this should happen in Alfresco.
Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply workflows, ...).
As a first approach:
- For the OCR I used Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).
- To retrieve the results, I convert the OCRed PDF to a plain text file and then search its content using JavaScript with document.content ... But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.
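To make the second step concrete, here is a minimal sketch of searching the extracted text with regular expressions. The patterns, the field names and the `extractFields` helper are all assumptions made for illustration, not part of Alfresco or of the OCR action; the labels on real invoices will differ per partner.

```javascript
// Hypothetical label patterns; adjust them to the wording on your own invoices.
var FIELD_PATTERNS = {
  invoiceNumber: /invoice\s*(?:no\.?|number|#)\s*[:.]?\s*([A-Z0-9][A-Z0-9\-\/]*)/i,
  amount: /total\s*(?:amount)?\s*[:.]?\s*([0-9][0-9.,]*)/i,
  currency: /\b(EUR|USD|GBP)\b/
};

// Returns an object with one entry per pattern; a field is null when
// its pattern does not match, so callers can route such invoices to manual review.
function extractFields(text) {
  var result = {};
  for (var name in FIELD_PATTERNS) {
    if (FIELD_PATTERNS.hasOwnProperty(name)) {
      var match = FIELD_PATTERNS[name].exec(text);
      result[name] = match ? match[1] : null;
    }
  }
  return result;
}

// Sample text standing in for the OCR output of one invoice.
var sample = "ACME Ltd\nInvoice No: INV-2019/0042\nTotal amount: 1,250.00 EUR\n";
var fields = extractFields(sample);
```

Inside an Alfresco script the input would be the text version's `document.content` instead of `sample`. Returning null for unmatched fields keeps OCR misses visible instead of silently filing the invoice in the wrong folder.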
So my questions are:
- How can I make the OCR results more accurate?
- How can I retrieve the important data from the invoice? Is the method I'm using good enough, or is it too poor for this kind of processing?
I'm using pdfsandwich, and my alfresco-global.properties is:
ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux
09-19-2019 10:43 AM
Switch from pdfsandwich to ocrmypdf.
ocr.command=/usr/local/bin/ocrmypdf
ocr.output.verbose=true
ocr.output.file.prefix.command=
ocr.extra.commands=--verbose 1 --force-ocr -l spa+eng+fra
ocr.server.os=linux
This will produce more accurate results.
09-27-2019 11:59 AM
Indeed, OCRmyPDF gives more accurate results.
Concerning my second question: do you have any idea how I can extract data from the OCRed PDF file based on its position in the document? For example, retrieving the invoice number, the price, ... I'm really stuck and don't know where to start. I've been googling a lot and couldn't come up with a free solution for doing this from Alfresco.
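One possible starting point for position-based extraction, sketched under assumptions: suppose the OCRed PDF has already been turned into a list of words with bounding boxes (for example from the `-bbox` output of Poppler's `pdftotext`, or from Tesseract's hOCR output; parsing either format is not shown here). The word object shape, the region coordinates and the `wordsInRegion` helper below are hypothetical.

```javascript
// Assumed word shape: { text, xMin, yMin, xMax, yMax } in page coordinates.
// Assumed region shape: { xMin, yMin, xMax, yMax } for one invoice layout.
function wordsInRegion(words, region) {
  var hits = [];
  for (var i = 0; i < words.length; i++) {
    var w = words[i];
    // Use the word's centre so words that only brush the region edge are ignored.
    var cx = (w.xMin + w.xMax) / 2;
    var cy = (w.yMin + w.yMax) / 2;
    if (cx >= region.xMin && cx <= region.xMax &&
        cy >= region.yMin && cy <= region.yMax) {
      hits.push(w);
    }
  }
  // Simplified reading order: top to bottom, then left to right.
  // Real OCR output may need a tolerance on yMin to group words into lines.
  hits.sort(function (a, b) {
    return (a.yMin - b.yMin) || (a.xMin - b.xMin);
  });
  return hits.map(function (w) { return w.text; }).join(" ");
}

// Made-up words for one page, deliberately out of reading order.
var words = [
  { text: "INV-0042", xMin: 460, yMin: 50, xMax: 530, yMax: 62 },
  { text: "ACME", xMin: 40, yMin: 50, xMax: 90, yMax: 62 },
  { text: "Invoice", xMin: 400, yMin: 50, xMax: 450, yMax: 62 },
  { text: "Total", xMin: 400, yMin: 700, xMax: 440, yMax: 712 }
];
// Hypothetical region: the top-right corner, where this layout prints the invoice number.
var headerRight = { xMin: 380, yMin: 40, xMax: 560, yMax: 80 };
var headerText = wordsInRegion(words, headerRight);
```

This only works when invoices from the same partner share a layout, so in practice there would be one set of regions per partner, with the regex approach as a fallback.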
09-27-2019 12:14 PM