topic Re: OCR a scanned file and retrieve the metadata in Alfresco Forum

OCR a scanned file and retrieve the metadata

imanez1 — Thu, 19 Sep 2019 09:54:03 GMT

Hello,

I have thousands of invoices to scan and OCR (with near 100% recognition), and I need to retrieve the relevant metadata from them (Partner, Invoice Number, Amount, Units, Currency, ...). All of this should happen in Alfresco.

Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply some workflows, ...).

As a first approach:

- For the OCR I used Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).

- For retrieving the results, I convert the OCRed PDF to a plain text file and then search its content using JavaScript with document.content ... But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.

So my questions are:

- How can I make the OCR results more accurate?

- How can I retrieve the important data from the invoice? Is the method I'm using good enough, or is it too weak for this kind of processing?

I'm using pdfsandwich, and my alfresco-global.properties is:

ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux
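
For the second question, the text-search approach described above could be sketched roughly as follows. This assumes the OCR output is already available as a plain string (in a repository script that would presumably be the value of document.content). The function names, the label patterns ("Invoice No", "Total") and the amount format are illustrative assumptions, not part of Alfresco or of the Simple OCR Action, and real invoices will likely need one pattern set per partner.

```javascript
// Hypothetical helpers: these names do not come from Alfresco or the thread.

// Convert "1,234.50" or "1.234,50" to a number. The last separator is treated
// as the decimal mark only when it is followed by one or two digits.
function parseAmount(raw) {
  var lastSep = Math.max(raw.lastIndexOf(','), raw.lastIndexOf('.'));
  var decimals = lastSep >= 0 ? raw.length - lastSep - 1 : 0;
  if (lastSep >= 0 && decimals <= 2) {
    var whole = raw.slice(0, lastSep).replace(/[.,]/g, '');
    return parseFloat(whole + '.' + raw.slice(lastSep + 1));
  }
  // No decimal part: every separator is a thousands separator.
  return parseFloat(raw.replace(/[.,]/g, ''));
}

// Pull a few fields out of OCR text by anchoring on their labels.
function extractInvoiceFields(text) {
  // "Invoice No", "Invoice No.", "Invoice Number" or "Invoice #", then the id.
  var numberPattern =
    /invoice\s*(?:no\.?|number|#)\s*[:.]?\s*([A-Z0-9][A-Z0-9\/-]{2,})/i;
  // "Total", then an amount with an optional 3-letter currency on either side.
  var totalPattern =
    /\btotal\b\s*[:.]?\s*(?:([A-Z]{3})\s+)?([0-9](?:[0-9.,]*[0-9])?)(?:\s+([A-Z]{3})\b)?/i;

  var numberMatch = numberPattern.exec(text);
  var totalMatch = totalPattern.exec(text);
  var currency = totalMatch ? (totalMatch[1] || totalMatch[3] || null) : null;

  return {
    invoiceNumber: numberMatch ? numberMatch[1] : null,
    amount: totalMatch ? parseAmount(totalMatch[2]) : null,
    currency: currency ? currency.toUpperCase() : null
  };
}
```

Fields that cannot be found come back as null, so a calling script can route such invoices to manual review instead of guessing.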

Re: OCR a scanned file and retrieve the metadata

angelborroy — Thu, 19 Sep 2019 14:43:12 GMT

Switch from pdfsandwich to ocrmypdf.

ocr.command=/usr/local/bin/ocrmypdf
ocr.output.verbose=true
ocr.output.file.prefix.command=

ocr.extra.commands=--verbose 1 --force-ocr -l spa+eng+fra
ocr.server.os=linux

This will produce more accurate results.

Re: OCR a scanned file and retrieve the metadata

imanez1 — Fri, 27 Sep 2019 15:59:16 GMT

Indeed, OCRmyPDF gives more accurate results.

Concerning my second question, do you have any idea how I can extract data from the OCRed PDF file depending on the position of the data in the document? For example, retrieving the invoice number, the price, ... I'm really stuck and don't know where to start. I've been googling a lot and couldn't come up with a free solution to do this from Alfresco.
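
One possible starting point for position-based extraction, not taken from the thread: if the OCRed PDF is converted to text in a way that preserves the page layout (for example with pdftotext -layout, which is an assumption about the toolchain), a value can be located by the column of its header. The sketch below is plain JavaScript with hypothetical function names; it picks the cell on the next non-empty line whose start is closest to the header's column, so it tolerates a few characters of drift.

```javascript
// Hypothetical helpers: these names do not come from Alfresco or the thread.

// Return the cell in `row` whose starting offset is closest to `col`.
// A "cell" is a run of words separated by single spaces; two or more
// spaces are treated as a column gap.
function nearestCell(row, col) {
  var cellPattern = /\S+(?: \S+)*/g;
  var best = null;
  var bestDistance = Infinity;
  var match;
  while ((match = cellPattern.exec(row)) !== null) {
    var distance = Math.abs(match.index - col);
    if (distance < bestDistance) {
      best = match[0];
      bestDistance = distance;
    }
  }
  return best;
}

// Find `header` in layout-preserving text and return the cell located
// under it on the next non-empty line, or null if the header is missing.
function valueBelowHeader(layoutText, header) {
  var lines = layoutText.split(/\r?\n/);
  for (var i = 0; i < lines.length; i++) {
    var col = lines[i].indexOf(header);
    if (col === -1) {
      continue;
    }
    for (var j = i + 1; j < lines.length; j++) {
      if (lines[j].trim() !== '') {
        return nearestCell(lines[j], col);
      }
    }
  }
  return null;
}
```

This only works when each partner's invoices keep a stable layout; values that sit to the right of their label rather than below it are better handled with label-anchored patterns.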

Re: OCR a scanned file and retrieve the metadata

jpotts — Fri, 27 Sep 2019 16:14:39 GMT

Cross-posted at https://stackoverflow.com/questions/58116051/ocr-a-scanned-file-and-retrieve-the-metadata