<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: OCR a scanned file and retrieve the metadata in Alfresco Forum</title>
    <link>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83735#M25599</link>
    <description>&lt;P&gt;Switch from pdfsandwich to ocrmypdf.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;ocr.command=/usr/local/bin/ocrmypdf
ocr.output.verbose=true
ocr.output.file.prefix.command=

ocr.extra.commands=--verbose 1 --force-ocr -l spa+eng+fra
ocr.server.os=linux&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This will produce more accurate results.&lt;/P&gt;</description>
    <pubDate>Thu, 19 Sep 2019 14:43:12 GMT</pubDate>
    <dc:creator>angelborroy</dc:creator>
    <dc:date>2019-09-19T14:43:12Z</dc:date>
    <item>
      <title>OCR a scanned file and retrieve the metadata</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83734#M25598</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have thousands of invoices to scan and OCR (with near 100% recognition), and then I need to retrieve the required metadata (Partner, Invoice Number, Amount, Units, Currency, ...). &lt;EM&gt;(All of this in Alfresco)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply some workflows, ...).&lt;/P&gt;&lt;P&gt;As a first approach:&lt;/P&gt;&lt;P&gt;- For the OCR I used &lt;A href="https://github.com/keensoft/alfresco-simple-ocr" target="_blank" rel="noopener nofollow noreferrer"&gt;Alfresco Simple OCR Action&lt;/A&gt;, but the result is not very accurate (far from 100%).&lt;/P&gt;&lt;P&gt;- To retrieve the metadata, I convert the OCRed PDF to a plain text file and then search its content using JavaScript with &lt;EM&gt;document.content&lt;/EM&gt;. But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.&lt;/P&gt;&lt;P&gt;So my questions are:&lt;/P&gt;&lt;P&gt;- How can I make the OCR results more accurate?&lt;/P&gt;&lt;P&gt;- How can I retrieve the important data from the invoice? Is the method I'm using good enough, or is it too crude for this kind of processing?&lt;/P&gt;&lt;P&gt;I'm using pdfsandwich, and my &lt;STRONG&gt;alfresco-global.properties&lt;/STRONG&gt; is:&lt;/P&gt;&lt;PRE&gt;ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux&lt;/PRE&gt;</description>
      <pubDate>Thu, 19 Sep 2019 09:54:03 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83734#M25598</guid>
      <dc:creator>imanez1</dc:creator>
      <dc:date>2019-09-19T09:54:03Z</dc:date>
    </item>
    <item>
      <title>Re: OCR a scanned file and retrieve the metadata</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83735#M25599</link>
      <description>&lt;P&gt;Switch from pdfsandwich to ocrmypdf.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;ocr.command=/usr/local/bin/ocrmypdf
ocr.output.verbose=true
ocr.output.file.prefix.command=

ocr.extra.commands=--verbose 1 --force-ocr -l spa+eng+fra
ocr.server.os=linux&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This will produce more accurate results.&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 14:43:12 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83735#M25599</guid>
      <dc:creator>angelborroy</dc:creator>
      <dc:date>2019-09-19T14:43:12Z</dc:date>
    </item>
    <item>
      <title>Re: OCR a scanned file and retrieve the metadata</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83736#M25600</link>
      <description>&lt;P&gt;Indeed, OCRmyPDF gives more accurate results.&lt;/P&gt;&lt;P&gt;Concerning my second question, do you have any idea how I can extract data from the OCRed PDF file based on its position in the document? For example, retrieving the invoice number, the price, and so on. I'm really stuck and don't know where to start. I've been googling a lot and couldn't find a free solution for doing this from Alfresco.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Sep 2019 15:59:16 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83736#M25600</guid>
      <dc:creator>imanez1</dc:creator>
      <dc:date>2019-09-27T15:59:16Z</dc:date>
    </item>
    <item>
      <title>Re: OCR a scanned file and retrieve the metadata</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83737#M25601</link>
      <description>&lt;P&gt;Cross-posted at &lt;A href="https://stackoverflow.com/questions/58116051/ocr-a-scanned-file-and-retrieve-the-metadata" target="_blank" rel="nofollow noopener noreferrer"&gt;https://stackoverflow.com/questions/58116051/ocr-a-scanned-file-and-retrieve-the-metadata&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 27 Sep 2019 16:14:39 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/ocr-a-scanned-file-and-retrieve-the-metadata/m-p/83737#M25601</guid>
      <dc:creator>jpotts</dc:creator>
      <dc:date>2019-09-27T16:14:39Z</dc:date>
    </item>
  </channel>
</rss>

