<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic OCR in place, document versioning in Alfresco Archive</title>
    <link>https://connect.hyland.com/t5/alfresco-archive/ocr-in-place-document-versioning/m-p/301264#M254394</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;We have an OCR server (ABBYY Recognition Server) that I have integrated into Alfresco using the ABBYY SOAP interface. We have had PDF to TXT, JPG to TXT, TIFF to TXT, and a few other transforms in place for some time. I wanted a mechanism where, if someone uploads a scanned PDF (without a text layer), the document is automatically versioned and its content replaced with an OCR-processed PDF from the OCR server.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I came up with the following JS code that basically works:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;&lt;BR /&gt;// Make sure the document keeps a version history.&lt;BR /&gt;if (!document.isVersioned) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;document.addAspect("cm:versionable");&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;// Run the custom OCR transform; the result is created in the "tmp" folder under Company Home.&lt;BR /&gt;var tempDir = companyhome.childByNamePath("tmp");&lt;BR /&gt;var updatedVersion = document.transformDocument("application/vnd.dac", tempDir);&lt;BR /&gt;&lt;BR /&gt;// Write the OCRd content into a working copy and check it in as a new minor version.&lt;BR /&gt;var workingCopy = document.checkout();&lt;BR /&gt;workingCopy.properties.content.write(updatedVersion.properties.content);&lt;BR /&gt;workingCopy.properties.content.mimetype = "application/pdf";&lt;BR /&gt;workingCopy.save();&lt;BR /&gt;workingCopy.checkin("OCRd by ABBYY", false);&lt;BR /&gt;&lt;BR /&gt;// Clean up the temporary transform result.&lt;BR /&gt;updatedVersion.remove();&lt;BR /&gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;SPAN&gt;As a brief explanation, I invented the mimetype "application/vnd.dac" so that I could have a transform in the system that sends a file to the OCR server and gets back an OCRd PDF. The content of that file is then written into the new version and its mimetype is set back to "application/pdf", so the made-up mimetype is only temporary.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Right now I'm running on a 3.1.2 machine. This will eventually be moved to a 4.2 box, but that box isn't quite ready for prime time.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;If I manually run the script on an existing file, it works fine. However, I want it to run automatically, so I set it up as a rule on a space, and that is where I get some odd behavior. If I don't run the rule in the background, the web interface makes the user wait on the OCR server, which could be fast or could take several minutes. That is not ideal for the user, but it is expected. If I set the rule to run in the background, the behavior gets strange: the document ends up with two versions, but version 1.0 is the OCRd document and the current version, 1.1, is the original upload (see attached image).&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://forums.alfresco.com/sites/forums/files/Screen%20Shot%202014-05-19%20at%204.57.04%20PM.png" rel="nofollow noopener noreferrer"&gt;https://forums.alfresco.com/sites/forums/files/Screen%20Shot%202014-05-19%20at%204.57.04%20PM.png&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;So I figure the rule/script is getting ahead of the file upload process. My first thought was to add a delay and some kind of test to check that the upload has finished before triggering the OCR process. However, the Rhino JS engine doesn't seem to implement setInterval(), and I'm not sure how else to get that effect. I'm open to suggestions on why this is happening and how I might get around it.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Thanks,&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Geof&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Tue, 20 May 2014 14:52:12 GMT</pubDate>
    <dc:creator>abruzzi</dc:creator>
    <dc:date>2014-05-20T14:52:12Z</dc:date>
    <item>
      <title>OCR in place, document versioning</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/ocr-in-place-document-versioning/m-p/301264#M254394</link>
      <description>We have an OCR server (ABBYY Recognition Server) that I have integrated into Alfresco using the ABBYY SOAP interface. We have had a PDF to TXT, JPG to TXT, TIFF to TXT, and a few other transforms implemented for some time. I wanted to come up with a mechanism where if someone uploaded a scanned PD</description>
      <pubDate>Tue, 20 May 2014 14:52:12 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/ocr-in-place-document-versioning/m-p/301264#M254394</guid>
      <dc:creator>abruzzi</dc:creator>
      <dc:date>2014-05-20T14:52:12Z</dc:date>
    </item>
    <item>
      <title>Re: OCR in place, document versioning</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/ocr-in-place-document-versioning/m-p/301265#M254395</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;I think implementing a behavior bound to OnCreateNodePolicy could be one option. Inside it you can add the versioning logic that should run when a node gets created. Implemented this way, you will not need to set up any business rule. Give it a try and see whether it works for you.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Hope this helps.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 20 May 2014 17:24:00 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/ocr-in-place-document-versioning/m-p/301265#M254395</guid>
      <dc:creator>romschn</dc:creator>
      <dc:date>2014-05-20T17:24:00Z</dc:date>
    </item>
  </channel>
</rss>

