<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: OCR pdf Scans search Alfresco Community 5 in Alfresco Archive</title>
    <link>https://connect.hyland.com/t5/alfresco-archive/ocr-pdf-scans-search-alfresco-community-5/m-p/303720#M256850</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi, we faced the same problem. We solved it by first converting the PDF to JPEG or PNG files and then running Tesseract on those image files.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I use the following command to burst a multi-page PDF into individual pages:&lt;/SPAN&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;pdftk test.pdf burst&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Then convert each PDF page into a JPEG:&lt;/SPAN&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;convert -density 175 page1.pdf temp_1.jpg&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Then run Tesseract on each JPEG, using the PDF output option:&lt;/SPAN&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;tesseract temp_1.jpg target_page_1 pdf&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Then use pdfunite to merge all the resulting PDF files:&lt;/SPAN&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;pdfunite $tempfolder_tess/*.pdf final.pdf&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Tue, 22 Sep 2015 06:54:10 GMT</pubDate>
    <dc:creator>villdre</dc:creator>
    <dc:date>2015-09-22T06:54:10Z</dc:date>
    <item>
      <title>OCR pdf Scans search Alfresco Community 5</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/ocr-pdf-scans-search-alfresco-community-5/m-p/303719#M256849</link>
      <description>Hi, I'm currently evaluating Alfresco Community. My key requirement is to use Alfresco as a document management system, with the goal of paper-free work. I integrated Tesseract to OCR my uploaded TIFF, JPG and PNG files. It works fine, and all the text ends up in the search index. But what I need, woul</description>
      <pubDate>Tue, 22 Sep 2015 06:14:50 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/ocr-pdf-scans-search-alfresco-community-5/m-p/303719#M256849</guid>
      <dc:creator>r_grandits</dc:creator>
      <dc:date>2015-09-22T06:14:50Z</dc:date>
    </item>
    <item>
      <title>Re: OCR pdf Scans search Alfresco Community 5</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/ocr-pdf-scans-search-alfresco-community-5/m-p/303720#M256850</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi, we faced the same problem. We solved it by first converting the PDF to JPEG or PNG files and then running Tesseract on those image files.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I use the following command to burst a multi-page PDF into individual pages:&lt;/SPAN&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;pdftk test.pdf burst&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Then convert each PDF page into a JPEG:&lt;/SPAN&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;convert -density 175 page1.pdf temp_1.jpg&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Then run Tesseract on each JPEG, using the PDF output option:&lt;/SPAN&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;tesseract temp_1.jpg target_page_1 pdf&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Then use pdfunite to merge all the resulting PDF files:&lt;/SPAN&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;pdfunite $tempfolder_tess/*.pdf final.pdf&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 22 Sep 2015 06:54:10 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/ocr-pdf-scans-search-alfresco-community-5/m-p/303720#M256850</guid>
      <dc:creator>villdre</dc:creator>
      <dc:date>2015-09-22T06:54:10Z</dc:date>
    </item>
    <item>
      <title>Re: OCR pdf Scans search Alfresco Community 5</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/ocr-pdf-scans-search-alfresco-community-5/m-p/303721#M256851</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi Rene,&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;I have Alfresco 5.0.d installed on Ubuntu 14.04 using the Alfresco install wizard.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;When I try to place an XML bean in shared/classes/alfresco/extension, I can't log in to Alfresco Share and &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;I get Solr errors. Can you tell me how you integrated Tesseract with Alfresco?&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Thanks, &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Aadam&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 09 May 2016 23:34:53 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/ocr-pdf-scans-search-alfresco-community-5/m-p/303721#M256851</guid>
      <dc:creator>aadamnz</dc:creator>
      <dc:date>2016-05-09T23:34:53Z</dc:date>
    </item>
  </channel>
</rss>

