The AutoOCR Server is integrated via REST as a dynamically configurable Alfresco document transformer. AutoOCR creates searchable PDFs or other document formats such as TXT, DOC(X), XLS(X), PPT(X), XML, RTF and HTML from image or PDF files. The OCR functions can be used via Java, JavaScript or as a document transformer. Configuration is done in the Share UI, which also gains a new document action, “Transform”, that gives access to all Alfresco transformers.
AutoOCR is an OCR server / service based on the ABBYY OCR engine. The AutoOCR server has a REST web service interface, which was used to integrate it with Alfresco. AutoOCR converts image or PDF files to searchable PDFs. In addition to PDF, other document formats such as TXT, DOC(X), XLS(X), PPT(X), XML, RTF and HTML can also be created.
The configuration is simple and uses OCR profiles to bundle all possible settings. The direct integration of AutoOCR into Alfresco is delivered as an AMP install module. OCR functions are available in Alfresco as a dynamically configurable transformer. Appropriate bindings also allow the OCR services to be used from JavaScript and Java. From Alfresco 4.0 onward, configuration and monitoring are done directly in the UI of the Share administration console.
In addition, we have extended the Alfresco Share document actions with the Alfresco transformer integration. Transformer functions are available on any document via the Share interface and allow documents to be converted into different formats.
AutoOCR as Alfresco Transformer: The OCR function can be bound to a folder as an action. When, for example, a scanned document is placed in this folder, processing starts automatically and the document is passed to the AutoOCR server. The result is a searchable PDF or other document format that can immediately be searched and found via the Alfresco full-text index.
AutoOCR JavaScript binding for Alfresco: The JavaScript API allows direct access to the AutoOCR service from Alfresco scripts. All features of the AutoOCR API can be addressed from repository JavaScript (web script controller scripts, scripted actions). This API is completely independent of the integration of the AutoOCR service as an Alfresco transformer.
Alfresco Share – “Transform” document action: The additional “Transform” document action in the Share UI lets you use all your Alfresco transformers, not only the AutoOCR transformers. The “Transform” action is implemented generically and is not OCR-specific.
Highlights / features:
* Direct AutoOCR integration as an Alfresco transformer with a REST web service interface.
* Separate AutoOCR service / server that does not put load on the Alfresco server.
* Based on ABBYY, the leading OCR engine.
* Easy configuration by selecting OCR profiles; a profile combines all available ABBYY OCR engine settings.
* In addition to PDF, other output formats can be generated (TXT, RTF, DOC, etc.).
* Dynamic transformer configuration at runtime using the Alfresco Share Admin interface.
* JavaScript client for the AutoOCR service, available in Alfresco repository scripts (WebScripts, actions, etc.).
* Java client for the AutoOCR service, for use in Java code. The Java client itself has no dependencies on Alfresco.
* New Share document action “Transform” that extends Share not only with OCR but with all supported transformers.
Requirements:
* Alfresco 4.x – dynamic configuration via the Share user interface
* Alfresco 3.x – manual configuration without the Share UI
* AutoOCR version 1.9.8 or later, running as a service on Microsoft Windows
* ABBYY FineReader Engine 10 (starting at 10,000 pages per month)
With the new AutoOCR version 1.10.3, the following new features are available for the ifresco AutoOCR Transformer for Alfresco:
* iOCR - new default OCR engine in addition to ABBYY
* Intelligent processing of PDF documents
* Alfresco integration - ready to test without installing an OCR server; you can use our AutoOCR test server, which is accessible from the internet
* New step-by-step installation and setup documentation
iOCR - additional OCR engine available: Starting with AutoOCR version 1.10.3, the setup installs iOCR as the default OCR engine, which can be used standalone or in addition to the ABBYY OCR engine. iOCR has no page license limitations, accepts PDF, TIFF or JPEG as input, and can generate searchable PDFs and TXT files.
Differences between iOCR and ABBYY
* iOCR supports fewer languages than ABBYY
* No mixed-language recognition - only one main language can be selected
* Not the same level of accuracy and recognition quality as ABBYY
* No image pre-processing functions
* No page orientation detection (autorotate)
* Fewer configurable functions and features, and fewer input / output formats
Nevertheless, iOCR is a good solution for low-cost, high-volume OCR, for example to extract text from PDFs and images in order to build up a full-text index (e.g. Alfresco Transformer > TXT), and to create searchable PDFs from good-quality scans.
It is best to run tests with your own documents to see which OCR engine fits your needs. Both engines, ABBYY and iOCR, can be installed and used in parallel; you only have to create different OCR profiles for the different settings and OCR engines. Both OCR engines can also be tested using our ready-to-use AutoOCR test server (autoocr.may.co.at).
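The differences listed above can be read as a simple decision rule. The following Python sketch is only an illustration of that rule: the `OcrJob` fields and the `choose_engine` function are hypothetical names, not part of AutoOCR, and the sketch does not model language coverage or recognition quality, which still have to be judged by testing real documents.

```python
from dataclasses import dataclass

# Input and output formats that the text above attributes to iOCR.
IOCR_INPUTS = {"pdf", "tiff", "jpeg"}
IOCR_OUTPUTS = {"pdf", "txt"}


@dataclass
class OcrJob:
    """Hypothetical description of an OCR job (illustrative only)."""
    input_format: str
    output_format: str
    mixed_languages: bool = False
    needs_autorotate: bool = False
    needs_preprocessing: bool = False


def choose_engine(job: OcrJob) -> str:
    """Return "iOCR" if the job stays within iOCR's stated limits, else "ABBYY"."""
    within_iocr_limits = (
        job.input_format.lower() in IOCR_INPUTS
        and job.output_format.lower() in IOCR_OUTPUTS
        and not job.mixed_languages
        and not job.needs_autorotate
        and not job.needs_preprocessing
    )
    return "iOCR" if within_iocr_limits else "ABBYY"
```

In practice the outcome of such a rule would map to one OCR profile per engine on the AutoOCR server, as described above.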
Intelligent PDF processing: A PDF document may consist only of scanned images, or it may have been created by a printer driver or a direct PDF export, for example. An image-only PDF contains no text and has to be OCR processed. Other, “normal” PDFs already contain text and do not need OCR. The Alfresco transformer cannot tell these two kinds apart and therefore cannot decide whether a PDF has to be OCR processed. Because OCR processing costs time and resources, AutoOCR version 1.10.3 introduces “intelligent PDF-OCR processing”. When this option is enabled, each PDF document sent to the AutoOCR server is checked first; if the file already contains text, it is not OCR processed, and the PDF or the extracted TXT data is sent back directly. To enable this feature, the OCR profile on the AutoOCR server has to be configured for “intelligent OCR processing of PDF files”.
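The decision described here can be sketched in a few lines of Python. This is not AutoOCR's implementation, which the text does not describe; it assumes that the text of each page has already been extracted by some PDF library (not shown), and the names `contains_text`, `process_pdf` and `run_ocr` are hypothetical.

```python
from typing import Callable, Sequence


def contains_text(page_texts: Sequence[str]) -> bool:
    """True if at least one page has non-whitespace text."""
    return any(text.strip() for text in page_texts)


def process_pdf(
    pdf: bytes,
    page_texts: Sequence[str],
    run_ocr: Callable[[bytes], bytes],
) -> bytes:
    """Skip OCR for PDFs that already contain text; otherwise run OCR.

    `page_texts` is assumed to come from a separate text extraction step,
    and `run_ocr` stands in for the call to the OCR engine.
    """
    if contains_text(page_texts):
        return pdf  # already searchable, send back unchanged
    return run_ocr(pdf)
```

The sketch treats any extracted text as sufficient to skip OCR, which matches the rule stated above.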
AutoOCR Test server - ready to use: Installing 2 AMPs integrates the AutoOCR server with Alfresco. The integration works like a standard Alfresco transformer and can also be used via scripting or Java. AutoOCR and Alfresco communicate via HTTP(S) using REST. To make it easier to start testing AutoOCR and the Alfresco integration, you can use our pre-installed and configured AutoOCR test server (autoocr.may.co.at), which is reachable over the internet and has both OCR engines (ABBYY and iOCR) installed.
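To give a rough idea of what "HTTP(S) using REST" means for a client, the Python sketch below builds, but does not send, a POST request carrying a document. The actual AutoOCR REST API is not described in this text, so the `/convert` path, the `profile` query parameter and the function name `build_ocr_request` are placeholders chosen for illustration only; consult the AutoOCR documentation for the real endpoints.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_ocr_request(
    base_url: str,
    profile: str,
    document: bytes,
    content_type: str = "application/pdf",
) -> Request:
    """Build a POST request that uploads a document for OCR.

    HYPOTHETICAL: the "/convert" path and the "profile" parameter are
    placeholders, not the documented AutoOCR REST interface.
    """
    query = urlencode({"profile": profile})
    url = f"{base_url.rstrip('/')}/convert?{query}"
    return Request(
        url,
        data=document,
        method="POST",
        headers={"Content-Type": content_type},
    )
```

Sending the request (for example with `urllib.request.urlopen`) is deliberately left out, since the response format would also have to be taken from the real API.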