nuxeo-platform-ocr and image PDFs

rbahntje_Bahntj — Thu, 15 Sep 2011 10:37:39 GMT

I have installed the nuxeo-platform-ocr plugin ( https://github.com/nuxeo/nuxeo-platform-ocr#readme ) and it is working very nicely, but I am not able to run the OCR on image PDFs.

Is there any plugin to do this?

Regards

Ruben Bahntje Ushuaia - Argentina

Re: nuxeo-platform-ocr and image PDFs

Olivier_Grisel — Thu, 15 Sep 2011 18:37:59 GMT

Great to learn that you could install this addon successfully despite the list of non-trivial dependencies to build from source 🙂

To make it work on PDF files, the image files (e.g. JPEG files) embedded in them would first have to be extracted. If you are a Java developer, this should be doable with PDFBox ( http://pdfbox.apache.org/ ), e.g. you can take a class from the PDFBox source tree as an example.
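
To give a rough idea of what "extracting the embedded images" means, here is a minimal sketch that does not use PDFBox at all. It is a crude stand-in built only on the Java standard library: it scans the raw PDF bytes for JPEG start and end markers and copies out whatever lies between them. This only finds images stored as plain JPEG (DCTDecode) streams, it will miss scans stored with other filters, and it can split a JPEG that carries an embedded thumbnail. The class name PdfJpegCarver and its methods are hypothetical and are not part of nuxeo-platform-ocr or PDFBox. For real use, a proper PDF parser such as PDFBox is the better route.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (assumption): not part of nuxeo-platform-ocr or PDFBox. */
final class PdfJpegCarver {

    /**
     * Returns every byte range that starts with a JPEG SOI marker (FF D8 FF)
     * and ends at the next EOI marker (FF D9), in order of appearance.
     */
    static List<byte[]> extractJpegs(byte[] pdf) {
        List<byte[]> images = new ArrayList<>();
        int i = 0;
        while (i + 2 < pdf.length) {
            if (isSoi(pdf, i)) {
                int end = findEoi(pdf, i + 2);
                if (end < 0) {
                    break; // start marker without an end marker: stop scanning
                }
                images.add(Arrays.copyOfRange(pdf, i, end));
                i = end;
            } else {
                i++;
            }
        }
        return images;
    }

    /** True if the three bytes at index i are FF D8 FF. */
    private static boolean isSoi(byte[] b, int i) {
        return (b[i] & 0xFF) == 0xFF
                && (b[i + 1] & 0xFF) == 0xD8
                && (b[i + 2] & 0xFF) == 0xFF;
    }

    /** Returns the index just past the next FF D9 at or after 'from', or -1 if none. */
    private static int findEoi(byte[] b, int from) {
        for (int j = from; j + 1 < b.length; j++) {
            if ((b[j] & 0xFF) == 0xFF && (b[j + 1] & 0xFF) == 0xD9) {
                return j + 2;
            }
        }
        return -1;
    }

    /** Usage: java PdfJpegCarver input.pdf output-directory */
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PdfJpegCarver <input.pdf> <output-dir>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        Path outDir = Files.createDirectories(Paths.get(args[1]));
        List<byte[]> images = extractJpegs(pdf);
        for (int n = 0; n < images.size(); n++) {
            Files.write(outDir.resolve(String.format("image-%03d.jpg", n)), images.get(n));
        }
        System.out.println(images.size() + " JPEG stream(s) written to " + outDir);
    }
}
```

The JPEG files written this way could then be fed to the OCR step one by one, in the same way as ordinary image documents.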

The source code of the OCR plugin is not too complicated to dive into, and I can probably assist you on the nuxeo-dev mailing list or, better, directly through the inline review system on a pull request on GitHub.
