Nuxeo-Platform-OCR Question

Soni_
Champ on-the-rise

Hi:

I'm trying to install 'Nuxeo-platform-ocr' (https://github.com/nuxeo/nuxeo-platform-ocr), but I do not know where to find the 'content_in_doc' file that Nuxeo needs in order to analyze documents.

I have followed the README at https://github.com/nuxeo/nuxeo-platform-ocr, but it is not clear where the file is located.

I'm using Ubuntu 10.11 + Tesseract 3 + Nuxeo + Olena (Scribo)

Could you tell me where I can find the 'content_in_doc' file?

Thanks, and regards.

30 REPLIES

Olivier_Grisel
Star Contributor

I just tried to build against the latest stable version (2.0) of Olena and it seems to work fine. I have updated the README.md of nuxeo-platform-ocr to point to the right source archive.

Beware that the Olena build has several steps and needs 2 calls to make in 2 separate folders (the build root and the scribo/src subfolder):

$ wget http://www.lrde.epita.fr/dload/olena/2.0/olena-2.0.tar.bz2
$ tar jxvf olena-*.tar.bz2
$ cd olena-2.0/
$ mkdir _build
$ cd _build
$ ../configure && make
$ cd scribo/src
$ make

The scribo/src folder should then hold the content_in_doc binary. If not, check for error messages in the output of the build. Maybe you are missing the development headers for tesseract? Have you built tesseract 3 from the source tarball and installed it system-wide using sudo make install?
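
A quick way to confirm the result of the two make calls is to test for the binary directly. This is only a minimal sketch, assuming the _build layout from the commands above; the function name check_content_in_doc is made up for illustration and is not part of Olena or nuxeo-platform-ocr.

```shell
# Hypothetical helper: report whether the build produced content_in_doc.
# $1 is the build root (the _build folder in which configure was run).
check_content_in_doc() {
    binary="$1/scribo/src/content_in_doc"
    if [ -x "$binary" ]; then
        echo "found: $binary"
    else
        echo "missing: $binary"
        return 1
    fi
}
```

For example, check_content_in_doc olena-2.0/_build should print the path of the binary if both make calls succeeded.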

I've compiled Olena 1.0 with Tesseract 3.0 with no problem.

As written in the README.md file, and as I already answered, you have to run make in the $SOURCE_ROOT/_build/scribo/src folder as well; the content_in_doc binary will be created there.

I am running make inside the $SOURCE_ROOT/_build/scribo/src folder.

I just tried from scratch in a new empty folder with the original tarball: the content_in_doc related lines in the Makefile are not commented out and the binary is built successfully. I suspect that in your case the configure script did not detect one of the required dependencies.
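
One way to see what configure detected is to search the config.log file that autoconf-generated configure scripts write in the folder where they run. A minimal sketch, assuming config.log sits in the _build folder; the helper name show_dependency_checks is hypothetical, and the exact wording of the matching lines depends on Olena's configure script, which this thread does not show.

```shell
# Hypothetical helper: list the lines of a config.log that mention a
# dependency, with line numbers, so a failed detection is easier to spot.
# $1 is the path to config.log, $2 is the dependency name (case-insensitive).
show_dependency_checks() {
    grep -i -n -- "$2" "$1"
}
```

For example, show_dependency_checks olena-2.0/_build/config.log tesseract prints every line mentioning tesseract; no output at all would suggest configure never looked for it or the log is from a different folder.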

Right now I'm trying to compile Olena/content_in_doc on Debian Squeeze. I had to install the following packages to get content_in_doc enabled in the Makefiles

In my case I built tesseract 3 from the source tarball (it is not yet available in Ubuntu; I don't know about Debian). In practice, tesseract 3 gives much better results than tesseract 2.

Here I did it using Squeeze's own Tesseract.

Yet another try: I did it using hand-compiled libleptonica and libtesseract (3). Apparently, Olena 2 only detects the latter when it is compiled "--with-multiple-libraries" (so that it provides libtesseract_api.so and so on, and not just libtesseract.so).
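
To tell which of the two layouts a given Tesseract install ended up with, one can look at the file names in its library folder. A minimal sketch, assuming the libraries are in a single folder such as /usr/local/lib (the install prefix is an assumption, and tesseract_lib_layout is a made-up name); it only looks at file names and does not check whether Olena's configure will accept them.

```shell
# Hypothetical helper: report whether a library folder holds the split
# Tesseract libraries (libtesseract_api.so and so on) or only libtesseract.so.
# $1 is the library folder to inspect, e.g. /usr/local/lib (assumed prefix).
tesseract_lib_layout() {
    if ls "$1"/libtesseract_api.so* >/dev/null 2>&1; then
        echo "multiple"
    elif ls "$1"/libtesseract.so* >/dev/null 2>&1; then
        echo "single"
    else
        echo "none"
        return 1
    fi
}
```

For example, tesseract_lib_layout /usr/local/lib printing "single" would match the case described above where Olena 2 does not detect libtesseract.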