
ocrmypdf error in Alfresco 6.1 on Linux

jbrasil
Confirmed Champ

Hey guys,
OCR is not being generated from within the Alfresco platform.

See the logs below:

tail -f /opt/alfresco/tomcat/logs/catalina.out

command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 08140019 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 08140019 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)


root@pmituiutaba:/opt/alfresco/logs# gs --version
9.26

root@pmituiutaba:/opt/alfresco/logs# pip3 --version
pip 20.2.3 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)

root@pmituiutaba:/opt/alfresco/logs# tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX
Found SSE

root@pmituiutaba:/opt/alfresco/logs# ocrmypdf --version
6.1.2

root@pmituiutaba:/opt/alfresco/logs# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

cat alfresco.log | grep -i "Current version"
2020-09-15 00:04:09,348 INFO [org.alfresco.service.descriptor.DescriptorService] [localhost-startStop-1] Alfresco Content Services started (Community). Current version: 6.1.1 (r9d03d2fd-b168) schema 12,001. Originally installed version: 6.1.1 (r9d03d2fd-b168) schema 12,001.

cat /etc/sudoers
#
# This file MUST be edited with the 'visudo' command as root.
#
# Please consider adding local content in /etc/sudoers.d/ instead of
# directly modifying this file.
#
# See the man page for details on how to write a sudoers file.
#
Defaults env_reset
Defaults mail_badpass
Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"

# Host alias specification

# User alias specification

# Cmnd alias specification

# User privilege specification
root ALL=(ALL:ALL) ALL
alfresco ALL=(ALL) NOPASSWD: ALL

# Members of the admin group may gain root privileges
%admin ALL=(ALL) ALL

# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) ALL

# See sudoers(5) for more information on "#include" directives:

#includedir /etc/sudoers.d

cat /opt/alfresco/tomcat/shared/classes/alfresco-global.properties | grep -i "ocr"
#### OCR mit OCRmyPDF
ocr.command=/opt/alfresco/scripts/ocrmypdf.sh
ocr.output.verbose=false
ocr.output.file.prefix.command=
ocr.extra.commands=--verbose 1 --force-ocr -l por+eng
ocr.server.os=linux

/opt/alfresco/modules/share# l
total 12K
-rw-r--r-- 1 root root 12K Sep 14 18:48 simple-ocr-share-2.3.1.jar

/opt/alfresco/modules/platform# l
total 28K
-rw-r--r-- 1 root root 28K Sep 14 18:48 simple-ocr-repo-2.3.1.jar

Can you help please?
Thanks a lot!

6 REPLIES

kaynezhang
World-Class Innovator
You can test the command directly in the shell using an example file and see what happens:
/opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /***/***src.pdf  /***/***target.pdf 
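
A command that works in an interactive root shell can still fail when a service launches it, because the service may pass a much smaller environment to its child processes. As a rough sketch (the helper name `run_minimal_env` and the fallback `PATH` are assumptions, not part of Alfresco or OCRmyPDF), the same command can be retried with the inherited environment stripped:

```shell
#!/usr/bin/env bash
# Sketch: run a command with a near-empty environment to approximate what a
# service process might hand to its children. This is an approximation only;
# the real Tomcat environment may differ.

# run_minimal_env is a hypothetical helper name used for illustration.
run_minimal_env() {
    # env -i discards all inherited variables; only a basic PATH is passed on.
    env -i PATH="/usr/local/bin:/usr/bin:/bin" "$@"
}

# Example call (the PDF paths are placeholders):
# run_minimal_env /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /path/to/src.pdf /path/to/target.pdf
```

If the service runs as a dedicated account (for instance a user named `alfresco`, as the sudoers file suggests), running the test as that user with `sudo -u alfresco` instead of as root may be closer to what the platform actually does.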

Hi kaynezhang,
When I run it from the Linux shell, it works perfectly.
See the log:

./ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /home/jbrasil/teste33.pdf /home/jbrasil/teste33-v2.pdf
DEBUG - ocrmypdf 6.1.2
DEBUG - tesseract 4.0.0-beta.1
DEBUG - qpdf 8.0.2
DEBUG - PyMuPDF not installed
DEBUG - os.symlink(/home/jbrasil/teste33.pdf, /tmp/com.github.ocrmypdf.l22048pv/origin)

________________________________________
Tasks which will be run:


Task enters queue = 'ocrmypdf.pipeline.triage'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/origin, /tmp/com.github.ocrmypdf.l22048pv/origin.pdf)
Completed Task = 'ocrmypdf.pipeline.triage'
Task enters queue = 'ocrmypdf.pipeline.repair_and_parse_pdf'
DEBUG - Beginning qpdf repair...
DEBUG - Repair OK; beginning parse...
DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf.pipeline.repair_and_parse_pdf'
Task enters queue = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.split_page'
Completed Task = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.split_page'
Task enters queue = 'ocrmypdf.pipeline.ocr_or_skip'
INFO - 1: page already has text! – rasterizing text and running OCR anyway
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.pdf, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.pipeline.ocr_or_skip'
Task enters queue = 'ocrmypdf.pipeline.orient_page'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.pipeline.orient_page'
Task enters queue = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.skip_page'
Uptodate Task = 'ocrmypdf.pipeline.skip_page'


WARNING:
In Task 'ocrmypdf.pipeline.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.

DEBUG - Rasterize 000001.ocr.oriented.pdf with png16m
DEBUG -
Completed Task = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.preprocess_remove_background'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-background.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_remove_background'
Task enters queue = 'ocrmypdf.pipeline.preprocess_deskew'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-background.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_deskew'
Task enters queue = 'ocrmypdf.pipeline.preprocess_clean'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-clean.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_clean'
Task enters queue = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_ocr_image'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.png, /tmp/com.github.ocrmypdf.l22048pv/000001.image)
Completed Task = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_image_layer'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-clean.png, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.png)
Completed Task = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.ocr_tesseract_textonly_pdf'
DEBUG - 1: convert
DEBUG - ['tesseract', '-l', 'por+eng', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.l22048pv/000001.ocr.png', '/tmp/com.github.ocrmypdf.l22048pv/000001.text', 'pdf', 'txt']
DEBUG - 1: convert done
Completed Task = 'ocrmypdf.pipeline.select_image_layer'
Completed Task = 'ocrmypdf.pipeline.ocr_tesseract_textonly_pdf'
Task enters queue = 'ocrmypdf.pipeline.combine_layers'
Completed Task = 'ocrmypdf.pipeline.combine_layers'
Task enters queue = 'ocrmypdf.pipeline.merge_pages_ghostscript'
DEBUG - Final pages: /tmp/com.github.ocrmypdf.l22048pv/000001.rendered.pdf
/tmp/com.github.ocrmypdf.l22048pv/pdfa.ps
DEBUG - Ghostscript had to remove PDF 'overprinting' from the input file to complete PDF/A conversion.
Completed Task = 'ocrmypdf.pipeline.merge_pages_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.copy_final'
Completed Task = 'ocrmypdf.pipeline.copy_final'
INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 3.38× larger than the input file.
Possible reasons for this include:
The optional dependency PyMuPDF is not installed.
The argument --force-ocr was issued.


DEBUG - <PdfInfo('...'), page count=1>

l /home/jbrasil/
total 116K
-rw-r--r-- 1 root root 26K Sep 14 18:56 teste33.pdf
-rw-r--r-- 1 root root 86K Sep 15 09:01 teste33-v2.pdf

It just doesn't generate the OCR through the Alfresco platform.

Can you help?
Thank you.

kaynezhang
World-Class Innovator

How did you install Alfresco? Did you install it manually or with Docker?

Hi kaynezhang,
I installed using the loftuxab script.
alfinstall.sh

https://github.com/loftuxab/alfresco-ubuntu-install

I have always installed with this script and never had a problem.
This is the first time this type of error has occurred.
Is there anything else that needs to be investigated?

Thanks a lot.

kaynezhang
World-Class Innovator

Your installation is OK. The error suggests that the Python script can't load the tesseract library correctly. But you can run the command successfully directly in the shell, which is very strange.

Hi kaynezhang,
Very strange. We have other servers with Alfresco running the same version.
See the script:

/opt/alfresco/scripts

cat ocrmypdf.sh
#!/usr/bin/env bash
# set -o xtrace # Uncomment for debugging / troubleshooting
sudo ocrmypdf "$@"

In theory, the script is correct.
I do not know what happened...
Thanks.
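
The traceback in catalina.out is cut off at `load_entry_po`, so the actual Python exception never shows up in the Alfresco log. One way to recover it might be to have the wrapper script record the full stderr and a few environment details to a file. The following is only a sketch: `run_logged` is a hypothetical helper and the log path in the example is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: run a command, keep a full copy of its stderr in a log file,
# and still pass stderr and the exit code back to the caller.

# run_logged is a hypothetical helper name used for illustration.
run_logged() {
    local log_file err_tmp status
    log_file="$1"
    shift
    err_tmp="$(mktemp)"
    {
        echo "=== $(date) ==="
        echo "user=$(id -un) LANG=${LANG:-unset} PATH=$PATH"
        echo "command: $*"
    } >> "$log_file"
    status=0
    # Capture stderr in a temporary file so it can be written to two places.
    "$@" 2> "$err_tmp" || status=$?
    cat "$err_tmp" >> "$log_file"   # untruncated copy for later inspection
    cat "$err_tmp" >&2              # the caller still sees the error text
    rm -f "$err_tmp"
    echo "exit code: $status" >> "$log_file"
    return "$status"
}

# Possible use inside ocrmypdf.sh (log path is an assumption):
# run_logged /tmp/ocrmypdf-wrapper.log sudo ocrmypdf "$@"
```

Comparing the `user`, `LANG`, and `PATH` lines from a run triggered by Alfresco with those from a manual run could show whether the two environments differ.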