08-15-2018 10:51 PM
Hello,
I'm using Alfresco 5.2 Community Edition on CentOS 7.5, and Alfresco itself works well.
Now I am trying to add OCR to Alfresco, so I installed alfresco-simple-ocr (simple-ocr-repo-2.3.1.jar) and pdfsandwich.
With pdfsandwich version 1.4 installed, the "Extract OCR" action triggered by the rule does run, and version 1.1 of the PDF file is created automatically. But every page of the OCR PDF is blank: no images, no characters.
Next I installed pdfsandwich version 1.6 instead of version 1.4 and tried again. This time the "Extract OCR" action does not seem to run at all, and version 1.1 of the PDF file is never created.
I tried pdfsandwich versions 1.4, 1.5, 1.6 and 1.7 on the command line, and they all work well except version 1.7, which prints an error message. With versions 1.4, 1.5 and 1.6, the exit code is zero.
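A command-line check like the one described here could be scripted roughly as follows. This is only a minimal sketch: check_ocr_run is a hypothetical helper name, the paths in the example are placeholders, and the check covers only the exit code and the presence of a non-empty output file, so it would not by itself detect an output PDF whose pages are blank.

```shell
#!/bin/bash
# Hypothetical helper: run an OCR command and report its exit code and
# whether the expected output file exists and is non-empty.
# Usage: check_ocr_run OUTPUT_FILE COMMAND [ARGS...]
check_ocr_run() {
    local output_file=$1
    shift
    local status=0
    # Run the command exactly as given and remember a non-zero exit code.
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILED: exit code $status"
        return 1
    fi
    if [ ! -s "$output_file" ]; then
        echo "FAILED: exit code 0 but $output_file is missing or empty"
        return 1
    fi
    echo "OK: exit code 0, $output_file created"
}

# Example (placeholder paths; options taken from the configuration in this post):
# check_ocr_run /tmp/out.pdf /usr/local/bin/pdfsandwich -verbose -rgb -lang jpn /tmp/in.pdf -o /tmp/out.pdf
```

Running the check as the same operating-system user that runs Alfresco may be more informative than running it as root, since permissions and environment can differ.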
---------------------------------------------------------------------------
RULE DEFINITION
The attached file is a screenshot of the rule definition (in Japanese).
When an item is created in or enters this folder, OR when an item is updated,
AND the MIME type is "Adobe PDF Document",
execute "Extract OCR".
- Continue on error: Checked
- Execute the rule in the background: Checked
---------------------------------------------------------------------------
/opt/alfresco-community/tomcat/shared/classes/alfresco-global.properties
:
:
### Alfresco Simple OCR ###
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -rgb -lang jpn
ocr.server.os=linux
---------------------------------------------------------------------------
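To reproduce by hand what the add-on runs, it can help to see how these properties might combine into one command line. The sketch below is an assumption, not the add-on's actual code: the ordering (command, extra options, input file, output prefix, output file) is a guess, and build_ocr_command is a hypothetical helper that is not part of alfresco-simple-ocr.

```shell
#!/bin/bash
# Values copied from the alfresco-global.properties excerpt above.
OCR_COMMAND="/usr/local/bin/pdfsandwich"
OCR_EXTRA_COMMANDS="-verbose -rgb -lang jpn"
OCR_OUTPUT_PREFIX="-o"

# Hypothetical helper: print the command line that would result for one file.
# ASSUMPTION: order is command, extra options, input, prefix, output.
build_ocr_command() {
    local input=$1 output=$2
    # An empty prefix is skipped so no double space ends up in the command.
    if [ -n "$OCR_OUTPUT_PREFIX" ]; then
        echo "$OCR_COMMAND $OCR_EXTRA_COMMANDS $input $OCR_OUTPUT_PREFIX $output"
    else
        echo "$OCR_COMMAND $OCR_EXTRA_COMMANDS $input $output"
    fi
}

# Placeholder file names, for illustration only.
build_ocr_command /tmp/in.pdf /tmp/out.pdf
```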
Can anyone please help me?
Labels: Alfresco Content Services
07-02-2019 09:20 AM
To make it work on Alfresco/Share CE 6.1.2-ga/6.1.0, I created a shared volume between the alfresco and ocrmypdf containers. I replaced /ocr_input and /ocr_output with a single directory, /ocr, and mapped it as a volume for both containers.
There is only one problem: asynchronous mode for the rule gives me an error, so I turned it off.
Thanks, Angel!
docker-compose.yml
...
services:
  alfresco:
    ...
    volumes:
      - ocr:/ocr
    ...
  ocrmypdf:
    ...
    volumes:
      - ocr:/ocr
    ...
volumes:
  ...
  ocr:
    driver: local
...
bin/ocrmypdf.sh
(also remove the {} around $OUTPUT_FILE_PARAM in the command that copies the output file)
#!/bin/bash
INPUT_DIR=/ocr
OUTPUT_DIR=/ocr
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters, input and output file
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}
LAST_ARGS="${@: -2}"
INPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 1`
OUTPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 2`
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters
SCP=cp
SSH=ssh
USER=root
# copy original pdf to ocrmypdf server
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
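The argument handling in the script above could also be written with array indexing, which avoids splitting on spaces with cut. This is a sketch of an alternative, not the script that was actually run; split_ocr_args and the OCR_OPTIONS array are illustrative names, and the file names in the example are placeholders.

```shell
#!/bin/bash
# Illustrative alternative to the echo/cut parsing above: the last two
# arguments are the input and output files, and everything before them
# is kept as options.
split_ocr_args() {
    if [ "$#" -lt 2 ]; then
        echo "usage: split_ocr_args [OPTIONS...] INPUT OUTPUT" >&2
        return 1
    fi
    local args=( "$@" )
    local n=${#args[@]}
    # Slice off everything except the last two elements.
    OCR_OPTIONS=( "${args[@]:0:n-2}" )
    INPUT_FILE_PARAM=${args[n-2]}
    OUTPUT_FILE_PARAM=${args[n-1]}
}

split_ocr_args -j1 --skip-text "/tmp/my input.pdf" /tmp/output.pdf
echo "options: ${OCR_OPTIONS[*]}"
echo "input:   $INPUT_FILE_PARAM"
echo "output:  $OUTPUT_FILE_PARAM"
```

Because each argument stays one array element, a path that contains a space is still treated as a single file name.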
06-18-2020 05:30 PM
With the approach suggested by Fedorow, I was able to make OCR work with Alfresco 6.1.0. I changed /ocr_input and /ocr_output to /usr/local/tomcat/ocr_input and /usr/local/tomcat/ocr_output so that the alfresco container can access these folders without any access issues.
Thanks, Fedorow.
docker-compose.yml
...
services:
  alfresco:
    ...
    volumes:
      - ocr-input:/usr/local/tomcat/ocr_input
      - ocr-output:/usr/local/tomcat/ocr_output
    ...
  ocrmypdf:
    ...
    volumes:
      - ocr-input:/usr/local/tomcat/ocr_input
      - ocr-output:/usr/local/tomcat/ocr_output
    ...
volumes:
  ...
  ocr-input:
    external: true
  ocr-output:
    external: true
...
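Because both volumes are declared with external: true, Docker Compose expects them to exist already and does not create them. A minimal sketch for creating them beforehand follows; ensure_volume is a hypothetical helper name, and the volume names are the ones from the compose file above.

```shell
#!/bin/bash
# Create a named Docker volume only if it does not exist yet.
ensure_volume() {
    local name=$1
    # "docker volume inspect" exits non-zero when the volume is unknown.
    if docker volume inspect "$name" >/dev/null 2>&1; then
        echo "volume $name already exists"
    else
        docker volume create "$name" >/dev/null && echo "volume $name created"
    fi
}

# Run before "docker-compose up":
# ensure_volume ocr-input
# ensure_volume ocr-output
```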
bin/ocrmypdf.sh
#!/bin/bash
INPUT_DIR=/usr/local/tomcat/ocr_input
OUTPUT_DIR=/usr/local/tomcat/ocr_output
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters, input and output file
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}
LAST_ARGS="${@: -2}"
INPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 1`
OUTPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 2`
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters
SCP=cp
SSH=ssh
USER=root
# copy original pdf to ocrmypdf server
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
After the above changes I was able to run OCR successfully with Alfresco 6.1.
We run our Alfresco instance on Kubernetes with a Helm deployment, so I need to configure these volumes in the values.yaml file, but I am not sure how. Does anyone know how to make a similar configuration in Kubernetes?
Any help is appreciated.
08-17-2018 01:30 PM
Hello.
Check the following link: FAQ · keensoft/alfresco-simple-ocr Wiki · GitHub
Maybe it can help you.
08-20-2018 08:33 PM
Hi,
Thank you for the information.
I had not read the FAQ page, so I will read it carefully and try to improve my environment.
I hope for a good result.
Best regards,
08-21-2018 01:28 AM
Hello,
My problem has been partly solved.
After reading the FAQ, I installed the two jar files (simple-ocr-repo-2.3.1.jar and simple-ocr-share-2.3.1.jar) instead of simple-ocr-repo.amp. (I had been using the amp file.)
After restarting Alfresco, the "Extract OCR" action sometimes works and sometimes does not.
When the conversion succeeds, the "tesseract", ".convers.b+" and "unpaper" processes can be seen running in "top".
When the conversion fails, these processes appear briefly and soon disappear.
File size and number of pages seem to be unrelated.
I have no idea how to solve this problem.
If anyone knows the solution, please let me know!
------
- CentOS 7.5
- Alfresco 5.2 - community edition
- alfresco-simple-ocr 2.3.1
- pdfsandwich 1.6 (*1)
- tesseract 3.04
(*1)
When I tested version 1.7 again, pdfsandwich printed the following message, as before.
> "Fatal error: exception Unix.Unix_error(Unix.ENOTEMPTY, "rmdir", "/tmp/pdfsandwich_tmp2d3ca3")"
For that reason, I run version 1.7 with the "-debug" option to avoid the error. (The temporary files then have to be deleted manually.)
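The manual cleanup of those temporary files could be scripted. The following is a sketch under assumptions: it supposes the leftover directories are named pdfsandwich_tmp* directly under /tmp, as in the error message above; clean_pdfsandwich_tmp is a hypothetical helper; and the 60-minute age threshold is an arbitrary choice meant to leave conversions that are still running alone.

```shell
#!/bin/bash
# Remove leftover pdfsandwich temp directories older than a given age.
# Usage: clean_pdfsandwich_tmp [BASE_DIR] [MIN_AGE_MINUTES]
clean_pdfsandwich_tmp() {
    local base_dir=${1:-/tmp}
    local min_age=${2:-60}
    # -maxdepth 1 keeps the search to direct children of BASE_DIR;
    # -mmin +N matches entries last modified more than N minutes ago.
    find "$base_dir" -maxdepth 1 -type d -name 'pdfsandwich_tmp*' \
        -mmin "+$min_age" -exec rm -rf {} +
}

# Example: clean_pdfsandwich_tmp /tmp 60
```

Such a function could be called from a cron job, though that scheduling is outside what this thread describes.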
08-22-2018 06:04 AM
Try using OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) instead of pdfsandwich.
Both pdfsandwich and OCRmyPDF have some issues on CentOS (they are developed for Ubuntu), but you can use the Docker Image for OCRmyPDF available at https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-the-docker-image
08-28-2018 03:53 AM
Thank you for the information.
I tried to install OCRmyPDF using Docker and partly succeeded.
On the command line, conversion succeeds when run from the directory that contains the input file.
From any other directory, however, it does not work.
> ERROR - File not found - /home/hisayo-s/AAAAA.pdf
I am giving up on this attempt.
Thanks a lot.
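One possible explanation for the "File not found" error, which is an assumption and not something confirmed in this thread, is that a container only sees host directories that are mounted into it. A sketch of a wrapper that mounts the directory holding the input file follows. The ocr_in_docker name and the /data mount point are hypothetical, the image is assumed to be jbarlow83/ocrmypdf, and the output is written next to the input file.

```shell
#!/bin/bash
# Hypothetical wrapper: run OCRmyPDF from the Docker image on a file that
# may live in any host directory, by mounting that directory into the container.
# ASSUMPTIONS: image name jbarlow83/ocrmypdf, mount point /data, and the
# image passes its arguments straight to ocrmypdf.
# Usage: ocr_in_docker INPUT_PATH OUTPUT_NAME
ocr_in_docker() {
    local input=$1 output_name=$2
    local dir
    # Resolve the input file's directory to an absolute path for -v.
    dir=$(cd "$(dirname "$input")" && pwd) || return 1
    docker run --rm -v "$dir:/data" jbarlow83/ocrmypdf \
        "/data/$(basename "$input")" "/data/$output_name"
}

# Example: ocr_in_docker /home/hisayo-s/AAAAA.pdf AAAAA-ocr.pdf
```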
08-28-2018 04:17 AM
I'm currently using Docker Compose as the base for my installations, so I can only give you some tips on how to configure the whole thing with Docker.
OCRmyPDF Dockerfile
FROM jbarlow83/ocrmypdf:v7.0.0
USER root
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo 'root:screencast' | chpasswd
RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
# SSH login fix. Otherwise user is kicked off after login
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile
COPY assets/ssh/id_rsa.pub /root/.ssh/id_rsa.pub
COPY assets/ocr.sh /usr/bin/ocr.sh
RUN cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys \
 && chmod 0600 /root/.ssh/authorized_keys \
 && chmod +x /usr/bin/ocr.sh
EXPOSE 22
ENTRYPOINT ["/usr/sbin/sshd", "-D"]
assets/ocr.sh
#!/bin/bash
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
/usr/bin/ocrmypdf $@
Alfresco Dockerfile
FROM alfresco/alfresco-content-repository-community:6.0.7-ga
ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8
# Extra software
RUN set -x \
 && yum install -y \
    wget \
    unzip \
 && yum clean all
# Install api-explorer webapp for REST API
RUN set -x \
 && wget https://artifacts.alfresco.com/nexus/service/local/repositories/releases/content/org/alfresco/api-explorer/6.0.7-ga/api-explorer-6.0.7-ga.war -O /usr/local/tomcat/webapps/api-explorer.war
ARG TOMCAT_DIR=/usr/local/tomcat
RUN mkdir -p $TOMCAT_DIR/amps
# Install AOS
RUN set -x \
 && mkdir /tmp/aos \
 && wget --no-check-certificate https://download.alfresco.com/cloudfront/release/community/201806-GA-build-00113/alfresco-aos-module-distributionzip-1.2.0.zip \
 && unzip alfresco-aos-module-distributionzip-1.2.0.zip -d /tmp/aos \
 && mv /tmp/aos/extension/* /usr/local/tomcat/shared/classes/alfresco/extension \
 && mv /tmp/aos/alfresco-aos-module-1.2.0.amp amps \
 && mv /tmp/aos/aos-module-license.txt licenses \
 && mv /tmp/aos/_vti_bin.war /usr/local/tomcat/webapps \
 && rm -rf /tmp/aos alfresco-aos-module-distributionzip-1.2.0.zip
# SSH keys for ocrmypdf
COPY ssh/ /root/.ssh/
# Install OCR
COPY bin/ /opt/alfresco/bin/
# Configure SSH Client
RUN set -x && \
    chmod +x /opt/alfresco/bin/ocrmypdf.sh && \
    # Configure ssh
    yum install -y openssh-clients && \
    echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config && \
    # Alfresco Image is using POSIX as Locale (!)
    sed -i '/^\s*SendEnv/ d' /etc/ssh/ssh_config && \
    chmod 600 /root/.ssh/id_rsa
# Install modules and addons
COPY modules/amps $TOMCAT_DIR/amps
COPY modules/jars $TOMCAT_DIR/webapps/alfresco/WEB-INF/lib
RUN java -jar $TOMCAT_DIR/alfresco-mmt/alfresco-mmt*.jar install \
    $TOMCAT_DIR/amps $TOMCAT_DIR/webapps/alfresco -directory -nobackup -force
# Add services configuration to alfresco-global.properties
COPY conf/alfresco-global.properties /usr/local/tomcat/shared/classes/alfresco-global.properties
EXPOSE 21 143 25 445 137/udp 138/udp 139
bin/ocrmypdf.sh
#!/bin/bash
INPUT_DIR=/ocr_input
OUTPUT_DIR=/ocr_output
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters, input and output file
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}
LAST_ARGS="${@: -2}"
INPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 1`
OUTPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 2`
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters
SCP=cp
SSH=ssh
USER=root
# copy original pdf to ocrmypdf server
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "${OUTPUT_FILE_PARAM}"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
conf/alfresco-global.properties
(Only OCRmyPDF section)
## simple-ocr
# https://github.com/keensoft/alfresco-simple-ocr
ocr.command=/opt/alfresco/bin/ocrmypdf.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=
# https://github.com/jbarlow83/OCRmyPDF/issues/124
ocr.extra.commands=-j1 --author keensoft --rotate-pages -l spa+eng+fra --deskew --clean --skip-text
ocr.server.os=linux
08-29-2018 08:20 PM
Thank you for your kindness.
However, my environment consists of CentOS 7, Alfresco 5.2 and OCRmyPDF (Docker).
The scripts you posted do not match my environment.
As I am very new to Docker, I don't know how to adapt the scripts.
08-29-2018 09:25 PM
Comparing pdfsandwich with OCRmyPDF, pdfsandwich's character recognition quality for Japanese is better than OCRmyPDF's.
So I will focus on using pdfsandwich.
Thank you very much for your help.
08-30-2018 02:53 AM
Did you test with these instructions?
I don't know if they still work with the latest CentOS releases, but they can be a starting point.