08-15-2018 10:51 PM
Hello,
I'm using Alfresco 5.2 Community Edition on CentOS 7.5, and it works well on its own.
Now I am trying to add OCR to Alfresco, so I installed alfresco-simple-ocr (simple-ocr-repo-2.3.1.jar) and pdfsandwich.
With pdfsandwich version 1.4 installed, the "Extract OCR" rule action does run, and a version 1.1 PDF file is created automatically. But every page of the OCR PDF is blank: no images, no characters.
Next I replaced pdfsandwich version 1.4 with version 1.6 and tried again. This time the "Extract OCR" rule action does not seem to run, and the version 1.1 PDF file is never created.
I tried pdfsandwich versions 1.4, 1.5, 1.6 and 1.7 on the command line, and they all work well except version 1.7 (version 1.7 prints an error message on the command line). With versions 1.4, 1.5 and 1.6, the exit code is zero.
---------------------------------------------------------------------------
RULE DEFINITION
The attached file is a screenshot of the rule definition (in Japanese).
When an item is created in or enters this folder, OR when an item is updated,
AND the MIME type is "Adobe PDF Document",
execute "Extract OCR".
- Continue on error: Checked
- Execute the rule in the background: Checked
---------------------------------------------------------------------------
/opt/alfresco-community/tomcat/shared/classes/alfresco-global.properties
:
:
### Alfresco Simple OCR ###
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -rgb -lang jpn
ocr.server.os=linux
---------------------------------------------------------------------------
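A minimal sketch for reproducing the command-line test described above, so that a zero exit code and a usable output file are checked together. The function name `check_ocr_run` and the argument order (options first, then `-o` and the output, then the input) are assumptions for illustration; the exact command line that alfresco-simple-ocr builds from these properties is not shown in this thread.

```shell
#!/bin/bash
# check_ocr_run OCR_CMD INPUT_PDF OUTPUT_PDF [EXTRA_ARGS...]
# Hypothetical helper: runs the OCR command and reports whether it
# exited with zero AND left a non-empty output file behind.
check_ocr_run() {
  local ocr_cmd="$1" input="$2" output="$3"
  shift 3
  local status=0
  # Assumed argument order: extra options, then -o OUTPUT, then INPUT.
  "$ocr_cmd" "$@" -o "$output" "$input" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "exit code $status"
    return 1
  fi
  if [ ! -s "$output" ]; then
    echo "exit code 0 but output missing or empty"
    return 1
  fi
  echo "ok"
}
```

With the settings above it could be called as `check_ocr_run /usr/local/bin/pdfsandwich in.pdf out.pdf -verbose -rgb -lang jpn`, ideally as the same OS user that runs Alfresco, since that user's environment may differ from an interactive shell.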
Can anyone please help me?
07-02-2019 09:20 AM
So, to make it work on Alfresco/Share CE 6.1.2-ga/6.1.0, I created a shared volume between the alfresco and ocrmypdf containers. I replaced /ocr_input and /ocr_output with a single directory, /ocr, and mapped it as a volume for both containers.
The only problem: asynchronous mode for the rule gives me an error, so I turned it off.
Thanks, Angel!
docker-compose.yml
...
services:
  alfresco:
    ...
    volumes:
      - ocr:/ocr
  ...
  ocrmypdf:
    ...
    volumes:
      - ocr:/ocr
...
volumes:
  ...
  ocr:
    driver: local
...
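A small sketch for confirming that the shared directory is really mounted and writable inside a container. The function name `check_shared_dir` is hypothetical, and it assumes `mktemp` is available in the container image.

```shell
#!/bin/bash
# check_shared_dir DIR
# Hypothetical helper: reports whether DIR exists and whether the
# current user can create a file in it.
check_shared_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  local probe
  # Try to create (and then remove) a throwaway file in the directory.
  if ! probe=$(mktemp "$dir/probe.XXXXXX" 2>/dev/null); then
    echo "not writable: $dir"
    return 1
  fi
  rm -f "$probe"
  echo "ok: $dir"
}
```

Running `check_shared_dir /ocr` in a shell inside both the alfresco and the ocrmypdf containers should report `ok` in each if the volume mapping took effect.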
bin/ocrmypdf.sh
(also remove the {} from $OUTPUT_FILE_PARAM in the copy-output-file command)
#!/bin/bash
INPUT_DIR=/ocr
OUTPUT_DIR=/ocr
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters, input and output file:
# the last two arguments are the input and output paths, the rest are OCR options
array=( "$@" )
len=${#array[@]}
ARGS="${array[*]:0:$len-2}"
INPUT_FILE_PARAM="${@: -2:1}"
OUTPUT_FILE_PARAM="${@: -1}"
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters (SCP is plain cp because /ocr is a volume shared by both containers)
SCP=cp
SSH=ssh
USER=root
# copy original pdf to ocrmypdf server
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH "$USER@$OCRMYPDF_SERVER" "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
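The argument handling in the script above (everything except the last two arguments is treated as OCR options, then come the input path and the output path) can be tried in isolation. This is a sketch only; `split_ocr_args` is a hypothetical name and not part of alfresco-simple-ocr.

```shell
#!/bin/bash
# split_ocr_args [OPTIONS...] INPUT OUTPUT
# Hypothetical helper: prints how a wrapper script would split its
# arguments into OCR options, input path and output path.
split_ocr_args() {
  if [ "$#" -lt 2 ]; then
    echo "usage: split_ocr_args [OPTIONS...] INPUT OUTPUT" >&2
    return 2
  fi
  local input="${@: -2:1}"          # second-to-last argument
  local output="${@: -1}"           # last argument
  local opts=( "${@:1:$#-2}" )      # everything before the last two
  printf 'options=%s\ninput=%s\noutput=%s\n' "${opts[*]}" "$input" "$output"
}
```

Taking the paths directly from the positional parameters, rather than through `echo ... | cut -d ' '`, keeps a path intact even if it contains a space.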
06-18-2020 05:30 PM
With the approach suggested by Fedorow, I was able to make OCR work with Alfresco 6.1.0. I changed /ocr_input and /ocr_output to /usr/local/tomcat/ocr_input and /usr/local/tomcat/ocr_output so that the alfresco container can access these folders without any access issues.
Thanks, Fedorow.
docker-compose.yml
...
services:
  alfresco:
    ...
    volumes:
      - ocr-input:/usr/local/tomcat/ocr_input
      - ocr-output:/usr/local/tomcat/ocr_output
  ...
  ocrmypdf:
    ...
    volumes:
      - ocr-input:/usr/local/tomcat/ocr_input
      - ocr-output:/usr/local/tomcat/ocr_output
...
volumes:
  ...
  ocr-input:
    external: true
  ocr-output:
    external: true
...
bin/ocrmypdf.sh
#!/bin/bash
INPUT_DIR=/usr/local/tomcat/ocr_input
OUTPUT_DIR=/usr/local/tomcat/ocr_output
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters, input and output file:
# the last two arguments are the input and output paths, the rest are OCR options
array=( "$@" )
len=${#array[@]}
ARGS="${array[*]:0:$len-2}"
INPUT_FILE_PARAM="${@: -2:1}"
OUTPUT_FILE_PARAM="${@: -1}"
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters (SCP is plain cp because the directories are volumes shared by both containers)
SCP=cp
SSH=ssh
USER=root
# copy original pdf to ocrmypdf server
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH "$USER@$OCRMYPDF_SERVER" "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
After the above changes I was able to run OCR successfully with Alfresco 6.1.
As we run our Alfresco instance on Kubernetes with a Helm deployment, I need to configure the volumes in the values.yaml file, but I am not sure how. Does anyone have an idea how to make a similar configuration in Kubernetes?
Any help appreciated.
06-27-2019 11:33 AM
I tried to work through this solution. My deployment:
Alfresco 6.1.2-ga / Share 6.1.0
jbarlow83/ocrmypdf:v8.2.3 or v7.0.0
api-explorer-6.1.0-ea.war or 6.0.7-ga
And I got "Failed to copy".
The file /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468.pdf exists, but /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf does not.
I think I should change
INPUT_DIR=/ocr_input
OUTPUT_DIR=/ocr_output
but I don't understand how. The "ocrmypdf" container doesn't contain these directories.
Log:
alfresco_1 | Exception in thread "defaultAsyncAction1" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
alfresco_1 | at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
alfresco_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
alfresco_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
alfresco_1 | at java.base/java.lang.Thread.run(Thread.java:834)
alfresco_1 | Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
alfresco_1 | ... 10 more
alfresco_1 | Caused by: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at org.alfresco.repo.content.AbstractContentWriter.putContent(AbstractContentWriter.java:491)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:83)
alfresco_1 | ... 11 more
alfresco_1 | Caused by: java.io.FileNotFoundException: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf (No such file or directory)
alfresco_1 | at java.base/java.io.FileInputStream.open0(Native Method)
alfresco_1 | at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
alfresco_1 | at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
alfresco_1 | at org.alfresco.repo.content.AbstractContentWriter.putContent(AbstractContentWriter.java:485)
alfresco_1 | ... 12 more
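The FileNotFoundException at the bottom of the trace shows that the `_ocr.pdf` output file was never written to the Alfresco temp directory. A guard like the following, placed in the wrapper script before the copy-back step, would report that at the point of failure instead of inside Alfresco. This is a sketch; `copy_back` is a hypothetical helper name.

```shell
#!/bin/bash
# copy_back PRODUCED_FILE TARGET_FILE
# Hypothetical helper: copies the OCR result to the path Alfresco expects,
# but fails with a clear message if the OCR step produced nothing.
copy_back() {
  local produced="$1" target="$2"
  if [ ! -s "$produced" ]; then
    echo "OCR output missing or empty: $produced" >&2
    return 1
  fi
  cp "$produced" "$target"
}
```

In the wrapper script this would stand in for the plain copy of `$OUTPUT_DIR/$OUTPUT_FILE` to `$OUTPUT_FILE_PARAM`, so that a non-zero exit status is returned whenever the OCR container did not deliver a file.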
06-22-2020 07:00 AM
Hi @SriramG,
Thanks for updating us on how you resolved your issue - really helpful.
Maybe start a new thread for your question about configuring volumes?
Cheers,
08-22-2018 09:48 PM
Hello,
Thank you for your suggestion.
Unfortunately, I'm not familiar with Docker.
I tried to install OCRmyPDF, but I couldn't.
So I'm going to keep struggling with pdfsandwich.
08-23-2018 01:23 AM
I don’t know if this still works, as I haven’t tested it recently, but you can find a reference for installing pdfsandwich on CentOS 7 at https://github.com/keensoft/alfresco-simple-ocr/blob/master/docker/pdfsandwich-1.6-centos-7/Dockerfi...
08-28-2018 03:07 AM
I tried to install "pdfsandwich" and "OCRmyPDF" (with Docker), but I couldn't set them up properly.
It is a pity, but I am giving up.
Thank you very much for your suggestions and information.