08-15-2018 07:51 PM
Hello,
I'm using Alfresco 5.2 Community Edition on CentOS 7.5, and Alfresco itself works well.
Now I am trying to add OCR to Alfresco, so I installed alfresco-simple-ocr (simple-ocr-repo-2.3.1.jar) and pdfsandwich.
With pdfsandwich version 1.4 installed, the "Extract OCR" rule action runs and a version 1.1 PDF file is created automatically. However, every page of the OCR'd PDF is blank: no images, no characters.
Next, I installed pdfsandwich version 1.6 instead of version 1.4 and tried again. This time the "Extract OCR" rule action does not seem to run at all, and the version 1.1 PDF file is never created.
I also tried pdfsandwich versions 1.4, 1.5, 1.6 and 1.7 on the command line, and they all work well except version 1.7, which prints an error message. With versions 1.4, 1.5 and 1.6 the exit code is zero.
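The command-line test described above can be scripted so that the exit code and the output file are checked in one step. This is only a minimal sketch, not part of alfresco-simple-ocr: the function name run_ocr_check and the OCR_CMD variable are made up for illustration, the options are the ones from ocr.extra.commands in this post, and the argument order (options, input, -o, output) is an assumption about how the addon assembles the call.

```shell
#!/bin/bash
# Path taken from ocr.command in this post; set OCR_CMD to test another build.
OCR_CMD="${OCR_CMD:-/usr/local/bin/pdfsandwich}"

# run_ocr_check INPUT.pdf OUTPUT.pdf
# Runs the OCR command by hand, prints its exit code, and verifies that
# a non-empty output file was written.
run_ocr_check() {
  local input="$1" output="$2" status=0
  "$OCR_CMD" -verbose -rgb -lang jpn "$input" -o "$output" || status=$?
  echo "exit code: $status"
  if [ "$status" -ne 0 ]; then
    return "$status"
  fi
  if [ ! -s "$output" ]; then
    echo "output file missing or empty: $output" >&2
    return 1
  fi
  echo "output size: $(wc -c < "$output") bytes"
}
```

It would be called as run_ocr_check sample.pdf /tmp/sample_ocr.pdf, ideally as the same OS user that runs Tomcat. A zero exit code and a non-empty file still do not prove that the pages are not blank, so the result should also be opened in a PDF viewer.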
---------------------------------------------------------------------------
RULE DEFINITION
The attached file is a screenshot of the rule definition (in Japanese).
When items are created or enter this folder, OR when items are updated,
AND the MIME type is "Adobe PDF Document",
execute "Extract OCR".
- Continue on error: Checked
- Execute the rule in the background: Checked
---------------------------------------------------------------------------
/opt/alfresco-community/tomcat/shared/classes/alfresco-global.properties
:
:
### Alfresco Simple OCR ###
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -rgb -lang jpn
ocr.server.os=linux
---------------------------------------------------------------------------
Could anyone please help me?
07-02-2019 06:20 AM
So, to make it work on Alfresco/Share CE 6.1.2-ga/6.1.0, I created a shared volume between the alfresco and ocrmypdf containers. I replaced /ocr_input and /ocr_output with a single directory, /ocr, and mapped it as a volume for both containers.
The only problem is that asynchronous mode for the rule gives me an error, so I turned it off.
Thanks, Angel!
docker-compose.yml
...
services:
  alfresco:
    ...
    volumes:
      - ocr:/ocr
    ...
  ocrmypdf:
    ...
    volumes:
      - ocr:/ocr
    ...
volumes:
  ...
  ocr:
    driver: local
...
bin/ocrmypdf.sh
(and remove {} from $OUTPUT_FILE_PARAM in copy output file command)
#!/bin/bash
INPUT_DIR=/ocr
OUTPUT_DIR=/ocr
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters, input and output file
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}
LAST_ARGS="${@: -2}"
INPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 1`
OUTPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 2`
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters
SCP=cp
SSH=ssh
USER=root
# copy original pdf to the shared directory (SCP is plain cp here)
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
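The argument handling above keeps everything except the last two parameters as OCR options and treats the last two as the input and output paths. The same split can be sketched with bash arrays, which also copes with file names that contain spaces (the cut -d ' ' approach does not). split_args and OCR_OPTS are hypothetical names used only for this illustration; they are not part of the addon or of the script above.

```shell
#!/bin/bash
# split_args [OPTION...] INPUT OUTPUT
# Sets OCR_OPTS (array of options), INPUT_FILE_PARAM and OUTPUT_FILE_PARAM.
split_args() {
  if [ "$#" -lt 2 ]; then
    echo "usage: split_args [options...] input output" >&2
    return 1
  fi
  local n=$#
  # everything except the last two positional parameters
  OCR_OPTS=( "${@:1:n-2}" )
  # second-to-last and last positional parameters
  INPUT_FILE_PARAM="${@: -2:1}"
  OUTPUT_FILE_PARAM="${@: -1}"
}
```

After split_args "$@", the options can be passed on as "${OCR_OPTS[@]}" so that each one stays a separate word.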
06-18-2020 02:30 PM
With the approach suggested by Fedorow, I was able to make OCR work with Alfresco 6.1.0. I changed /ocr_input and /ocr_output to /usr/local/tomcat/ocr_input and /usr/local/tomcat/ocr_output so that the alfresco container can access these folders without any permission issues.
Thanks Fedorow
docker-compose.yml
...
services:
  alfresco:
    ...
    volumes:
      - ocr-input:/usr/local/tomcat/ocr_input
      - ocr-output:/usr/local/tomcat/ocr_output
    ...
  ocrmypdf:
    ...
    volumes:
      - ocr-input:/usr/local/tomcat/ocr_input
      - ocr-output:/usr/local/tomcat/ocr_output
    ...
volumes:
  ...
  ocr-input:
    external: true
  ocr-output:
    external: true
...
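Because both volumes are declared with external: true, Compose expects them to exist already and will not create them itself. A minimal sketch for creating them beforehand, assuming the plain docker CLI is available on the host; ensure_volume and the DOCKER variable are hypothetical helper names, not something from the original setup.

```shell
#!/bin/bash
# Docker CLI to use; override DOCKER if the binary lives elsewhere.
DOCKER="${DOCKER:-docker}"

# ensure_volume NAME
# Creates the named volume only if "docker volume inspect" cannot find it.
ensure_volume() {
  local name="$1"
  if "$DOCKER" volume inspect "$name" >/dev/null 2>&1; then
    echo "volume $name already exists"
  else
    "$DOCKER" volume create "$name" >/dev/null && echo "volume $name created"
  fi
}

# Usage, before starting the stack:
#   ensure_volume ocr-input
#   ensure_volume ocr-output
```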
bin/ocrmypdf.sh
#!/bin/bash
INPUT_DIR=/usr/local/tomcat/ocr_input
OUTPUT_DIR=/usr/local/tomcat/ocr_output
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters, input and output file
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}
LAST_ARGS="${@: -2}"
INPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 1`
OUTPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 2`
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters
SCP=cp
SSH=ssh
USER=root
# copy original pdf to the shared directory (SCP is plain cp here)
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
After the above changes I was able to successfully run OCR with Alfresco 6.1.
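Since the point of moving the folders was to avoid access problems, a quick write test inside each container can confirm that the mounted directories are usable by the user the process runs as. This is only a sketch; check_ocr_dir is a hypothetical helper name, and the probe file name is arbitrary.

```shell
#!/bin/bash
# check_ocr_dir DIR
# Succeeds if DIR exists and the current user can create a file in it.
check_ocr_dir() {
  local dir="$1" probe
  if [ ! -d "$dir" ]; then
    echo "missing: $dir" >&2
    return 1
  fi
  probe="$dir/.ocr_write_test.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    echo "writable: $dir"
  else
    echo "not writable: $dir" >&2
    return 1
  fi
}
```

Run inside the alfresco container, it would be called as check_ocr_dir /usr/local/tomcat/ocr_input and check_ocr_dir /usr/local/tomcat/ocr_output.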
As we are running our Alfresco instance on Kubernetes with a Helm deployment, I need to configure these volumes in the values.yaml file, but I am not sure how to do that. Does anyone have an idea how to make a similar configuration in Kubernetes?
Any help is appreciated.
08-17-2018 10:30 AM
Hello.
Check the following link: FAQ · keensoft/alfresco-simple-ocr Wiki · GitHub
Maybe that will help you.
08-20-2018 05:33 PM
Hi,
Thank you for your information.
I hadn't read the FAQ page yet, so I will read it carefully and try to improve my environment.
I hope for a good result.
Best regards,
08-20-2018 10:28 PM
Hello,
My problem has been partly solved.
After reading the FAQ, I installed 2 jar files (simple-ocr-repo-2.3.1.jar and simple-ocr-share-2.3.1.jar) instead of simple-ocr-repo.amp. (I had been using the amp file.)
After restarting Alfresco, the "Extract OCR" action sometimes works and sometimes does not.
When the action (conversion) succeeds, the "tesseract", ".convers.b+" and "unpaper" processes are visible in "top".
When the action (conversion) fails, these processes appear briefly and then disappear.
File size and page count do not seem to be related to the failures.
I have no idea how to solve this problem.
If anyone knows the solution, please let me know!
------
- CentOS 7.5
- Alfresco 5.2 - community edition
- alfresco-simple-ocr 2.3.1
- pdfsandwich is 1.6 (*1)
- tesseract 3.04
(*1)
When I tested version 1.7 again, pdfsandwich printed the same message as before:
> "Fatal error: exception Unix.Unix_error(Unix.ENOTEMPTY, "rmdir", "/tmp/pdfsandwich_tmp2d3ca3")"
Given that, I use version 1.7 with the "-debug" option to avoid the error. (The temporary files then have to be deleted manually.)
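The manual cleanup mentioned above could be automated along these lines. This is a minimal sketch that assumes the leftover directories follow the /tmp/pdfsandwich_tmp* naming seen in the error message; cleanup_pdfsandwich_tmp is a hypothetical function name, and the 60-minute default is an arbitrary choice meant to avoid touching a conversion that is still running.

```shell
#!/bin/bash
# cleanup_pdfsandwich_tmp [ROOT] [MIN_AGE_MINUTES]
# Removes pdfsandwich_tmp* directories directly under ROOT (default /tmp)
# that were last modified more than MIN_AGE_MINUTES ago (default 60).
cleanup_pdfsandwich_tmp() {
  local root="${1:-/tmp}" min_age="${2:-60}"
  find "$root" -maxdepth 1 -type d -name 'pdfsandwich_tmp*' \
    -mmin "+$min_age" -exec rm -rf {} +
}
```

Calling cleanup_pdfsandwich_tmp with no arguments, for example from a cron job, would clear directories under /tmp that are older than one hour.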
08-22-2018 03:04 AM
Try using OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) instead of pdfsandwich.
Both pdfsandwich and OCRmyPDF have some issues on CentOS (they are developed for Ubuntu), but you can use the Docker Image for OCRmyPDF available at https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-the-docker-image