Show Tesseract OCR output PDF file looking like the original TIFF?

cperez
Champ in-the-making
Hi all!!

I installed Tesseract on my server to convert a TIFF file into a PDF file.

I use the following code in the Ubuntu terminal:


find . -maxdepth 1 -name "*.tif" -print0 | while IFS= read -r -d '' n; do
    tesseract "$n" "$n" -l eng hocr
    hocr2pdf -i "$n" -n -o "$n.pdf" < "$n.html"
done


This code only produces plain text in PDF format, similar to the original but not identical.
I want the PDF document to look identical to the scanned image (TIFF), with all the lines and images, but I have not found any way to do this.

Can anyone tell me how to do this with Tesseract or, failing that, with another free application?

Thanks a lot in advance!!
4 REPLIES

romschn
Star Collaborator
There is a freely available open source project for manipulating PDF files: http://addons.alfresco.com/addons/alfresco-pdf-toolkit. I have not explored it myself; I only read about it in another forum post. Its feature list mentions "TIFF to PDF transformation", so it may be of interest to you. Take a look and see whether it fits your requirement. Hope this helps.

cperez
Champ in-the-making
Hi romschn. Thanks for your reply, but I'm using Alfresco 3.0 and this add-on is only for version 3.2 or newer.

romschn
Star Collaborator
You may want to use the iText API to convert TIFF to PDF; I think the add-on I mentioned earlier uses the same API for that conversion.
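
For anyone who wants to stay with the tools from the original question, here is a minimal sketch of the same loop with one change. It assumes that the `-n` option of hocr2pdf means "no image" (do not place the scanned image over the text), which would explain why the resulting PDF shows only plain text. The function name `ocr_tiffs_to_pdf` is made up for illustration, and the fallback between `.hocr` and `.html` is an assumption about which extension the installed Tesseract version writes.

```shell
#!/usr/bin/env bash
# Sketch: OCR every TIFF in a directory and build a searchable PDF that keeps
# the scanned page image. Assumes tesseract and hocr2pdf (ExactImage) are on PATH.

ocr_tiffs_to_pdf() {
    local dir="${1:-.}" tif base hocr
    find "$dir" -maxdepth 1 -name '*.tif' -print0 |
    while IFS= read -r -d '' tif; do
        base="${tif%.tif}"
        # Run OCR; stdin is detached so the tool cannot consume the file list.
        tesseract "$tif" "$base" -l eng hocr < /dev/null || continue
        # Assumption: newer Tesseract writes "$base.hocr", older ones "$base.html".
        hocr="$base.hocr"
        [ -f "$hocr" ] || hocr="$base.html"
        # No -n here, so hocr2pdf keeps the scanned image in the output page.
        hocr2pdf -i "$tif" -o "$base.pdf" < "$hocr"
    done
}

# Usage (current directory): ocr_tiffs_to_pdf .
```

If the installed Tesseract is version 3.03 or newer, it may also be worth trying its built-in PDF output (`tesseract scan.tif scan -l eng pdf`), which should produce the page image with an invisible text layer in one step.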

krutik_jayswal
Elite Collaborator