Can PDFDataProvider create a text PDF?

Michael_Butt1
Star Contributor

I have a Unity Script that takes a text document in OnBase and writes it to a PDF using the PDFDataProvider. The resulting PDF appears to be image-based, or at least it is not text searchable. It is also quite a bit larger than the original text file imported into OnBase. Can the PDFDataProvider write out a text-based PDF? I can't seem to find a method in the SDK to do so.

My current code is:

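' Get the PDF data provider from the Unity application object.
Dim pdfProvider As PDFDataProvider = app.Core.Retrieval.PDF

' Retrieve the rendition as PDF page data and write its stream to disk.
Using pageData As PageData = pdfProvider.GetDocument(rendition)
    Using stream As Stream = pageData.Stream
        Utility.WriteStreamToFile(stream, path)
    End Using
End Using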

I didn't see anything in PDFGetDocumentProperties that seemed to do this.

2 ACCEPTED ANSWERS

Scott_McLean
Elite Collaborator

Natively, the PDFDataProvider produces image-only PDF files when converting supported file types. There is no API setting to modify this.

As James mentioned, if you're licensed for full-page OCR (batch or ad-hoc), you can create a text-searchable PDF rendition that way, but not directly through the API. You can, however, create a scan batch through the API and push the document into a scan queue configured for OCR in order to automate part of the process. Likewise, you could use workflow (with the "Queue Document for OCR" action) to place the document in the "Awaiting Ad Hoc OCR" queue.


Michael_Butt1
Star Contributor

I'm marking Scott's message from the comment to my post as an answer: No, the PDFDataProvider does not write text PDFs. Full-page OCR can be used through a scan queue to accomplish text searching.


4 REPLIES

James_Chauncey
Confirmed Champ
Michael,
I am unfamiliar with the API at this time, so I am not sure whether this will resolve your issue. To make a PDF text searchable, I recently discovered that I had to create or modify an OCR Format, change its Output Format to PDF (Image with Searchable Text), and then apply that format to the Document Type. This will make any rendition created a searchable PDF.
Good luck
James

Michael_Butt1
Star Contributor
Thank you both, that answers my question. I will look into using an external library. Part of the issue is file size: the PDFs nearly double in size when printing from a text file into an image PDF. I'm trying to get them down to a size that can be emailed, per the customer's request.
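
For anyone weighing the external-library route, here is a minimal sketch of what a text-based PDF involves. It does not use the PDFDataProvider or any OnBase API. It writes the PDF structure by hand with standard .NET classes only, and it assumes the document's plain text is already available as a list of lines. The names `TextPdfSketch`, `BuildTextPdf`, `EscapePdfText`, `WriteAscii`, and `linesPerPage` are hypothetical, chosen for illustration. It also assumes ASCII text, US Letter pages, and the built-in Courier font. Non-ASCII characters become `?`, and lines longer than roughly 90 characters run off the right edge. A real PDF library would handle wrapping, encodings, and compression.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

' Illustrative only: a hand-written PDF builder, not part of the Hyland Unity API.
Public Module TextPdfSketch

    ' Escape the three characters that are special inside a PDF literal string.
    Private Function EscapePdfText(text As String) As String
        Return text.Replace("\", "\\").Replace("(", "\(").Replace(")", "\)")
    End Function

    ' Write a string to the stream as ASCII bytes (non-ASCII characters become "?").
    Private Sub WriteAscii(target As Stream, text As String)
        Dim bytes As Byte() = Encoding.ASCII.GetBytes(text)
        target.Write(bytes, 0, bytes.Length)
    End Sub

    ' Build a PDF whose pages contain text operators rather than page images.
    Public Function BuildTextPdf(lines As IList(Of String),
                                 Optional linesPerPage As Integer = 60) As Byte()
        Dim pageCount As Integer = Math.Max(1, (lines.Count + linesPerPage - 1) \ linesPerPage)

        ' Objects 1-3 are the catalog, page tree, and font. Page p uses
        ' object 4 + 2p for the page and 5 + 2p for its content stream.
        Dim kids As New StringBuilder()
        For p As Integer = 0 To pageCount - 1
            kids.AppendFormat("{0} 0 R ", 4 + 2 * p)
        Next

        Dim bodies As New List(Of String)()
        bodies.Add("<< /Type /Catalog /Pages 2 0 R >>")
        bodies.Add(String.Format("<< /Type /Pages /Kids [ {0}] /Count {1} >>", kids, pageCount))
        bodies.Add("<< /Type /Font /Subtype /Type1 /BaseFont /Courier >>")

        For p As Integer = 0 To pageCount - 1
            ' 10 pt Courier, 12 pt leading, starting near the top-left of the page.
            Dim content As New StringBuilder("BT /F1 10 Tf 12 TL 50 742 Td" & vbLf)
            Dim last As Integer = Math.Min(lines.Count, (p + 1) * linesPerPage) - 1
            For i As Integer = p * linesPerPage To last
                content.Append("(" & EscapePdfText(lines(i)) & ") Tj T*" & vbLf)
            Next
            content.Append("ET")
            Dim streamText As String = content.ToString()

            bodies.Add(String.Format("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] " &
                "/Resources << /Font << /F1 3 0 R >> >> /Contents {0} 0 R >>", 5 + 2 * p))
            bodies.Add(String.Format("<< /Length {0} >>{1}stream{1}{2}{1}endstream",
                Encoding.ASCII.GetByteCount(streamText), vbLf, streamText))
        Next

        Using ms As New MemoryStream()
            Dim offsets As New List(Of Long)()
            WriteAscii(ms, "%PDF-1.4" & vbLf)
            For i As Integer = 0 To bodies.Count - 1
                offsets.Add(ms.Position) ' byte offset recorded for the xref table
                WriteAscii(ms, String.Format("{0} 0 obj{1}{2}{1}endobj{1}", i + 1, vbLf, bodies(i)))
            Next

            ' Cross-reference table: every entry must be exactly 20 bytes long.
            Dim xrefStart As Long = ms.Position
            WriteAscii(ms, String.Format("xref{0}0 {1}{0}", vbLf, bodies.Count + 1))
            WriteAscii(ms, "0000000000 65535 f " & vbLf)
            For Each offset As Long In offsets
                WriteAscii(ms, offset.ToString("D10") & " 00000 n " & vbLf)
            Next
            WriteAscii(ms, String.Format("trailer{0}<< /Size {1} /Root 1 0 R >>{0}", vbLf, bodies.Count + 1))
            WriteAscii(ms, String.Format("startxref{0}{1}{0}%%EOF{0}", vbLf, xrefStart))
            Return ms.ToArray()
        End Using
    End Function

End Module
```

A call such as `File.WriteAllBytes(path, TextPdfSketch.BuildTextPdf(File.ReadAllLines(sourceTextPath)))` would produce the file, where `sourceTextPath` is a hypothetical path to the exported text. Because each page holds the text itself plus a few bytes of operators per line, the output should stay close to the size of the source text rather than growing the way an image-based PDF does.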
