content transformer and pdfbox

vincent-kali
Star Contributor
Hi all,
We're running Alfresco 4.2.e and facing a very high CPU load. After
some investigation, we found that transformer.PdfBox is causing this abnormal load.

It seems that PDFBox can be upgraded. Could somebody explain how to do that?
Is pdfbox.jar embedded in Alfresco? Do we need to recompile the whole project?

Any other suggestions?

Thanks,
Vincent

3 REPLIES

nickburch
Confirmed Champ
Alfresco 4.2.e uses PDFBox 1.8.2, while the new 5.0.a uses PDFBox 1.8.4, so one easy option is just to upgrade to a newer Alfresco release!

Otherwise, to upgrade Apache PDFBox you generally need to upgrade Apache Tika too, and that means upgrading some of the other dependencies as well. For extra fun, Alfresco are currently shipping custom patched versions of Apache Tika, so you might be better off grabbing the newer Tika + friends jars out of 5.0.a or HEAD rather than trying to find + replace the jars yourself. For the list of the dependency jars, you'll want to look in the Tika Core and Tika Parser poms.

I'd probably suggest the upgrade to 5.0.a, unless you have very strong reasons to stick with 4.2.
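
Before swapping anything, it may help to see which PDFBox and Tika jars the installation actually ships. The following is only a rough sketch: the default path `webapps/alfresco/WEB-INF/lib` (relative to the Tomcat directory), the class name `JarInventory`, and the prefix list are assumptions for illustration, not anything Alfresco documents. It simply lists jar files whose names start with `pdfbox`, `fontbox`, `jempbox` or `tika`, so the set can be compared against the one in a newer release.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class JarInventory {
    // Assumed prefixes: PDFBox, its sibling jars (FontBox, JempBox) and Tika.
    static final List<String> PREFIXES =
            Arrays.asList("pdfbox", "fontbox", "jempbox", "tika");

    // Returns the jar file names that start with one of the prefixes, sorted.
    static List<String> filter(List<String> names) {
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            String lower = name.toLowerCase();
            if (!lower.endsWith(".jar")) {
                continue; // ignore anything that is not a jar file
            }
            for (String prefix : PREFIXES) {
                if (lower.startsWith(prefix)) {
                    hits.add(name);
                    break;
                }
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        // Assumed default location; pass the real lib directory as the first argument.
        File lib = new File(args.length > 0 ? args[0] : "webapps/alfresco/WEB-INF/lib");
        String[] names = lib.list();
        if (names == null) {
            System.err.println("Not a readable directory: " + lib);
            return;
        }
        for (String jar : filter(Arrays.asList(names))) {
            System.out.println(jar);
        }
    }
}
```

Running it against both the 4.2 and the 5.0.a `WEB-INF/lib` directories and diffing the two outputs should give a first idea of which jars differ. It only looks at file names, so it will not catch the other dependencies listed in the Tika poms.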

Hi nickburch,

Thanks for your feedback… this doesn't look straightforward.
Initially we had two major issues using PDFBox 1.8.2:
- High CPU usage
- Some parts of PDF string tables are not converted to text, and are therefore not indexed by Lucene.

Any advice?

Thanks,
Vincent

heiko_robert
Star Collaborator
Has anybody tried replacing the jars in 4.2.e/f? Is it worth testing?
It's really painful to have the whole of Alfresco blocked again and again.
See also:
https://issues.alfresco.com/jira/browse/MNT-11350
https://issues.alfresco.com/jira/browse/MNT-11666
https://issues.apache.org/jira/browse/PDFBOX-1585