Not able to index content of large pdfs in database mysql

jbrasil
Confirmed Champ

Hey guys!
I am unable to index the content of large PDF files.

Alfresco Community 6.1.1
Ubuntu Linux 18.04


Here is the error message from catalina.out:

2020-10-01 17:03:28,779 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-41] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
2020-10-01 17:03:29,193 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-28] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB

I read the documentation at:
https://docs.alfresco.com/6.1/references/dev-extension-points-content-transformer.html

I added the following to alfresco-global.properties:

content.transformer.PdfBox.priority = 110
content.transformer.PdfBox.extensions.pdf.txt.priority = 50
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes = 25600

However, it still didn't work.
Can you help please?
With best regards,

3 REPLIES

angelborroy
Community Manager

Hi angelborroy,
I had seen that documentation.
I applied the parameters below in alfresco-global.properties
and restarted the Alfresco service.
It still didn't work.
Can you help?
Thanks a lot.

content.transformer.default.timeoutMs=180000
content.transformer.default.txt.*.maxSourceSizeKBytes=1048576
content.transformer.JodConverter.maxSourceSizeKBytes=102400

log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG

content.metadataExtracter.pdf.maxDocumentSizeMB=1000
content.metadataExtracter.default.timeoutMs=3625000

content.transformer.PdfBox.priority=110
content.transformer.PdfBox.extensions.pdf.txt.priority=50
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=25600

content.transformer.json2html.priority=30
content.transformer.json2html.extensions.json.html.supported=true
content.transformer.json2html.extensions.json.html.priority=30

afaust
Legendary Innovator

Well, not really cross-posting as the OP is different. But the answer in the other thread is definitely spot on for a similar issue with transformers. What is not mentioned in the other thread is that the transformer config is also documented.

But in this case we are talking about metadata extractors, and these have separately configured limits. In fact, the PdfBox extractor is about the only one that has a configured limit via the global property content.metadataExtracter.pdf.maxDocumentSizeMB
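
To check which limit a given PDF would actually hit, a small script can read the extractor property out of alfresco-global.properties and compare it with the file size. This is only a rough sketch, not part of Alfresco: the function names are made up for illustration, and it assumes that 1 MB means 1024 * 1024 bytes, which may not match how the extractor counts. Only the property name comes from this thread.

```python
# Hypothetical helper, not an Alfresco API: reads a properties file's text
# and reports whether a file would exceed the PDF metadata extractor limit.

EXTRACTOR_LIMIT_KEY = "content.metadataExtracter.pdf.maxDocumentSizeMB"


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def pdf_extractor_limit_bytes(props):
    """Return the configured extractor limit in bytes, or None if it is not set.

    Assumption: 1 MB is treated as 1024 * 1024 bytes.
    """
    value = props.get(EXTRACTOR_LIMIT_KEY)
    if value is None:
        return None
    return int(float(value) * 1024 * 1024)


def exceeds_extractor_limit(props, file_size_bytes):
    """True only if a limit is configured and the file is larger than it."""
    limit = pdf_extractor_limit_bytes(props)
    return limit is not None and file_size_bytes > limit
```

Note that the content.transformer.* keys are ignored here on purpose: they are transformer limits and are configured separately from the metadata extractor limit.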