10-02-2020 07:17 AM
Hey guys!
I am unable to index large PDF files.
Version: Alfresco Community 6.1.1
OS: Ubuntu Linux 18.04
Here is the error message from catalina.out:
2020-10-01 17:03:28,779 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-41] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
2020-10-01 17:03:29,193 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-28] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
I read the documentation on the website:
https://docs.alfresco.com/6.1/references/dev-extension-points-content-transformer.html
and added the following to alfresco-global.properties:
content.transformer.PdfBox.priority = 110
content.transformer.PdfBox.extensions.pdf.txt.priority = 50
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes = 25600
However, it still didn't work.
Can you help please?
With best regards,
10-02-2020 08:53 AM
Cross-posting: https://hub.alfresco.com/t5/alfresco-content-services-forum/increase-max-file-size-that-solr-indexes...
10-02-2020 09:55 AM
Hi angelborroy,
I had seen that documentation.
I applied the parameters below in alfresco-global.properties
and restarted the Alfresco service.
It still didn't work.
Can you help?
Thanks a lot.
content.transformer.default.timeoutMs=180000
content.transformer.default.txt.*.maxSourceSizeKBytes=1048576
content.transformer.JodConverter.maxSourceSizeKBytes=102400
log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG
content.metadataExtracter.pdf.maxDocumentSizeMB=1000
content.metadataExtracter.default.timeoutMs=3625000
content.transformer.PdfBox.priority=110
content.transformer.PdfBox.extensions.pdf.txt.priority=50
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=25600
content.transformer.json2html.priority=30
content.transformer.json2html.extensions.json.html.supported=true
content.transformer.json2html.extensions.json.html.priority=30
10-02-2020 09:58 AM
Well, not really cross-posting as the OP is different. But the answer in the other thread is definitely spot on for a similar issue with transformers. What is not mentioned in the other thread is that the transformer config is also documented.
But in this case we are talking about metadata extractors, and these have separately configured limits. In fact, the PdfBox extractor is about the only one that has a configured limit, which is set via the global property content.metadataExtracter.pdf.maxDocumentSizeMB.
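For illustration, a minimal sketch of how that property would look in alfresco-global.properties. The value 100 is only an assumed example, not a recommendation; the warning in the log above reports a limit of 10.0 MB, so the value would need to be larger than the biggest PDF you expect to extract metadata from.

```properties
# Metadata extractor limit for PDFs (separate from the content.transformer.* limits).
# The value is in MB, as the property name suggests; 100 is an illustrative assumption.
content.metadataExtracter.pdf.maxDocumentSizeMB=100
```

As with any change to alfresco-global.properties, the file is read at startup, so the repository has to be restarted before the new limit applies.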