07-06-2018 03:48 AM
Hi All,
We are uploading PDF files of up to 200 MB to our DMS, but their content is not getting indexed.
After searching, we learned that by default only PDF files of up to 10 MB are indexed, so we decided to override this property and raise the limit to 1 GB:
content.metadataExtracter.pdf.maxDocumentSizeMB=1000
We then deleted our old indexes and restarted the DMS, but it had no effect.
We also found that the default timeout for the metadata extracter was 20 milliseconds, so we changed it to about 1 hour:
content.metadataExtracter.default.timeoutMs=3625000
There was still no change.
Please advise what else needs to be done to get the content indexed correctly.
Thanks
Hiten Rastogi
07-06-2018 04:42 AM
Just a first question: are your documents PDFs containing extractable text, rather than scanned pages without OCR or files protected by restrictive PDF permissions?
07-06-2018 04:44 AM
Hi Martin,
Yes, the PDFs contain readable text; they are not scanned documents.
Thanks
Hiten Rastogi
07-06-2018 04:47 AM
Are there any errors in the Alfresco or Tomcat logs, e.g. Java heap space errors?
Maybe you can increase the transformation logging via log4j:
log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG
log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG
07-06-2018 05:39 AM
Hi Martin,
I enabled the logging and got the output below. Please help me interpret it.
log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=DEBUG
2018-07-06 15:07:03,442 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Starting metadata extraction:
reader: ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@7671e45b
2018-07-06 15:07:03,443 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Concurrent extractions : 0
2018-07-06 15:07:03,443 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] New extraction accepted. Concurrent extractions : 1
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Extraction finalized. Remaining concurrent extraction : 0
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Converted extracted raw values to system values:
Raw Properties: {pdfDFVersion=1.5, TIKA_PARSER_PARSE_SHAPES=false, comments=null, dc:subject=null, author=null, xmpTPg:NPages=84, dc:format=application/pdf; version=1.5, title=null, pdf:encrypted=false, Content-Type=application/pdf}
System Properties: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Extracted Metadata from ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
Found: {pdfDFVersion=1.5, TIKA_PARSER_PARSE_SHAPES=false, comments=null, dc:subject=null, author=null, xmpTPg:NPages=84, dc:format=application/pdf; version=1.5, title=null, pdf:encrypted=false, Content-Type=application/pdf}
Mapped and Accepted: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
2018-07-06 15:07:05,090 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Completed metadata extraction:
reader: ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@7671e45b
changed: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG
log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG
2018-07-06 15:07:19,467 INFO [web.scripts.QuickShareStatus] [http-apr-8080-exec-1] Successfully retrieved quick share information from Alfresco.
2018-07-06 15:07:21,396 INFO [web.scripts.MimetypesQuery] [http-apr-8080-exec-8] Successfully retrieved mimetypes information from Alfresco.
2018-07-06 15:07:30,029 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 pdf txt Xerox Scan_19052018115315(1)-2.pdf 39.7 MB -- index -- SolrIndexer NO transformers
2018-07-06 15:07:30,037 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 workspace://SpacesStore/66aa186a-9dc9-44aa-8680-fad46a88105f
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --a) [50] PdfBox > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --b) [120] TikaAuto > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 Finished in 10 ms Transformer NOT called
Thanks
Hiten Rastogi
07-06-2018 06:00 AM
You can see the problem in the log output. Indexing of the content has nothing to do with the metadata extracter, so increasing its limit did not have any impact on your problem. You need to increase the limits of the PDF => TXT transformers so they are not rejecting the PDF source document.
Check content transformation limits and content transformers (and renditions) for details on how to configure the Transformers subsystem.
The following lines in your log output show that transformers have a 25 MB source file limit and thus are not acting on a 200 MB PDF:
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --a) [50] PdfBox > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --b) [120] TikaAuto > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 Finished in 10 ms Transformer NOT called
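As a rough sketch of what raising those limits could look like in alfresco-global.properties: the property names below follow the Alfresco 5.x transformers.properties naming as I recall it, and the 256 MB value is only an illustrative assumption chosen to cover 200 MB files. Please verify both against the documentation for your version before relying on them.

```properties
# Assumed property names (Alfresco 5.x style); check your version's transformers.properties.
# The log shows both PDF => TXT transformers rejecting sources larger than 25 MB (25600 KB).
# 262144 KB = 256 MB is an illustrative value, not a recommendation.
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=262144
content.transformer.TikaAuto.extensions.pdf.txt.maxSourceSizeKBytes=262144
```

Text extraction from very large PDFs can need considerably more memory and time, so it is worth watching the heap after raising these values.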
07-06-2018 06:33 AM
Thanks Axel,
It is working now.
07-06-2018 06:39 AM
...don't forget to comment out the log4j debugging options again - this could be a bit noisy in production...
10-02-2020 08:27 AM
Hi hiten_rastogi1,
All right?
What did you do to solve this problem?
I have the same situation.
See the catalina.out log:
2020-10-01 17:03:28,779 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-41] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
2020-10-01 17:03:29,193 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-28] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
Thanks a lot!