topic Re: Not able to index content of large pdfs in Alfresco Forum

Not able to index content of large pdfs

hiten_rastogi1 — Fri, 06 Jul 2018 07:48:16 GMT

Hi All, we are uploading PDF files of up to 200 MB to our DMS, but their content is not getting indexed. After searching, we learned that the maximum size of a PDF file that can be indexed is 10 MB by default, so we decided to override this property to 1 GB:

content.metadataExtracter.pdf.maxDocumentSizeMB=100

Re: Not able to index content of large pdfs

mehe — Fri, 06 Jul 2018 08:42:05 GMT

Just a first question: your documents are pdfs containing extractable text, not just scanned pages without ocr or protected by restricted pdf permissions?

Re: Not able to index content of large pdfs

hiten_rastogi1 — Fri, 06 Jul 2018 08:44:40 GMT

Hi Martin,

Yes, the PDFs contain readable text; they are not scanned ones.

Thanks

Hiten Rastogi

Re: Not able to index content of large pdfs

mehe — Fri, 06 Jul 2018 08:47:21 GMT

Any errors in the Alfresco or Tomcat logs, e.g. Java heap space errors?

Maybe you can increase the transformation logging via log4j:

log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG

log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG

Re: Not able to index content of large pdfs

hiten_rastogi1 — Fri, 06 Jul 2018 09:39:27 GMT

Hi Martin,

I enabled the logging and found the output below. Please help me interpret it.

log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=DEBUG

2018-07-06 15:07:03,442 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Starting metadata extraction:
reader: ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@7671e45b
2018-07-06 15:07:03,443 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Concurrent extractions : 0
2018-07-06 15:07:03,443 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] New extraction accepted. Concurrent extractions : 1
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Extraction finalized. Remaining concurrent extraction : 0
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Converted extracted raw values to system values:
Raw Properties: {pdfDFVersion=1.5, TIKA_PARSER_PARSE_SHAPES=false, comments=null, dc:subject=null, author=null, xmpTPg:NPages=84, dc:format=application/pdf; version=1.5, title=null, pdf:encrypted=false, Content-Type=application/pdf}
System Properties: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Extracted Metadata from ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
Found: {pdfDFVersion=1.5, TIKA_PARSER_PARSE_SHAPES=false, comments=null, dc:subject=null, author=null, xmpTPg:NPages=84, dc:format=application/pdf; version=1.5, title=null, pdf:encrypted=false, Content-Type=application/pdf}
Mapped and Accepted: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
2018-07-06 15:07:05,090 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Completed metadata extraction:
reader: ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@7671e45b
changed: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}

log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG
log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG

2018-07-06 15:07:19,467 INFO [web.scripts.QuickShareStatus] [http-apr-8080-exec-1] Successfully retrieved quick share information from Alfresco.
2018-07-06 15:07:21,396 INFO [web.scripts.MimetypesQuery] [http-apr-8080-exec-8] Successfully retrieved mimetypes information from Alfresco.
2018-07-06 15:07:30,029 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 pdf txt Xerox Scan_19052018115315(1)-2.pdf 39.7 MB -- index -- SolrIndexer NO transformers
2018-07-06 15:07:30,037 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 workspace://SpacesStore/66aa186a-9dc9-44aa-8680-fad46a88105f
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --a) [50] PdfBox > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --b) [120] TikaAuto > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 Finished in 10 ms Transformer NOT called

Thanks

Hiten Rastogi

Re: Not able to index content of large pdfs

afaust — Fri, 06 Jul 2018 10:00:34 GMT

You can see the problem in the log output. Indexing of the content has nothing to do with the metadata extracter, so increasing its limit did not have any impact on your problem. You need to increase the limits of the PDF => TXT transformers so they are not rejecting the PDF source document.

Check the documentation on content transformation limits and on content transformers (and renditions) for details on how to configure the Transformers subsystem.

The following lines in your log output show that the transformers have a 25 MB source file limit and thus do not act on larger files, such as the 39.7 MB PDF logged here or a 200 MB PDF:

2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --a) [50] PdfBox > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --b) [120] TikaAuto > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 Finished in 10 ms Transformer NOT called
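
For later readers, since the thread does not record the exact settings that were applied: a minimal sketch of what raising the PDF to text limits might look like in alfresco-global.properties. The property names follow the per-transformer pattern of the legacy Transformers subsystem, and the transformer names PdfBox and TikaAuto are taken from the log lines above. The value 204800 (200 MB expressed in KB) is only an illustrative choice. Treat the names as assumptions and verify them against the transformers.properties shipped with your Alfresco version.

```properties
# Assumed property names; check them against your Alfresco version.
# The log shows both PDF -> TXT transformers rejecting sources over 25 MB.
# 200 MB * 1024 = 204800 KB (illustrative value, sized for 200 MB uploads).
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=204800
content.transformer.TikaAuto.extensions.pdf.txt.maxSourceSizeKBytes=204800
```

Larger limits mean longer and more memory-hungry transformations, so it is worth watching the logs for heap space or timeout problems after such a change.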

Re: Not able to index content of large pdfs

hiten_rastogi1 — Fri, 06 Jul 2018 10:33:58 GMT

Thanks Axel,

It is working now.

Re: Not able to index content of large pdfs

mehe — Fri, 06 Jul 2018 10:39:13 GMT

...don't forget to comment out the log4j debugging options again - this could be a bit noisy in production...

Re: Not able to index content of large pdfs

jbrasil — Fri, 02 Oct 2020 12:27:52 GMT

Hi hiten_rastogi1,
I hope you are well.
What did you do to solve this problem?
I have the same situation.
See the catalina.out log:

2020-10-01 17:03:28,779 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-41] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
2020-10-01 17:03:29,193 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-28] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB

Thanks a lot!