10-24-2022 04:53 PM
I've been trying to get Alfresco to extract text from PDF/A files larger than 25 MB, without success. I've read countless pages of documentation, installed different versions on different operating systems, tested several recommended settings, and removed every limit I could find. None of it helped.
Alfresco extracts text from PDF files smaller than 25 MB, but not from anything larger. The logs do not report any problem related to this.
I can see this in the logs: two PDFs of different sizes were sent, but text was extracted from only one of them.
Goal: Be able to search for terms that exist in PDF/A files larger than 25MB.
Some settings I've tried:
### Time out configured for all extractor and all mimetypes
content.metadataExtracter.default.timeoutMs=3600000
### Maximum size of a document to process - configured for PdfBoxMetadataExtracter , pdf files
content.metadataExtracter.pdf.maxDocumentSizeMB=900
### Maximum number of concurrent extractions - configured for PdfBoxMetadataExtracter , pdf files
content.metadataExtracter.pdf.maxConcurrentExtractionsCount=15
content.transformer.default.timeoutMs=3600000
content.transformer.default.txt.*.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.extensions.doc.pdf.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.extensions.doc.pdf.maxSourceSizeKBytes.use.asyncRule=1073741824
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes.use.index=1073741824
content.transformer.TikaAuto.timeoutMs.use.index=3600000
content.transformer.default.extensions.doc.txt.maxSourceSizeKBytes=1073741824
content.transformer.TikaAuto.timeoutMs=3600000
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes=1073741824
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes.use.webpreview=1073741824
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=1073741824
content.transformer.TikaAuto.extensions.pdf.txt.maxSourceSizeKBytes=1073741824
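A note on units, since the same number (1073741824) appears both in the `maxSourceSizeKBytes` properties above and in the `maxSourceSizeBytes` JSON key used later in this post. The sketch below only does the arithmetic; it assumes "KBytes" means 1024 bytes, and the helper names `kbytes_to_bytes` and `human` are made up for illustration.

```python
KIB = 1024  # assumption: one "KByte" in the property names is 1024 bytes


def kbytes_to_bytes(kbytes: int) -> int:
    """Convert a value expressed in KBytes to bytes."""
    return kbytes * KIB


def human(n_bytes: int) -> str:
    """Render a byte count using binary units (B, KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


value = 1073741824
print("read as bytes: ", human(value))                   # 1 GiB
print("read as KBytes:", human(kbytes_to_bytes(value)))  # 1 TiB
print("25 MB threshold:", human(25 * 1024 * 1024))       # 25 MiB
```

Either reading is far above the 25 MB threshold observed here, so the magnitude of these values is unlikely to be what blocks extraction.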
I followed the instructions on this page:
export TRANSFORMER_ROUTES_ADDITIONAL_custom="/etc/opt/alfresco/content-services/classpath/alfresco/extension/transform/pipelines/custom-pipeline-file.json"
I then created the file custom-pipeline-file.json and tried a variety of configurations. Here are some of them:
{
  "overrideSupported": [
    { "maxSourceSizeBytes": 1073741824 }
  ]
}
{
  "transformers": [
    {
      "transformerName": "tika",
      "supportedSourceAndTargetList": [
        { "sourceMediaType": "application/pdf", "maxSourceSizeBytes": 1073741824, "targetMediaType": "text/plain" },
        { "sourceMediaType": "application/pdf", "priority": 40, "targetMediaType": "text/plain" }
      ]
    }
  ]
}
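When trying many variants of such a file, it can help to list what a given variant actually declares. The sketch below is only an illustration: it walks the `transformers` / `supportedSourceAndTargetList` structure shown above and flags source/target pairs that appear more than once for the same transformer (as in the second attempt, where one entry carries the size limit and the other the priority). The function names are hypothetical, and how the T-Engine treats such repeated entries is not something this check can tell you.

```python
import json
from collections import Counter


def declared_limits(config: dict) -> list:
    """Return (transformer, source, target, maxSourceSizeBytes) for each entry."""
    rows = []
    for transformer in config.get("transformers", []):
        name = transformer.get("transformerName")
        for entry in transformer.get("supportedSourceAndTargetList", []):
            rows.append((
                name,
                entry.get("sourceMediaType"),
                entry.get("targetMediaType"),
                entry.get("maxSourceSizeBytes"),  # None when the key is absent
            ))
    return rows


def duplicate_pairs(rows: list) -> list:
    """Return (transformer, source, target) triples declared more than once."""
    counts = Counter(row[:3] for row in rows)
    return [key for key, n in counts.items() if n > 1]


# The second attempt from this post, inlined so the sketch is self-contained.
SAMPLE = json.loads("""
{ "transformers": [ { "transformerName": "tika",
    "supportedSourceAndTargetList": [
      {"sourceMediaType": "application/pdf", "maxSourceSizeBytes": 1073741824,
       "targetMediaType": "text/plain"},
      {"sourceMediaType": "application/pdf", "priority": 40,
       "targetMediaType": "text/plain"} ] } ] }
""")

rows = declared_limits(SAMPLE)
for row in rows:
    print(row)
print("declared more than once:", duplicate_pairs(rows))
```

To check a file on disk instead, replace `SAMPLE` with `json.load(open(path))` for the path you exported in `TRANSFORMER_ROUTES_ADDITIONAL_custom`.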
After each change, I restarted the services using the commands below:
sudo service alfresco-content restart
sudo service alfresco-search restart
sudo service alfresco-tengine-aio restart
After digging deeper, I found a configuration for custom-pipeline-file.json that produced a different result. Here's the configuration:
{
  "transformers": [
    {
      "transformerName": "PdfBox",
      "supportedSourceAndTargetList": [
        { "sourceMediaType": "application/pdf", "maxSourceSizeBytes": 1073741824, "targetMediaType": "text/plain" }
      ],
      "transformOptions": [ "pdfboxOptions" ]
    }
  ]
}
And I get the error below in the logs:
2022-10-25 02:44:40.172 ERROR 41834 --- [nio-8090-exec-7] o.a.transformer.TransformController : No transforms were able to handle the request
org.alfresco.transform.exceptions.TransformException: No transforms were able to handle the request
    at org.alfresco.transformer.AbstractTransformerController.getTransformerName(AbstractTransformerController.java:444) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at org.alfresco.transformer.AbstractTransformerController.getTransformerName(AbstractTransformerController.java:421) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at org.alfresco.transformer.AbstractTransformerController.transform(AbstractTransformerController.java:172) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:na]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
    at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:197) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:141) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:106) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:808) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1064) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:963) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:909) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:681) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:764) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:227) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) ~[tomcat-embed-websocket-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:96) ~[spring-boot-actuator-2.5.4.jar!/:2.5.4]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:197) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:542) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:382) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1726) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
I would really appreciate it if someone could point me to a way to solve this, as I've been trying for weeks.
10-26-2022 05:38 AM
I wasn't able to apply that configuration to Transform Core AIO 2.5 / 2.6.
If you upgrade Transform Core AIO to 3.0.0, you can follow these instructions:
https://github.com/Alfresco/alfresco-transform-core/blob/master/docs/transform-config.md
I've tested that locally with Docker Compose and it works. This is the section for the Transform Core AIO configuration:
transform-core-aio:
  image: alfresco/alfresco-transform-core-aio:3.0.0
  mem_limit: 2048m
  environment:
    TRANSFORM_CONFIG_FILE_PDFUPDATE: "/0200-increase-pdf-max-source.json"
    JAVA_OPTS: "
      -XX:MinRAMPercentage=50
      -XX:MaxRAMPercentage=80
      -Dserver.tomcat.threads.max=12
      -Dserver.tomcat.threads.min=4
      -Dlogging.level.org.alfresco.transform.router.TransformerDebug=ERROR
      "
  ports:
    - 8090:8090
  volumes:
    - ./0200-increase-pdf-max-source.json:/0200-increase-pdf-max-source.json
And the content of the 0200-increase-pdf-max-source.json file is:
{
  "overrideSupported": [
    {
      "transformerName": "PdfBox",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    },
    {
      "transformerName": "TikaAuto",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    }
  ]
}
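If you need the same override for several transformers or media types, the file can be generated rather than edited by hand. This is a minimal sketch, not part of the tested setup: `build_override` is a hypothetical helper, and it assumes that `-1` for `maxSourceSizeBytes` is read as "no limit", which is how it is used in the file above.

```python
import json


def build_override(transformers, source, target, max_bytes=-1):
    """Build an overrideSupported config with one entry per transformer name.

    max_bytes=-1 mirrors the file above (assumed to mean "no size limit").
    """
    return {
        "overrideSupported": [
            {
                "transformerName": name,
                "sourceMediaType": source,
                "targetMediaType": target,
                "maxSourceSizeBytes": max_bytes,
            }
            for name in transformers
        ]
    }


config = build_override(["PdfBox", "TikaAuto"], "application/pdf", "text/plain")
text = json.dumps(config, indent=2)
print(text)
```

Redirect the output to 0200-increase-pdf-max-source.json and mount it as in the Docker Compose section above.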
Hope this helps.
08-09-2024 04:23 AM
It works for me on Alfresco 7.4 in docker too.