<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Alfresco 7.1 Does not extract text from PDF/A of large files in Alfresco Forum</title>
    <link>https://connect.hyland.com/t5/alfresco-forum/alfresco-7-1-does-not-extract-text-from-pdf-a-of-large-files/m-p/146466#M38790</link>
    <description>&lt;P&gt;I wasn't able to apply that configuration to Transform Core AIO 2.5 / 2.6.&lt;/P&gt;
&lt;P&gt;If you upgrade Transform Core AIO to 3.0.0, you can follow these instructions:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://github.com/Alfresco/alfresco-transform-core/blob/master/docs/transform-config.md" target="_blank" rel="noopener nofollow noreferrer"&gt;https://github.com/Alfresco/alfresco-transform-core/blob/master/docs/transform-config.md&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;I've tested that locally with Docker Compose and it works. This is the Transform Core AIO section of the Compose file:&lt;/P&gt;
&lt;PRE&gt;    transform-core-aio:
        image: alfresco/alfresco-transform-core-aio:3.0.0
        mem_limit: 2048m
        environment:
            TRANSFORM_CONFIG_FILE_PDFUPDATE: "/0200-increase-pdf-max-source.json"
            JAVA_OPTS: "
              -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80
              -Dserver.tomcat.threads.max=12
              -Dserver.tomcat.threads.min=4
              -Dlogging.level.org.alfresco.transform.router.TransformerDebug=ERROR
            "
        ports:
          - 8090:8090
        volumes:
            - ./0200-increase-pdf-max-source.json:/0200-increase-pdf-max-source.json&lt;/PRE&gt;
&lt;P&gt;And the content of the&amp;nbsp;0200-increase-pdf-max-source.json file is:&lt;/P&gt;
&lt;PRE&gt;{
  "overrideSupported": [
    {
      "transformerName": "PdfBox",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    },
    {
      "transformerName": "TikaAuto",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    }
  ]
}
&lt;/PRE&gt;
&lt;P&gt;Hope this helps.&lt;/P&gt;</description>
    <pubDate>Wed, 26 Oct 2022 09:38:36 GMT</pubDate>
    <dc:creator>angelborroy</dc:creator>
    <dc:date>2022-10-26T09:38:36Z</dc:date>
    <item>
      <title>Alfresco 7.1 Does not extract text from PDF/A of large files</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/alfresco-7-1-does-not-extract-text-from-pdf-a-of-large-files/m-p/146465#M38789</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I've been trying to get Alfresco to extract text from PDF/A files larger than 25 MB and I haven't been successful. I've read countless pages of documentation, installed different versions on different operating systems, tested several recommended settings, and removed all the limits I could find, all without success.&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Alfresco version&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;7.1&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Search Services&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;2.0.2&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Ubuntu&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;20.04&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(compatible version according to the documentation).&lt;/LI&gt;&lt;LI&gt;Installation was done through&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Ansible&lt;/STRONG&gt;, following the documentation.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;Alfresco can extract text from PDF files smaller than 25 MB, but not from anything larger. The logs do not report any errors about this.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;In the log excerpt below, two PDFs of different sizes were sent, but text was extracted from only one of them:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="pdf-test.png" style="width: 999px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://connect.hyland.com/t5/image/serverpage/image-id/1592i5203E68830B62D8D/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Goal&lt;/STRONG&gt;&lt;SPAN&gt;: Be able to search for terms that exist in PDF/A files larger than 25 MB.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Some settings I've tried:&lt;/P&gt;&lt;PRE&gt;### Timeout configured for all extractors and all mimetypes
content.metadataExtracter.default.timeoutMs=3600000

### Maximum size of a document to process - configured for PdfBoxMetadataExtracter, pdf files
content.metadataExtracter.pdf.maxDocumentSizeMB=900

### Maximum number of concurrent extractions - configured for PdfBoxMetadataExtracter, pdf files
content.metadataExtracter.pdf.maxConcurrentExtractionsCount=15


content.transformer.default.timeoutMs=3600000
content.transformer.default.txt.*.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.extensions.doc.pdf.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.extensions.doc.pdf.maxSourceSizeKBytes.use.asyncRule=1073741824
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes.use.index=1073741824
content.transformer.TikaAuto.timeoutMs.use.index=3600000
content.transformer.default.extensions.doc.txt.maxSourceSizeKBytes=1073741824
content.transformer.TikaAuto.timeoutMs=3600000
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes=1073741824
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes.use.webpreview=1073741824
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=1073741824
content.transformer.TikaAuto.extensions.pdf.txt.maxSourceSizeKBytes=1073741824&lt;/PRE&gt;&lt;P&gt;I also followed the instructions on &lt;A href="https://docs.alfresco.com/transform-service/1.4/config/extend/" target="_self" rel="nofollow noopener noreferrer"&gt;this page&lt;/A&gt; and set:&lt;/P&gt;&lt;PRE&gt;export TRANSFORMER_ROUTES_ADDITIONAL_custom="/etc/opt/alfresco/content-services/classpath/alfresco/extension/transform/pipelines/custom-pipeline-file.json"&lt;/PRE&gt;&lt;P&gt;I then created the file &lt;STRONG&gt;custom-pipeline-file.json&lt;/STRONG&gt;&amp;nbsp;and tried a variety of configurations. Here are some of them:&lt;/P&gt;&lt;PRE&gt;{
  "overrideSupported": [
    {
      "maxSourceSizeBytes": 1073741824
    }
  ]
}&lt;/PRE&gt;&lt;PRE&gt;{
  "transformers": [
    {
      "transformerName": "tika",
      "supportedSourceAndTargetList": [
        {"sourceMediaType": "application/pdf", "maxSourceSizeBytes": 1073741824, "targetMediaType": "text/plain" },
        {"sourceMediaType": "application/pdf", "priority": 40, "targetMediaType": "text/plain" }
      ]
    }
  ]
}&lt;/PRE&gt;&lt;P&gt;&lt;SPAN&gt;After each change, I restarted the services using the commands below:&lt;/SPAN&gt;&lt;/P&gt;&lt;PRE&gt;sudo service alfresco-content restart
sudo service alfresco-search restart
sudo service alfresco-tengine-aio restart&lt;/PRE&gt;&lt;P&gt;&lt;SPAN&gt;After digging deeper into this, I found a configuration for the file I created,&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;custom-pipeline-file.json&lt;/STRONG&gt;&lt;SPAN&gt;,&amp;nbsp;that gave a different result. Here's the configuration:&lt;/SPAN&gt;&lt;/P&gt;&lt;PRE&gt;{
  "transformers": [
    {
      "transformerName": "PdfBox",
      "supportedSourceAndTargetList": [
        {"sourceMediaType": "application/pdf", "maxSourceSizeBytes": 1073741824, "targetMediaType": "text/plain"}
      ],
      "transformOptions": [
        "pdfboxOptions"
      ]
    }
  ]
}&lt;/PRE&gt;&lt;P&gt;&lt;SPAN&gt;With this configuration, I get the error below in the logs:&lt;/SPAN&gt;&lt;/P&gt;&lt;PRE&gt;2022-10-25 02:44:40.172 ERROR 41834 --- [nio-8090-exec-7] o.a.transformer.TransformController      : No transforms were able to handle the request

org.alfresco.transform.exceptions.TransformException: No transforms were able to handle the request
    at org.alfresco.transformer.AbstractTransformerController.getTransformerName(AbstractTransformerController.java:444) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at org.alfresco.transformer.AbstractTransformerController.getTransformerName(AbstractTransformerController.java:421) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at org.alfresco.transformer.AbstractTransformerController.transform(AbstractTransformerController.java:172) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:na]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
    at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:197) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:141) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:106) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:808) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1064) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:963) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:909) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:681) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:764) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:227) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) ~[tomcat-embed-websocket-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:96) ~[spring-boot-actuator-2.5.4.jar!/:2.5.4]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:197) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:542) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:382) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1726) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]&lt;/PRE&gt;&lt;P&gt;&lt;SPAN&gt;I would really appreciate it if someone could suggest a way to solve this, as I've been trying for weeks.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Oct 2022 20:53:43 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/alfresco-7-1-does-not-extract-text-from-pdf-a-of-large-files/m-p/146465#M38789</guid>
      <dc:creator>RodrigoGomes</dc:creator>
      <dc:date>2022-10-24T20:53:43Z</dc:date>
    </item>
    <item>
      <title>Re: Alfresco 7.1 Does not extract text from PDF/A of large files</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/alfresco-7-1-does-not-extract-text-from-pdf-a-of-large-files/m-p/146466#M38790</link>
      <description>&lt;P&gt;I wasn't able to apply that configuration to Transform Core AIO 2.5 / 2.6.&lt;/P&gt;
&lt;P&gt;If you upgrade Transform Core AIO to 3.0.0, you can follow these instructions:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://github.com/Alfresco/alfresco-transform-core/blob/master/docs/transform-config.md" target="_blank" rel="noopener nofollow noreferrer"&gt;https://github.com/Alfresco/alfresco-transform-core/blob/master/docs/transform-config.md&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;I've tested that locally with Docker Compose and it works. This is the Transform Core AIO section of the Compose file:&lt;/P&gt;
&lt;PRE&gt;    transform-core-aio:
        image: alfresco/alfresco-transform-core-aio:3.0.0
        mem_limit: 2048m
        environment:
            TRANSFORM_CONFIG_FILE_PDFUPDATE: "/0200-increase-pdf-max-source.json"
            JAVA_OPTS: "
              -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80
              -Dserver.tomcat.threads.max=12
              -Dserver.tomcat.threads.min=4
              -Dlogging.level.org.alfresco.transform.router.TransformerDebug=ERROR
            "
        ports:
          - 8090:8090
        volumes:
            - ./0200-increase-pdf-max-source.json:/0200-increase-pdf-max-source.json&lt;/PRE&gt;
&lt;P&gt;And the content of the&amp;nbsp;0200-increase-pdf-max-source.json file is:&lt;/P&gt;
&lt;PRE&gt;{
  "overrideSupported": [
    {
      "transformerName": "PdfBox",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    },
    {
      "transformerName": "TikaAuto",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    }
  ]
}
&lt;/PRE&gt;
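&lt;P&gt;As a variation, here is a minimal sketch that assumes a capped limit is preferred over removing the limit entirely. It reuses the same override format with a finite value. The 1073741824 bytes (1 GiB) figure is simply the number from the original question, not a recommended or tested setting, and -1 is assumed here to mean "no limit":&lt;/P&gt;
&lt;PRE&gt;{
  "overrideSupported": [
    {
      "transformerName": "PdfBox",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": 1073741824
    }
  ]
}&lt;/PRE&gt;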
&lt;P&gt;Hope this helps.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Oct 2022 09:38:36 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/alfresco-7-1-does-not-extract-text-from-pdf-a-of-large-files/m-p/146466#M38790</guid>
      <dc:creator>angelborroy</dc:creator>
      <dc:date>2022-10-26T09:38:36Z</dc:date>
    </item>
    <item>
      <title>Re: Alfresco 7.1 Does not extract text from PDF/A of large files</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/alfresco-7-1-does-not-extract-text-from-pdf-a-of-large-files/m-p/146467#M38791</link>
      <description>&lt;P&gt;It works for me on Alfresco 7.4 in Docker too.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2024 08:23:53 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/alfresco-7-1-does-not-extract-text-from-pdf-a-of-large-files/m-p/146467#M38791</guid>
      <dc:creator>Dagoo</dc:creator>
      <dc:date>2024-08-09T08:23:53Z</dc:date>
    </item>
  </channel>
</rss>

