Hyland Connect

alexandra · ‎09-19-2011

We have a strange problem. We scan documents on our copier, open the files in Acrobat, run OCR on them, and save them in Acrobat 7 format. After upload, the thumbnail does not show. However, the document preview using the Flash component works fine, and the document text is searchable.

Some parts from the log file:

17:10:32,469 User:System ERROR [graphics.xobject.PDPixelMap] java.io.IOException: Unknown stream filter:COSName{JPXDecode}
java.io.IOException: Unknown stream filter:COSName{JPXDecode}
   at org.apache.pdfbox.filter.FilterManager.getFilter(FilterManager.java:103)
   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:249)
   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
   at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
   at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:211)
   at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:465)
   at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:141)
   at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:74)
   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:567)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:250)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:208)
   at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:112)
   at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:718)
   at org.alfresco.repo.content.transform.PdfBoxPdfToImageContentTransformer.transformInternal(PdfBoxPdfToImageContentTransformer.java:85)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:161)
   at org.alfresco.repo.content.transform.FailoverContentTransformer.transformInternal(FailoverContentTransformer.java:158)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:161)
   at org.alfresco.repo.content.transform.ComplexContentTransformer.transformInternal(ComplexContentTransformer.java:225)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:161)
   at org.alfresco.repo.content.ContentServiceImpl.transform(ContentServiceImpl.java:555)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
   at net.sf.acegisecurity.intercept.method.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:80)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
   at org.alfresco.repo.model.ml.MLContentInterceptor.invoke(MLContentInterceptor.java:125)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
   at org.alfresco.repo.security.permissions.impl.ExceptionTranslatorMethodInterceptor.invoke(ExceptionTranslatorMethodInterceptor.java:44)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
   at org.alfresco.repo.audit.AuditMethodInterceptor.proceedWithAudit(AuditMethodInterceptor.java:217)
   at org.alfresco.repo.audit.AuditMethodInterceptor.proceed(AuditMethodInterceptor.java:184)
   at org.alfresco.repo.audit.AuditMethodInterceptor.invoke(AuditMethodInterceptor.java:137)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
   at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:107)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
   at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
   at $Proxy45.transform(Unknown Source)
   at org.alfresco.repo.rendition.executer.AbstractTransformationRenderingEngine.render(AbstractTransformationRenderingEngine.java:71)
   at org.alfresco.repo.rendition.executer.AbstractRenderingEngine.executeRenditionImpl(AbstractRenderingEngine.java:497)
   at org.alfresco.repo.rendition.executer.AbstractRenderingEngine$2.doWork(AbstractRenderingEngine.java:429)
   at org.alfresco.repo.rendition.executer.AbstractRenderingEngine$2.doWork(AbstractRenderingEngine.java:410)
   at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:508)
   at org.alfresco.repo.rendition.executer.AbstractRenderingEngine.executeImpl(AbstractRenderingEngine.java:409)
   at org.alfresco.repo.rendition.executer.AbstractRenderingEngine.executeImpl(AbstractRenderingEngine.java:373)
   at org.alfresco.repo.action.executer.ActionExecuterAbstractBase.execute(ActionExecuterAbstractBase.java:133)
   at org.alfresco.repo.action.ActionServiceImpl.directActionExecution(ActionServiceImpl.java:749)
   at org.alfresco.repo.action.ActionServiceImpl.executeActionImpl(ActionServiceImpl.java:675)
   at org.alfresco.repo.action.AsynchronousActionExecutionQueueImpl$ActionExecutionWrapper$1$1.execute(AsynchronousActionExecutionQueueImpl.java:443)
   at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:381)
   at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:253)
   at org.alfresco.repo.action.AsynchronousActionExecutionQueueImpl$ActionExecutionWrapper$1.doWork(AsynchronousActionExecutionQueueImpl.java:452)
   at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:508)
   at org.alfresco.repo.action.AsynchronousActionExecutionQueueImpl$ActionExecutionWrapper.run(AsynchronousActionExecutionQueueImpl.java:455)
   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:619)

afaust · ‎09-19-2011

Hello,

The version of PDFBox used in Alfresco does not support JPEG2000 images. The failure only occurs during conversion to images, since conversion to PDF is handled by an external tool and conversion to text ignores any images.

A newer version of PDFBox supports JPEG2000, but Alfresco would have to update its dependency to pick it up.
See https://issues.apache.org/jira/browse/PDFBOX-554
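To check which scanned files are affected before uploading them, one rough option is to look for the filter name from the log message in the raw PDF bytes. The following is only a minimal sketch using the plain JDK, not PDFBox or any Alfresco API. The class and method names are made up for illustration, and a naive byte search can miss unusual encodings of the name or report a match that is not actually an image filter.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: naive scan for the JPEG2000 filter name in a PDF file.
class JpxFilterScan {
    // PDF name token of the stream filter reported in the stack trace.
    private static final byte[] NEEDLE =
            "/JPXDecode".getBytes(StandardCharsets.ISO_8859_1);

    /** Returns true if the byte sequence "/JPXDecode" occurs anywhere in the data. */
    public static boolean mentionsJpxDecode(byte[] pdf) {
        outer:
        for (int i = 0; i + NEEDLE.length <= pdf.length; i++) {
            for (int j = 0; j < NEEDLE.length; j++) {
                if (pdf[i + j] != NEEDLE[j]) {
                    continue outer; // mismatch, try the next start offset
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // With a path argument the file is scanned; otherwise a tiny inline
        // dictionary fragment is used so the sketch runs on its own.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "<< /Subtype /Image /Filter /JPXDecode >>"
                        .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(mentionsJpxDecode(data)
                ? "JPXDecode found"
                : "no JPXDecode found");
    }
}
```

A file that reports a match is a likely candidate for the blank thumbnail described above; a file without a match is probably using a different image compression.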

You should log this in the Alfresco JIRA as either a bug or enhancement (depends on your point of view and the phrasing of the ticket - I would log it as an enhancement named "Update PDFBox library" and use your case as an example of added functionality).

Regards

alexandra · ‎09-19-2011

Thank you for the answer! Does that mean that image creation for thumbnails uses a different component than the Flash preview for the whole document?

Since Alfresco is marketed as an imaging solution for things like invoices, do you know how people handle this today?

Are there any OCR tools that do not use JPEG2000, or how else can we solve this while waiting for an enhancement? What about the Kofax solution?

Do you think this is the same for the upcoming release of Alfresco?

afaust · ‎09-19-2011

Hello,

The upcoming release of Alfresco uses a newer version of the PDFBox library (1.5 in my snapshot from community-trunk, the most recently published by Apache being 1.6), so I'd expect this problem to be resolved then (as the PDFBox enhancement was included in 1.3.1).

I actually can't say how people other than us are handling this as information sharing on this level of detail is rather limited. What we do for OCR solutions in terms of Portals or ECM systems is to have an OCR tool extract text and use this for indexing only. We keep the scanned original file as-is and do not update it in any way. This prevents these types of issues as it achieves an independence from the way a proprietary conversion-based OCR tool handles passages it can't process.
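The separation described above (original scan stored untouched, OCR text used for indexing only) can be sketched roughly as follows. This is an illustration under assumptions, not an actual implementation: the class name, the in-memory maps, and the OCR function passed to the constructor are all hypothetical stand-ins for a real content store, search index, and OCR tool.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical sketch: keep the scan as-is, index the OCR text separately.
class OcrSidecarIndex {
    private final Map<String, byte[]> originals = new HashMap<>();
    private final Map<String, Set<String>> invertedIndex = new HashMap<>();
    private final Function<byte[], String> ocr; // stand-in for a real OCR tool

    public OcrSidecarIndex(Function<byte[], String> ocr) {
        this.ocr = ocr;
    }

    /** Stores the scan unchanged and indexes the words the OCR step returns. */
    public void store(String id, byte[] scan) {
        originals.put(id, scan.clone()); // the file itself is never rewritten
        String text = ocr.apply(scan);
        // Simple ASCII-oriented tokenizer; a real index would do far more.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                invertedIndex.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Returns a copy of the stored original, or null if the id is unknown. */
    public byte[] original(String id) {
        byte[] bytes = originals.get(id);
        return bytes == null ? null : bytes.clone();
    }

    /** Returns the ids of all documents whose OCR text contains the term. */
    public Set<String> search(String term) {
        return invertedIndex.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        // Fake OCR function for the demo; it ignores the bytes entirely.
        OcrSidecarIndex index = new OcrSidecarIndex(bytes -> "Invoice 4711 total");
        index.store("doc-1", new byte[] {1, 2, 3});
        System.out.println(index.search("invoice"));
    }
}
```

Because the OCR output never touches the stored file, the way a particular OCR tool re-encodes images has no effect on thumbnail or preview generation.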

I have no information on how Kofax may be affected or circumvents this pitfall. I also have no comprehensive information on what types of embedded objects other OCR tools use for image elements.

I'd advise waiting for the community release of Alfresco 4.0 in the coming weeks and verifying this then.

Regards

alexandra · ‎09-19-2011

Again, thank you very much for the information. Great to have this community support. Good to hear that there are new versions of the components in the 4.0 release. Since we are still a few weeks from going into production for the initial groups of users we can afford to wait for that release.

I will try to learn more about different implementations of the PDF standard and how different OCR software handles this. Our initial solution, where we use client-side Acrobat, is of course a temporary one. We would like to have OCR implemented server-side with something like Kofax or Intelliant:

http://www.intelliant.fr/en/ocr-document-management-systems-tutorial1.php

If the problem persists, I wonder whether moving to a black-and-white color space in the PDF might force the compression engine to choose something other than JPEG2000.

michaelk · ‎10-29-2011

As of 4.0.b this problem still exists. Scanned PDFs show a blank static preview. The swf (flash) preview does work.

Thumbnails for PDF files after OCR fails