Hyland Connect

dranakan · ‎12-24-2009

Hello,

I get an error message when I upload some PDFs to Alfresco (3.2r2, with MySQL):

…
ERROR [pdfbox.filter.FlateFilter] Stop reading corrupt stream
…

Looking in the PDFBox source, I found:


…
}
                    catch (OutOfMemoryError exception) 
                    {
                        // if the stream is corrupt an OutOfMemoryError may occur
                        log.error("Stop reading corrupt stream");
                    }
                    catch (ZipException exception) 
                    {

This appears right after installation (Alfresco is clean). I have tried increasing Alfresco's memory (JAVA_OPTS…) and checked with "top" that the JVM has enough memory allocated, but the message still comes.

Does anyone else have this problem?

neozone · ‎01-06-2010

I also get the same error while editing the wiki.

benswitzer · ‎01-06-2010

I get this message as well. I have found that the error occurs with PDFs that use an incompatible encoding. Don't ask me specifically which encoding, because I haven't been able to figure that out.

For example, if I scan a doc from our copier and send it to Alfresco, all is well. The PDF is indexed and a thumbnail is created in the Share site. If I open that PDF with Adobe Acrobat, make a change and re-save it, Alfresco throws an exception when I then move that file into the Share site. No thumbnail is created. In prior versions of Alfresco (< 3.2R2), Alfresco would eventually run out of memory if too many of these incompatible PDFs were encountered. This doesn't happen now, but we still see those exceptions.

Ben

dranakan · ‎01-11-2010

Thank you.

Yes, the problem must come from the PDF file itself.

Does anyone know a way to check whether a PDF is faulty (and to report what is wrong with it)?

Thanks

gyro_gearless · ‎02-17-2010

Hi friends,

I just found that on my Alfresco setup this error occurred on 13 of some 200 random PDF documents, so may I join the club?

Seriously, I consider this a major problem, for two reasons:

- As far as I understand PDFBox, the decoding of the faulty PDFs is terminated at some random point WITH NO ERROR INDICATED TO THE CALLING CONVERTER, because the exceptions in org.apache.pdfbox.filter.FlateFilter are caught and converted into that innocent log message. Imagine your CxO not finding that important business report from last year for that reason… guess who gets their ass kicked….

- And when I saw that OutOfMemoryError caught in PDFBox, I wanted to bang my head against the wall! WHEN I HAVE AN OUTOFMEMORYERROR IN MY APPLICATION, I WANT TO KNOW THAT!! I really have to know, since the continued operation of my Alfresco is seriously in danger… arghhhh!
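To make the first point concrete, here is a minimal sketch of the two error-handling styles. It is not the PDFBox code: it uses only the JDK's java.util.zip classes (the same deflate scheme that PDF's FlateDecode filter is based on), and the class and method names (FlateDecodeDemo, decodeSwallowing, decodeStrict, sampleData) are made up for illustration. The swallowing variant mimics the "log and carry on" pattern quoted at the top of the thread, so the caller receives truncated data without any error; the strict variant lets the exception reach the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class FlateDecodeDemo {

    /** Builds 4000 bytes of pseudo-random letters (fixed seed, so runs are repeatable). */
    static byte[] sampleData() {
        Random random = new Random(42);
        byte[] data = new byte[4000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) ('a' + random.nextInt(26));
        }
        return data;
    }

    /** Compresses data with zlib/deflate. */
    static byte[] deflate(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(data);
        }
        return out.toByteArray();
    }

    /** Strict style: any decoding problem is propagated to the caller. */
    static byte[] decodeStrict(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(compressed, out);
        return out.toByteArray();
    }

    /** Swallowing style: log the problem and return whatever was decoded so far. */
    static byte[] decodeSwallowing(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copy(compressed, out);
        } catch (IOException e) {
            // The caller never learns that the result is incomplete.
            System.err.println("Stop reading corrupt stream");
        }
        return out.toByteArray();
    }

    /** Inflates the compressed bytes into out; bytes decoded before a failure stay in out. */
    private static void copy(byte[] compressed, ByteArrayOutputStream out) throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[256];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = sampleData();
        byte[] compressed = deflate(original);
        // Simulate a corrupt stream by cutting the compressed data in half.
        byte[] corrupt = Arrays.copyOf(compressed, compressed.length / 2);

        byte[] partial = decodeSwallowing(corrupt);
        System.out.println("swallowing: got " + partial.length + " of " + original.length + " bytes, no error");

        try {
            decodeStrict(corrupt);
        } catch (IOException e) {
            System.out.println("strict: caller sees " + e.getClass().getSimpleName());
        }
    }
}
```

With the strict variant, an indexer or converter calling the decoder could at least mark the document as failed instead of silently indexing half of it.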

Well, I tried my luck with the current 1.0 snapshot from pdfbox.apache.org, but this was no better, so I'll propose replacing the PDFBox converter with some external command-line tool…. I'll post the configuration once it is working!

Cheers
Gyro

deepestblue · ‎02-19-2010

I'm also seeing this error, on Community Head revision 18722, and if the current Apache PDFBox gives the same problem, I guess this needs some upstream help.

dranakan · ‎03-02-2010

Hello,

Alfresco has been using 100% of my CPU for several days. I suspect the problem is related to this thread, taken from the jstack output (a thread dump of the Java process):


"DefaultScheduler_Worker-3" prio=10 tid=0x08f8b400 nid=0x86b runnable [0x62b82000]
   java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:92)
   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
   - locked <0x71801578> (a sun.nio.ch.ChannelInputStream)
   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
   - locked <0x718055a0> (a java.io.BufferedInputStream)
   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
   - locked <0x718055c0> (a java.io.BufferedInputStream)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
   - locked <0x718055e0> (a java.io.BufferedInputStream)
   at java.io.FilterInputStream.read(FilterInputStream.java:66)
   at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
   at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:84)
   at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:200)
   at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:870)
   at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:141)
   at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:213)
   at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:870)
   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:519)
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
   at org.alfresco.repo.content.transform.PdfBoxContentTransformer.transformInternal(PdfBoxContentTransformer.java:74)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:167)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:143)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.indexProperty(ADMLuceneIndexerImpl.java:948)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.createDocumentsImpl(ADMLuceneIndexerImpl.java:625)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.createDocuments(ADMLuceneIndexerImpl.java:590)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.updateFullTextSearch(ADMLuceneIndexerImpl.java:1569)
   at org.alfresco.repo.search.impl.lucene.fts.FullTextSearchIndexerImpl.index(FullTextSearchIndexerImpl.java:190)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
   at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
   at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
   at $Proxy70.index(Unknown Source)
   at org.alfresco.repo.search.impl.lucene.fts.FTSIndexerJob.execute(FTSIndexerJob.java:52)
   at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
   at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:529)

Do you have the same problem?

(I have posted my general problem here: http://forums.alfresco.com/en/viewtopic.php?f=8&t=21348#p82506)
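One generic way to keep a single runaway parse from occupying a worker indefinitely is to run it under a time limit. The sketch below shows the idea with plain java.util.concurrent; it is not how Alfresco's indexer is implemented, and the names TimeLimitedTask and runWithTimeout are hypothetical. Whether cancellation actually stops the work depends on the task reacting to thread interruption, which a tight parsing loop may not do.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeLimitedTask {

    /**
     * Runs the task on a separate daemon thread and waits at most timeoutMillis.
     * Throws TimeoutException if the task has not finished by then.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "time-limited-task");
            thread.setDaemon(true); // a stuck worker must not keep the JVM alive
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the task honours interruption.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes quickly returns its result normally.
        System.out.println(runWithTimeout(() -> "parsed", 1000));

        // A task that takes too long is abandoned after the time limit.
        try {
            runWithTimeout(() -> {
                Thread.sleep(5000); // stands in for a parse that never finishes
                return "too late";
            }, 200);
        } catch (TimeoutException e) {
            System.out.println("gave up after the time limit");
        }
    }
}
```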

opoplawski · ‎04-14-2010

Looks like an issue has been reported: https://issues.alfresco.com/jira/browse/ALF-1493. A possible fix may be to drop in the latest version of pdfbox (1.1.0).

slowlearner · ‎10-25-2010

I don't think 1.1.0 helps. I am a new user of 3.3g and encounter exactly the same problem. My install came with pdfbox-1.1.0.jar out of the box… or is it jar? - sorry

Most disturbing. Any help much appreciated.
Update:
I have also increased the lucene.indexer.maxfieldlength value to 1000000 and still get the problem. :x

gyro_gearless · ‎10-25-2010

We had good success by replacing the original PDFBox 0.8 with a current 1.2 version.
Previously, we had 79 PDFs that were not indexed; after the upgrade and reindexing only 10 remained unindexed! And in the end these 10 proved to be corrupt, for example there were JPEGs saved as PDF and the like 🙂

Cheers
Gyro
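For the "JPEG saved as PDF" kind of file, a cheap pre-check is to look at the first bytes: a PDF starts with "%PDF-", while a JPEG starts with the bytes FF D8 FF. The sketch below is only an illustration (the class PdfSignatureCheck and its methods are made-up names), it checks strictly at offset 0 even though some readers tolerate leading junk, and it does not detect corrupt compressed streams inside an otherwise valid PDF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class PdfSignatureCheck {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** True if the bytes begin with the PDF header "%PDF-". */
    static boolean startsWithPdfHeader(byte[] head) {
        return head.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** True if the bytes begin with the JPEG start-of-image marker FF D8 FF. */
    static boolean startsWithJpegMarker(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xD8
                && (head[2] & 0xFF) == 0xFF;
    }

    /** Returns "pdf", "jpeg" or "unknown" based on the leading bytes only. */
    static String classify(byte[] head) {
        if (startsWithPdfHeader(head)) {
            return "pdf";
        }
        if (startsWithJpegMarker(head)) {
            return "jpeg";
        }
        return "unknown";
    }

    /** Usage: java PdfSignatureCheck.java file1.pdf file2.pdf ... */
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] buffer = new byte[8];
            int read;
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                // A single read is enough for a sketch; it may return fewer than 8 bytes.
                read = in.read(buffer);
            }
            byte[] head = Arrays.copyOf(buffer, Math.max(read, 0));
            System.out.println(name + ": " + classify(head));
        }
    }
}
```

Running it over the documents that failed to index would quickly separate files of the wrong type from PDFs that are damaged internally.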

Alf 32r2 - Pdfbox - Stop reading corrupt stream