Alf 32r2 - Pdfbox - Stop reading corrupt stream

dranakan
Champ on-the-rise
Hello,

I get an error message when uploading some PDFs to Alfresco (3.2r2, with MySQL):


ERROR [pdfbox.filter.FlateFilter] Stop reading corrupt stream
….

Looking in the PDFBox source, I found:


}
                    catch (OutOfMemoryError exception)
                    {
                        // if the stream is corrupt an OutOfMemoryError may occur
                        log.error("Stop reading corrupt stream");
                    }
                    catch (ZipException exception)
                    {


This appears right after installation (Alfresco is clean). I have tried increasing the memory available to Alfresco (JAVA_OPTS…) and checked with "top" that the JVM has enough memory allocated, but the message still appears.

Does anyone else have this problem?
13 REPLIES 13

neozone
Champ in-the-making
I also get the same error while editing the wiki.

benswitzer
Champ in-the-making
I get this message as well.  I have found that the error occurs with PDFs that have an incompatible encoding.  Don't ask me which encoding specifically, because I haven't been able to figure that out.

For example, if I scan a doc from our copier and send it to Alfresco, all is well.  The PDF is indexed and a thumbnail is created in the Share site.  If I open that PDF with Adobe Acrobat, make a change and re-save it, Alfresco throws an exception when I then move that file into the Share site.  No thumbnail is created.  In prior versions of Alfresco (< 3.2R2), Alfresco would eventually run out of memory if too many of these incompatible PDFs were encountered.  This doesn't happen now, but we still see those exceptions.

Ben

dranakan
Champ on-the-rise
Thank you.

Yes, the problem seems to come from the PDF file itself.

Does anyone know a way to check whether a PDF is invalid, and to report what is wrong with it?

Thanks

gyro_gearless
Champ in-the-making
Hi friends,

I just found that on my Alfresco setup this error occurred on 13 of some 200 random PDF documents, so may I join the club?

Seriously, I consider this a major problem, for two reasons:

- As far as I understand PDFBox, the decoding of a faulty PDF is terminated at some random point WITH NO ERROR INDICATED TO THE CALLING CONVERTER, because the exceptions in org.apache.pdfbox.filter.FlateFilter are caught and converted into that innocent log message. Imagine your CxO not finding that important business report from last year for that reason… guess who gets their ass kicked…

- And when I saw that OutOfMemoryError caught in PDFBox, I wanted to bang my head against the wall! WHEN I HAVE AN OUTOFMEMORYERROR IN MY APPLICATION, I WANT TO KNOW ABOUT IT!! I really have to know, since the continued operation of my Alfresco is seriously in danger… arghhhh!
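
To make the first point concrete, here is a minimal sketch of the behaviour I would expect instead: a Flate (zlib) decoder that reports a corrupt or truncated stream to its caller rather than logging it and carrying on. It uses only java.util.zip from the JDK. The class name StrictInflate and its methods are made up for this example; this is not how PDFBox is implemented.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox or Alfresco.
final class StrictInflate {

    /** Inflates zlib data; throws IOException if the stream is corrupt or truncated. */
    static byte[] inflate(byte[] compressed) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // Input ran out before the end-of-stream marker was reached,
                    // or the stream needs a preset dictionary we do not have.
                    throw new IOException("Truncated or unsupported Flate stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            // Surface the corruption to the caller instead of only logging it.
            throw new IOException("Corrupt Flate stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Demo helper: zlib-compresses a byte array. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

With something like this, a converter calling StrictInflate.inflate(...) gets an IOException for a bad stream and can mark the document as failed, instead of silently indexing half of it.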

Well, I tried my luck with the current 1.0 snapshot from pdfbox.apache.org, but it was no better, so I propose replacing the PDFBox converter with some external command-line tool…. I'll post the configuration once it is working!

Cheers
Gyro

deepestblue
Champ in-the-making
I'm also seeing this error, on Community HEAD revision 18722, and if Apache PDFBox itself gives the same problem, I guess this needs fixing upstream.

dranakan
Champ on-the-rise
Hello,

Alfresco has been using 100% of my CPU for several days. I suspect the problem is related to this thread, taken from jstack output (a thread dump of the Java process):

"DefaultScheduler_Worker-3" prio=10 tid=0x08f8b400 nid=0x86b runnable [0x62b82000]
   java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:92)
   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
   - locked <0x71801578> (a sun.nio.ch.ChannelInputStream)
   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
   - locked <0x718055a0> (a java.io.BufferedInputStream)
   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
   - locked <0x718055c0> (a java.io.BufferedInputStream)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
   - locked <0x718055e0> (a java.io.BufferedInputStream)
   at java.io.FilterInputStream.read(FilterInputStream.java:66)
   at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
   at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:84)
   at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:200)
   at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:870)
   at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:141)
   at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:213)
   at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:870)
   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:519)
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
   at org.alfresco.repo.content.transform.PdfBoxContentTransformer.transformInternal(PdfBoxContentTransformer.java:74)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:167)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:143)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.indexProperty(ADMLuceneIndexerImpl.java:948)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.createDocumentsImpl(ADMLuceneIndexerImpl.java:625)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.createDocuments(ADMLuceneIndexerImpl.java:590)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.updateFullTextSearch(ADMLuceneIndexerImpl.java:1569)
   at org.alfresco.repo.search.impl.lucene.fts.FullTextSearchIndexerImpl.index(FullTextSearchIndexerImpl.java:190)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
   at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
   at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
   at $Proxy70.index(Unknown Source)
   at org.alfresco.repo.search.impl.lucene.fts.FTSIndexerJob.execute(FTSIndexerJob.java:52)
   at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
   at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:529)

Do you have the same problem?

(I have posted my general problem here: http://forums.alfresco.com/en/viewtopic.php?f=8&t=21348#p82506)
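
One way to keep a single bad document from blocking the caller indefinitely is to run the parse with a time limit. Below is a minimal sketch using only java.util.concurrent. BoundedTask and runWithTimeout are hypothetical names for illustration, not Alfresco or PDFBox API. Note the limitation: Future.cancel(true) only sends an interrupt, so a parser loop that never checks for interruption may keep burning CPU. The sketch bounds how long the caller waits, not necessarily how long the worker runs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Alfresco or PDFBox.
final class BoundedTask {

    /**
     * Runs the task on a separate thread and waits at most timeoutMillis.
     * On timeout the task is interrupted (best effort) and the
     * TimeoutException is rethrown to the caller.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // only requests interruption of the worker
            throw e;
        } finally {
            executor.shutdownNow(); // do not accept further work on this executor
        }
    }
}
```

A transformer could wrap its document-loading call in a Callable, pass it to runWithTimeout with a limit of its choosing, and treat a TimeoutException as "could not index this document".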

opoplawski
Champ in-the-making
Looks like an issue has been reported: https://issues.alfresco.com/jira/browse/ALF-1493.  Looks like a possible fix may be to drop in the latest version of pdfbox (1.1.0).

slowlearner
Champ in-the-making
I don't think 1.1.0 helps. I am a new user of 3.3g and encounter exactly the same problem. My install came with pdfbox-1.1.0.jar out of the box… or is it jar? Sorry :)
Most disturbing. Any help would be much appreciated.
Update:
I have also increased the lucene.indexer.maxfieldlength value to 1000000 and still get the problem.  :x

gyro_gearless
Champ in-the-making
We had good success replacing the original PDFBox 0.8 with a current 1.2 version.
Previously, we had 79 PDFs that were not indexed; after the upgrade and reindexing, only 10 remained unindexed! And in the end these 10 proved to be corrupt: for example, there were JPEGs saved as PDF and the like 🙂

Cheers
Gyro
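
For gross cases like the JPEG-saved-as-PDF files mentioned above, a quick look at the first and last bytes is enough to flag a file before it reaches the indexer. The sketch below uses only the JDK. PdfSanityCheck and its methods are hypothetical names, and the 1024-byte window for the "%%EOF" marker is a heuristic assumed for this example. It will not detect a corrupt Flate stream inside an otherwise well-formed PDF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: coarse checks only, not a PDF validator.
final class PdfSanityCheck {

    /** True if the data begins with the "%PDF-" marker (strict: must be at byte 0). */
    static boolean hasPdfHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if "%%EOF" appears within the last 1024 bytes (assumed window). */
    static boolean hasEofMarker(byte[] data) {
        int start = Math.max(0, data.length - 1024);
        // ISO-8859-1 maps every byte to one char, so binary data is safe here.
        String tail = new String(data, start, data.length - start, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    /** True if the first two bytes are the JPEG start-of-image marker FF D8. */
    static boolean looksLikeJpeg(byte[] data) {
        return data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xD8;
    }
}
```

To try it on a real file, read the bytes with java.nio.file.Files.readAllBytes(...) and call the three methods. A file that fails hasPdfHeader but passes looksLikeJpeg is most likely an image with a .pdf extension.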