Hyland Connect

dmorozov · ‎05-12-2011

Hello,
I spent the last week fighting with Alfresco running terribly slowly, which I think is caused by Tika transformations happening in the background.
Please advise how to solve this issue.

We have Alfresco 3.4.d installed on Ubuntu 64 bit server.
RAM: 16G
CPU: 4
JVM settings: -Djava.awt.headless=true -server -Xss1M -Xms1G -Xmx4G -XX:NewSize=1G -XX:MaxPermSize=512M -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:CMSInitiatingOccupancyFraction=80 -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:+UseTLAB
Database -> MySQL on separate server
Content repository size: 23G
Server: Apache tomcat 6.0.32

Alfresco starts with about 1.5G of memory, and after some time memory usage jumps to 4.6G.
This seems okay while it has good throughput: no slowness, no errors.

But after some time it gets really slow and then hangs, even if nobody has used the site for some time. I don't know what causes the issue, but here is what I have:
1. Linux top shows average CPU utilization of about 25% (so I assume one of the 4 CPUs is loaded at ~100%?)
2. A thread dump (kill -3 PID) always shows the same picture. The only really interesting thread that always shows up during the slowness is the Tika transformer for Excel files, started from the full-text search job:

"DefaultScheduler_Worker-2" prio=10 tid=0x0000000041280800 nid=0x7891 runnable [0x00007fe37f4f2000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.xmlbeans.impl.store.Locale.count(Locale.java:2049)
        at org.apache.xmlbeans.impl.store.Xobj.count_elements(Xobj.java:2050)
        at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTSheetDataImpl.sizeOfRowArray(Unknown Source)
        - locked <0x00007fe3e3352a58> (a org.apache.xmlbeans.impl.store.Locale)
        at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTSheetDataImpl$1RowList.size(Unknown Source)
        at java.util.AbstractList$Itr.hasNext(AbstractList.java:339)
        at org.apache.poi.xssf.usermodel.XSSFSheet.initRows(XSSFSheet.java:177)
        at org.apache.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:147)
        at org.apache.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:134)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:234)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:182)
        at org.apache.poi.xssf.extractor.XSSFExcelExtractor.<init>(XSSFExcelExtractor.java:56)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:172)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:68)
        at org.alfresco.repo.content.TikaOfficeDetectParser.parse(TikaOfficeDetectParser.java:78)
        at org.alfresco.repo.content.transform.TikaPoweredContentTransformer.transformInternal(TikaPoweredContentTransformer.java:185)
        at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:161)
        at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:137)
        at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.indexProperty(ADMLuceneIndexerImpl.java:944)
        at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.createDocumentsImpl(ADMLuceneIndexerImpl.java:620)
        at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.createDocuments(ADMLuceneIndexerImpl.java:585)
        at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.updateFullTextSearch(ADMLuceneIndexerImpl.java:1580)
        at org.alfresco.repo.search.impl.lucene.fts.FullTextSearchIndexerImpl.index(FullTextSearchIndexerImpl.java:217)
        at sun.reflect.GeneratedMethodAccessor329.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:107)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
        at $Proxy79.index(Unknown Source)
        at org.alfresco.repo.search.impl.lucene.fts.FTSIndexerJob.execute(FTSIndexerJob.java:46)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)

3. JProfiler showed that memory allocation is mostly caused by the same Tika transformer classes.
Most memory is taken by the xmlbeans, poi and openxmlformats packages, and the allocation tree showed the same transformation job.

4. A full re-index completed without any issues.

Can anybody suggest what else I can do, and what the reason for all of this is?
Is it common for Alfresco to take almost 5G of RAM?
How can I disable CONTENT indexing for Excel files (it doesn't make sense for me)?
I believe users can upload pretty big Excel files into the repository (say 3M-10M). Can that cause the issue?

Any suggestions are appreciated.
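
When comparing several thread dumps taken during a slowdown, a small filter can help confirm that the same job keeps showing up. The following is only a minimal sketch and not part of Alfresco: the class name ThreadDumpFilter and its method are hypothetical, and it assumes the dump layout shown above (a quoted thread name line, then a java.lang.Thread.State line, then stack frames).

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists RUNNABLE threads whose stack mentions a given frame.
public class ThreadDumpFilter {

    public static List<String> runnableThreadsWithFrame(String dump, String frame) {
        List<String> hits = new ArrayList<>();
        String name = null;       // thread currently being read
        boolean runnable = false; // saw "Thread.State: RUNNABLE" for it
        boolean matched = false;  // saw the frame substring in its stack
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"")) {
                // A quoted name starts a new thread entry; flush the previous one.
                if (name != null && runnable && matched) {
                    hits.add(name);
                }
                int end = line.indexOf('"', 1);
                name = end > 0 ? line.substring(1, end) : line;
                runnable = false;
                matched = false;
            } else if (line.contains("java.lang.Thread.State: RUNNABLE")) {
                runnable = true;
            } else if (line.contains(frame)) {
                matched = true;
            }
        }
        if (name != null && runnable && matched) {
            hits.add(name);
        }
        return hits;
    }

    // Usage: java ThreadDumpFilter <dump-file> <frame-substring>
    public static void main(String[] args) throws Exception {
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        for (String thread : runnableThreadsWithFrame(dump, args[1])) {
            System.out.println(thread);
        }
    }
}
```

Running it over each saved dump with TikaPoweredContentTransformer as the frame substring should print the scheduler worker thread every time if the transformer really is the common factor.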

dmorozov · ‎05-12-2011

By the way, we have enough free space on the hard drive, so this is not the issue.

Additionally, I tried to run Alfresco with a 2G max heap size and saw exactly the same issue.
After some time (5-15 min) memory was pumped up to the max, but it was still handling all HTTP requests properly.
Eventually it slowed down and hung. CPU utilization shown in JConsole was 100%, caused by GC running too often trying to release memory.

We have a small number of users working with Alfresco (10-20 maximum).
We have a reporting tool traversing the JCR tree to generate a report, but JProfiler didn't show any memory leaks related to that.
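
The GC pressure described above can also be checked from inside a JVM with the standard management beans. This is a rough sketch under stated assumptions: the class name GcOverhead is hypothetical, and the code reports on the JVM it runs in, so to say anything about Alfresco it would have to run inside the Alfresco JVM (or the same beans would be read remotely over JMX, as JConsole does).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: share of JVM uptime spent in garbage collection.
public class GcOverhead {

    // Sums the accumulated collection time (ms) over all collectors.
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Percentage of the given uptime that was spent in GC.
    public static double overheadPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gc, uptime, overheadPercent(gc, uptime));
    }
}
```

The figure is cumulative since JVM start, so sampling it twice a minute apart and comparing the difference gives a better picture of what is happening during the slowdown itself.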

vmm · ‎05-13-2011

Hi,

We have the same problem with 3.4.d.

For us the problem occurs when Alfresco tries to index an .xlsx file that has one million rows.

The thread dump:

"schedulerFactory_Worker-4" - Thread t@30
   java.lang.Thread.State: RUNNABLE
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClassCond(Unknown Source)
   at java.lang.ClassLoader.defineClass(Unknown Source)
   at java.security.SecureClassLoader.defineClass(Unknown Source)
   at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:2733)
   - locked org.apache.catalina.loader.WebappClassLoader@1dec1dd
   at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:1124)
   at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1612)
   - locked org.apache.catalina.loader.WebappClassLoader@1dec1dd
   at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1491)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Unknown Source)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplClass(SchemaTypeImpl.java:1709)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor(SchemaTypeImpl.java:1725)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1853)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1021)
   at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:893)
   at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1657)
   at org.apache.xmlbeans.impl.store.Xobj.find_element_user(Xobj.java:2062)
   at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTRowImpl.getCArray(Unknown Source)
   - locked org.apache.xmlbeans.impl.store.Locale@3f3b50
   at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTRowImpl$1CList.get(Unknown Source)
   at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTRowImpl$1CList.get(Unknown Source)
   at java.util.AbstractList$Itr.next(Unknown Source)
   at org.apache.poi.xssf.usermodel.XSSFRow.<init>(XSSFRow.java:66)
   at org.apache.poi.xssf.usermodel.XSSFSheet.initRows(XSSFSheet.java:178)
   at org.apache.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:147)
   at org.apache.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:134)
   at org.apache.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:234)
   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
   at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:182)
   at org.apache.poi.xssf.extractor.XSSFExcelExtractor.<init>(XSSFExcelExtractor.java:56)
   at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:172)
   at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:68)
   at org.alfresco.repo.content.TikaOfficeDetectParser.parse(TikaOfficeDetectParser.java:78)
   at org.alfresco.repo.content.transform.TikaPoweredContentTransformer.transformInternal(TikaPoweredContentTransformer.java:185)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:161)
   at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:137)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.indexProperty(ADMLuceneIndexerImpl.java:944)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.createDocumentsImpl(ADMLuceneIndexerImpl.java:620)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.createDocuments(ADMLuceneIndexerImpl.java:585)
   at org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerImpl.updateFullTextSearch(ADMLuceneIndexerImpl.java:1580)
   at org.alfresco.repo.search.impl.lucene.fts.FullTextSearchIndexerImpl.index(FullTextSearchIndexerImpl.java:217)
   at sun.reflect.GeneratedMethodAccessor406.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
   at java.lang.reflect.Method.invoke(Unknown Source)
   at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
   at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:107)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
   at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
   at $Proxy82.index(Unknown Source)
   at org.alfresco.repo.search.impl.lucene.fts.FTSIndexerJob.execute(FTSIndexerJob.java:46)
   at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
   at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)

loftux · ‎05-13-2011

I'm not aware of a way to prevent indexing of a specific file, but if you know which file it is, put a password on it. That way the transformation will fail (rather than halting the whole server). If needed, write the password in the description field so that those who need to open it can do so.

It would just be a workaround, and not a terribly good one; I just thought it may be worth having as an option.

Another would be to disable content transformations for all Excel files until this is properly resolved. Have a look in content-services-context.xml to find out which beans to change/disable. Maybe comment out the bean <bean id="transformer.Poi"; that way I think it will use OpenOffice instead. But you have to try this out.

mrogers · ‎05-13-2011

I've had a chat with our resident Tika Guru who pointed out that this sounds like TIKA-521 which was fixed late last year.

So you may like to try the latest version of Tika.

vmm · ‎05-16-2011

We tried with Tika 0.9 and had the same problem.

For the moment we have commented out the mappings for xlsx and xls files in mimetype-map.xml, so Alfresco stores Excel files as binaries.

mrogers · ‎05-16-2011

You need version 1.0 or above…

dmorozov · ‎05-20-2011

Thank you very much, it helped.

So here is the solution for others:
1. Check out the Tika sources trunk (google for it)
2. Build.

Notes: It will create a Tika 1.0 snapshot version.
The only issue I had with compilation was a missing jdom artifact in the tika-parsers submodule. Just add this dependency to tika-parsers/pom.xml and build.

Additionally, make sure that you update all dependent libraries. For example, it brings in new versions of the Apache POI (3.8-beta2) and PDFBox (1.5) libraries.
In my case I just created an empty web application with Maven and added tika-core and tika-parsers as dependencies. Maven will collect all required libs for you.
Then you just need to make sure that your Alfresco has the correct versions. Add the extra libraries resolved by Maven just in case.

After patching Tika I am able to run my Alfresco with a 2G heap size, and average memory usage is about 600M with jumps up to 2G during document re-indexing.
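
Checking that the lib folder does not end up with two versions of the same library can be automated. Below is a minimal sketch, assuming jars follow the usual name-version.jar naming. The class name JarVersionCheck is hypothetical, and the directory to scan (for example the webapp's WEB-INF/lib) is passed as an argument rather than assumed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups jar file names by artifact to spot mixed versions.
public class JarVersionCheck {

    // "<artifact>-<version>.jar", where the version starts with a digit.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    public static Map<String, List<String>> versionsByArtifact(Collection<String> jarNames) {
        Map<String, List<String>> out = new TreeMap<>();
        for (String name : jarNames) {
            Matcher m = JAR.matcher(name);
            if (m.matches()) {
                out.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
            }
        }
        return out;
    }

    // Usage: java JarVersionCheck <lib-directory>
    public static void main(String[] args) {
        File[] files = new File(args[0]).listFiles();
        List<String> names = new ArrayList<>();
        if (files != null) {
            for (File f : files) {
                names.add(f.getName());
            }
        }
        for (Map.Entry<String, List<String>> e : versionsByArtifact(names).entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println("Multiple versions of " + e.getKey() + ": " + e.getValue());
            }
        }
    }
}
```

Any artifact reported with more than one version (for instance an old and a new tika-core or poi jar side by side) is a candidate for removal before restarting.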

promethius · ‎02-07-2012

I am having the exact same issue. Could you give me step-by-step instructions on how you fixed this? Thanks.

mrogers · ‎02-07-2012

Just upgrade to 4.0.


Alfresco indexing slow due to transformation