I read somewhere in the forums that Alfresco can be configured to use OpenOffice for the MS Word -> text transformation needed for full-text indexing of Microsoft Word files. I have searched extensively on the web but could not find an example of how to do this configuration.
I know that at the moment the TextMining library is used for the .DOC to .TXT conversion, both for specific actions and for preparing text for FTS indexing with Lucene.
But I have an MS Word file that cannot be fully indexed. Words from the end of the document cannot be found with a Lucene search query.
I transformed the .DOC to a .TXT file (using the Run Action "Transform and copy to destination") and found that the .TXT file is truncated: the last 25% of the file is missing, which is obviously why those words cannot be found by a Lucene TEXT query. I'm guessing that the TextMining library is the problem. I ran a test with the POI 3.5 beta4-20081128 library and the text was extracted perfectly.
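For anyone who wants to reproduce the comparison, here is a rough sketch of how the truncation could be measured once both extractions have been saved as text files (for example the TextMining output from Alfresco and the POI output). The class name TruncationCheck, its method names, and the UTF-8 file encoding are assumptions made for illustration only; nothing here is part of Alfresco, TextMining or POI. It also assumes the truncated text is a plain prefix of the complete one, which may not hold if the two extractors format the text differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: estimates how much of a reference text is missing
// from a second, possibly truncated, extraction of the same document.
public class TruncationCheck {

    // Collapse whitespace runs so that layout differences between
    // extractors are not counted as mismatches.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Number of leading characters that both strings share.
    static int commonPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Fraction (0.0 to 1.0) of the reference text that is not covered by
    // the shared prefix. Assumes the extraction is a truncated prefix.
    static double missingFraction(String reference, String extracted) {
        String ref = normalize(reference);
        String ext = normalize(extracted);
        if (ref.isEmpty()) {
            return 0.0;
        }
        return 1.0 - (double) commonPrefixLength(ref, ext) / ref.length();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java TruncationCheck <reference.txt> <extracted.txt>");
            return;
        }
        // Assumption: both files are UTF-8 encoded.
        String reference = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String extracted = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        System.out.printf("%.1f%% of the reference text is missing%n",
                100.0 * missingFraction(reference, extracted));
    }
}
```

Run it with the complete text as the first argument and the suspect text as the second. A result near zero would suggest the two extractions agree; a large value points at truncation or at an early formatting difference between the two files.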
I have filed a bug in JIRA (ALFCOM-2527) and am waiting for a solution, but until then my question is:
Is there any way to configure Alfresco to use OpenOffice to convert .DOC files to .TXT instead of TextMiningContentTransformer?
I'm using Alfresco 3.0 Stable, OpenSUSE 11.1 i586 (32-bit), a Pentium IV 2 GB machine, and Sun Java(TM) SE Runtime Environment (build 1.6.0_11-b03).
Thanks in advance,
Teo