POI - Extracting text from MSWord document

stebans
Champ in-the-making
Hi,

I would like to know how to extract a piece of text from an uploaded MS Word document. The extracted text is to be used as metadata.

It's easy to extract text from Word with POI, using the HWPFDocument class present in POI version 3.0+ (poi-scratchpad-3.0.1-FINAL-20070705.jar). Unfortunately, POI hasn't been updated in Alfresco-2.1, which still ships POI-2.5.1 without the extra lib. I cannot simply add the scratchpad jar, since it depends on POI-3.0+ (if I'm right).
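
To make the dependency problem concrete, here is a minimal sketch of checking at runtime whether HWPFDocument is available and, if so, reading the document text through it. The helper class HwpfProbe and its method names are hypothetical, not part of Alfresco or POI. Reflection is used only so that the sketch still compiles when the scratchpad jar is missing from the classpath, and it assumes the HWPFDocument(InputStream) constructor plus getRange() and Range.text() from POI 3.x:

```java
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Alfresco or POI.
final class HwpfProbe {
    // Class named in the post; it lives in the POI scratchpad jar (POI 3.0+).
    static final String HWPF_DOCUMENT = "org.apache.poi.hwpf.HWPFDocument";

    private HwpfProbe() {}

    // Returns true when the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, HwpfProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Reads the document text via HWPFDocument, or returns null if HWPF is absent.
    // Assumes POI 3.x: new HWPFDocument(InputStream), getRange(), Range.text().
    static String extractText(InputStream in) throws Exception {
        if (!isOnClasspath(HWPF_DOCUMENT)) {
            return null; // only POI 2.5.1 (or no POI at all) is deployed
        }
        Class<?> docClass = Class.forName(HWPF_DOCUMENT);
        Object doc = docClass.getConstructor(InputStream.class).newInstance(in);
        Object range = docClass.getMethod("getRange").invoke(doc);
        return (String) range.getClass().getMethod("text").invoke(range);
    }
}
```

With only POI 2.5.1 deployed, HwpfProbe.isOnClasspath(HwpfProbe.HWPF_DOCUMENT) should return false, so extractText returns null rather than failing with a NoClassDefFoundError.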

With the present configuration I cannot read content from any MS Word file. Do you have a workaround?

Thanks
Best regards.
stephane
3 REPLIES

kevinr
Star Contributor
We will upgrade POI in a future version - it's on the list but not at the top :)

We use the TextMining jar library (http://www.textmining.org/TextMining) to successfully extract text from MS Word files.

Take a look at the source for the class org.alfresco.repo.content.transform.TextMiningContentTransformer for a very simple example.

Hope this helps,

Kevin

stebans
Champ in-the-making
Thanks a lot Kevin. I may use textmining instead of poi for content extraction.
Best regards
Stephane

brailateo
Champ in-the-making
I found a mention somewhere in the forums that Alfresco can be configured to use OpenOffice for the MS Word -> text transformation needed for full-text indexing of Microsoft Word files. I have searched the web exhaustively but could not find an example of how to do this configuration.

I know that at the moment the TextMining library is used for .DOC to .TXT conversion, both for specific actions and for preparing text for FTS indexing with Lucene.

But I have an MS Word file that cannot be fully indexed: words from the end of the document cannot be found with a Lucene search query.
I transformed the .DOC to a .TXT file (using the Run Action "Transform and copy to destination") and found that the .TXT file is truncated. The last 25% of the file is missing, which is obviously why those words cannot be found by a Lucene TEXT search. I'm guessing that the TextMining library is the problem. I ran some tests with the POI 3.5 beta4-20081128 library, and the text was extracted perfectly.
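
For anyone who wants to quantify this kind of truncation, here is a rough sketch that compares two extractions of the same document, for example the POI output as reference and the TextMining output as candidate. The class TruncationCheck and its methods are hypothetical helpers, and the sketch assumes both extractors produce the same text up to whitespace differences, which may not hold for every document:

```java
// Hypothetical helper for illustration; not part of Alfresco, POI or TextMining.
final class TruncationCheck {
    private TruncationCheck() {}

    // Collapses runs of whitespace so line-ending differences do not matter.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Fraction of the reference text (0.0 to 1.0) that lies beyond the point
    // up to which the candidate text agrees with it, i.e. the "missing tail".
    static double missingTailFraction(String reference, String candidate) {
        String ref = normalize(reference);
        String cand = normalize(candidate);
        if (ref.isEmpty()) {
            return 0.0;
        }
        int limit = Math.min(ref.length(), cand.length());
        int common = 0;
        while (common < limit && ref.charAt(common) == cand.charAt(common)) {
            common++;
        }
        return 1.0 - (double) common / ref.length();
    }
}
```

A result near 0.25 would match the symptom described above. A much larger value more likely means the two extractors diverge early (different handling of headers, fields, or special characters) than that the output is truncated.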

I have filed a bug in JIRA (ALFCOM-2527) and am waiting for a solution, but until then my question is:
Is there any way to configure Alfresco to use OpenOffice to convert .DOC files to .TXT instead of TextMiningContentTransformer?

I'm using Alfresco 3.0 Stable on OpenSUSE 11.1 (i586, 32-bit), on a Pentium IV machine with 2 GB, with the Sun Java(TM) SE Runtime Environment (build 1.6.0_11-b03).
Thanks in advance,
Teo