
Read document content (doc, docx, odt)

revenge
Champ in-the-making
Hi all,

In my first application I read document content using the Alfresco Web Services API together with the Apache POI 3.5 and PDFBox libraries.

Now I'm developing an action for Alfresco 3.2, so I cannot use POI 3.5 because Alfresco already bundles version 3.1.
For PDF documents there's no problem, because I can pass the ContentReader input stream to PDFBox and convert it into plain text (a String).

But what about doc/docx/odt? How can I read the document content?

Thanks in advance,

   Revenge
10 REPLIES

ivo_costa
Champ in-the-making
Hi Revenge…

Check the OpenOffice API. Alfresco already includes it, and you can use it to read many document formats.

Regards

Ivo Costa

revenge
Champ in-the-making
For Word (.doc) documents I've found the Word97TextExtractor:
Word97TextExtractor extractor = new Word97TextExtractor(this._stream);
strContent = extractor.getText();

While looking for other classes I found UnoContentTransformer.java, which transforms OpenOffice-supported documents directly in the repository,

but it uses the net.sf.joott.uno package, which I can't find in the SDK.

Do you have any suggestions for solving this problem? Or do you know which classes I could use?


Thanks,
  Revenge

zaizi
Champ in-the-making
You can use Alfresco's content transformation functionality to get a plain-text version of any supported content.


    ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
    if (reader != null && reader.exists())
    {
        // get the transformer
        ContentTransformer transformer = contentService.getTransformer(reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
        // is this transformer good enough?
        if (transformer == null)
        {
            // We have a transformer that is fast enough
            ContentWriter writer = contentService.getTempWriter();
            writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);

            try
            {
                transformer.transform(reader, writer);
                // point the reader to the newly written content
                reader = writer.getReader();
                // Check that the reader is a view onto something concrete
                if (!reader.exists())
                {
                    throw new ContentIOException("The transformation did not write any content, yet: \n"
                            + "   transformer:     " + transformer + "\n" + "   temp writer:     " + writer);
                }
            }
            catch (ContentIOException e)
            {
                // the exception is silently ignored
            }
        }
    }

revenge
Champ in-the-making
I changed only this line:
if (transformer == null)
to
if (transformer != null)
and then read the content from the new reader:

reader = writer.getReader();
// Check that the reader is a view onto something concrete
if (!reader.exists()) {
   throw new ContentIOException(
   "The transformation did not write any content, yet: \n"
   + "   transformer:     " + transformer
   + "\n" + "   temp writer:     "
   + writer);
} else {
   content = reader.getContentString();
}

At first I ruled out this transformer because I thought it worked only inside the repository. It does work that way, but on temporary files, so I can get a reader based on the temp file. Once you posted the code, I understood how it works.
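
For anyone who wants to try the idea without an Alfresco install, here is a minimal, library-free sketch of the same pattern: check that a transformer exists, let it write into a temporary target, verify that something was written, then read the result back as a String. Everything in it is a hypothetical stand-in and not Alfresco API: TextTransformer, TransformSketch, COPY and extractText are names invented for illustration, and an in-memory buffer plays the role of the temp writer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class TransformSketch {
    // Toy transformer (assumption): copies the bytes through unchanged.
    static final TextTransformer COPY = (in, out) -> {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    };

    /** Returns the transformed text, or null when no transformer is available. */
    static String extractText(InputStream source, TextTransformer transformer) {
        if (transformer == null) {
            return null; // nothing can convert this source to plain text
        }
        // The in-memory buffer stands in for the temporary writer.
        ByteArrayOutputStream temp = new ByteArrayOutputStream();
        try {
            transformer.transform(source, temp);
        } catch (IOException e) {
            throw new UncheckedIOException("Transformation failed", e);
        }
        // Check that the transformation actually produced something.
        if (temp.size() == 0) {
            throw new IllegalStateException("The transformation did not write any content");
        }
        // Read the temporary content back as a String.
        return new String(temp.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        InputStream source = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractText(source, COPY));
    }
}

// Hypothetical stand-in for a content transformer; NOT the Alfresco interface.
interface TextTransformer {
    void transform(InputStream source, OutputStream target) throws IOException;
}
```

The important part is the null check: the conversion runs only when a transformer was actually found, which is exactly the line I had to change.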

Thanks very much!

Bye,
  Revenge

ivo_costa
Champ in-the-making
A side note…

The "net.sf.joott.uno" package you couldn't find is part of the OpenOffice API,

just in case you need something more complicated ;)

mrogers
Star Contributor
Another side note.  The POI library is now version 3.5.

magno
Champ in-the-making
Hello,
I'm doing (or trying to do) something similar, and I have some questions about that code.
I want to convert HTML, which I obtain from a JSP, to Word format, because I'm trying to create a document in Alfresco through the Web Services API. I need to do the transformation before creating the Word document, but I'm stuck.
I create a node like this:
    Store storeRef = new Store(Constants.WORKSPACE_STORE, "SpacesStore");
    ParentReference companyHomeParent = new ParentReference(storeRef, null, "/app:company_home", Constants.ASSOC_CONTAINS, null);
    companyHomeParent.setChildName("cm:" + name);
    String id = companyHomeParent.getUuid();
    Reference nodeRef = new Reference(storeRef, id, null);
and then I tried out the code posted above, but I don't understand what the value of contentService should be.
If someone can help me, I'll be very thankful!
           

    ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
    if (reader != null && reader.exists())
    {
        // look up a transformer from the source mimetype to plain text
        ContentTransformer transformer = contentService.getTransformer(reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
        // proceed only if such a transformer exists
        if (transformer != null)
        {
            // write the plain-text version to a temporary location
            ContentWriter writer = contentService.getTempWriter();
            writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);

            try
            {
                transformer.transform(reader, writer);
                // point the reader to the newly written content
                reader = writer.getReader();
                // Check that the reader is a view onto something concrete
                if (!reader.exists())
                {
                    throw new ContentIOException("The transformation did not write any content, yet: \n"
                            + "   transformer:     " + transformer + "\n" + "   temp writer:     " + writer);
                }
            }
            catch (ContentIOException e)
            {
                // the exception is silently ignored
            }
        }
    }

magno
Champ in-the-making
Oh, one more thing: I don't know whether I need a special library to use those classes, for example the ContentReader class.

madhuri
Champ in-the-making
I am new to Alfresco and am trying to read content from a text file. I am trying to use ContentReader for this, but I get an error asking me to create an interface for it. Can anyone tell me how to use ContentReader?