Hyland Connect

revenge · ‎10-28-2009

Hi all,

In my first application I read document content using the Alfresco Web Services API together with the Apache POI 3.5 and PDFBox libraries.

Now I'm developing an action for Alfresco 3.2, where I cannot use POI 3.5 because Alfresco already bundles version 3.1.
For PDF documents there's no problem, because I can pass the ContentReader input stream to PDFBox and convert it into plain text (a String).

But what about doc/docx/odt? How can I read the document content?

Thanks in advance,

Revenge

ivo_costa · ‎10-30-2009

Hi Revenge…

Check the OpenOffice API. Alfresco already includes it, and you can use it to read a lot of document formats.

Regards

Ivo Costa

revenge · ‎10-30-2009

For word (.doc) documents I've found the Word97TextExtractor….

Word97TextExtractor extractor = new Word97TextExtractor(this._stream);
strContent = extractor.getText();

While looking for other classes I found UnoContentTransformer.java, which transforms the OpenOffice-supported documents directly in the repository…

but it uses the net.sf.joott.uno package, which I can't find in the SDK.

Do you have any suggestions for solving this problem, or do you know which classes I could use?

Thanks,
Revenge

zaizi · ‎10-30-2009

You can just use Alfresco's content transformation functionality to get a text version of all supported content for you.


ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
if (reader != null && reader.exists())
{
    // get the transformer
    ContentTransformer transformer = contentService.getTransformer(reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
    // is this transformer good enough?
    if (transformer == null)
    {
        // We have a transformer that is fast enough
        ContentWriter writer = contentService.getTempWriter();
        writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);

        try
        {
            transformer.transform(reader, writer);
            // point the reader to the new-written content
            reader = writer.getReader();
            // Check that the reader is a view onto something concrete
            if (!reader.exists())
            {
                throw new ContentIOException("The transformation did not write any content, yet: \n"
                        + "   transformer:     " + transformer + "\n" + "   temp writer:     " + writer);
            }
        }
        catch (ContentIOException e)
        {

        }
    }
}

revenge · ‎11-02-2009

I changed only this line

if (transformer == null)

to

if (transformer != null)

and then read the text in an else branch:

reader = writer.getReader();
// Check that the reader is a view onto something concrete
if (!reader.exists()) {
    throw new ContentIOException(
            "The transformation did not write any content, yet: \n"
            + "   transformer:     " + transformer
            + "\n" + "   temp writer:     "
            + writer);
} else {
    content = reader.getContentString();
}

At first I had ruled this transformer out because I thought it only worked on content stored in the repository. It does work that way, but it can also write to a temporary file, so I can get a reader based on the temp file. Once you posted the code, I understood how it works.

Thanks very much!

Bye,
Revenge
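
For readers following along, here is a minimal, self-contained sketch of the control flow discussed in the posts above: look up a transformer to text/plain, skip the document when none is available, transform into a temporary writer, then read the result back as a String. The Reader, Writer, Transformer, TransformerRegistry and MemoryWriter types are hypothetical stand-ins written only so the sketch compiles on its own; they are not the Alfresco interfaces. In a real action you would use ContentReader, ContentWriter, ContentTransformer and ContentService as in the snippets posted in this thread.

```java
import java.util.Optional;

// Stand-in types: NOT the Alfresco API, only minimal look-alikes so that
// the null check and the read-back step can be compiled and run in isolation.
final class TextExtractionSketch {

    interface Reader {
        boolean exists();
        String getMimetype();
        String getContentString();
    }

    interface Writer {
        void setMimetype(String mimetype);
        void putContent(String content);
        Reader getReader();
    }

    interface Transformer {
        void transform(Reader source, Writer target);
    }

    /** Hypothetical lookup playing the role of contentService.getTransformer(...). */
    interface TransformerRegistry {
        Transformer getTransformer(String sourceMimetype, String targetMimetype);
    }

    static final String TEXT_PLAIN = "text/plain";

    /** In-memory writer standing in for the temp writer used in the thread. */
    static final class MemoryWriter implements Writer {
        private String mimetype;
        private String content; // stays null until something is written

        public void setMimetype(String mimetype) { this.mimetype = mimetype; }
        public void putContent(String content) { this.content = content; }

        public Reader getReader() {
            final String written = content;
            final String type = mimetype;
            return new Reader() {
                public boolean exists() { return written != null; }
                public String getMimetype() { return type; }
                public String getContentString() { return written; }
            };
        }
    }

    /** Returns the plain text, or empty when there is no content or no transformer. */
    static Optional<String> extractText(Reader source, TransformerRegistry registry) {
        if (source == null || !source.exists()) {
            return Optional.empty();
        }
        Transformer transformer = registry.getTransformer(source.getMimetype(), TEXT_PLAIN);
        if (transformer == null) {
            // No conversion to text/plain is available for this mimetype.
            return Optional.empty();
        }
        Writer writer = new MemoryWriter();
        writer.setMimetype(TEXT_PLAIN);
        transformer.transform(source, writer);
        // Point a new reader at the freshly written content.
        Reader textReader = writer.getReader();
        if (!textReader.exists()) {
            throw new IllegalStateException("The transformation did not write any content");
        }
        return Optional.of(textReader.getContentString());
    }
}
```

Returning an empty Optional when no transformer exists keeps the "unsupported format" case separate from a transformation that ran but wrote nothing, which is the distinction the `!= null` fix above depends on.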

ivo_costa · ‎11-05-2009

A side note…

The "net.sf.joott.uno" package you couldn't find is part of the OpenOffice API,

just in case you need something more complicated

mrogers · ‎11-05-2009

Another side note. The POI library is now version 3.5.

magno · ‎07-15-2010

Hello,
I'm doing (or trying to do) something similar, and I have some questions about that code…
I want to transform HTML, which I obtain from a JSP, to Word format, because I'm trying to create a document in Alfresco through the web services API. I need to do the transformation before creating the Word document, but I'm stuck on that.
I create a node like this:

Store storeRef = new Store(Constants.WORKSPACE_STORE, "SpacesStore");
ParentReference companyHomeParent = new ParentReference(storeRef, null, "/app:company_home", Constants.ASSOC_CONTAINS, null);
companyHomeParent.setChildName("cm:" + name);
String id = companyHomeParent.getUuid();
Reference nodeRef = new Reference(storeRef, id, null);

and then I tried the code posted earlier, but I don't understand what contentService should be…
If someone can help me, I'd be very thankful!

ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
if (reader != null && reader.exists())
{
    // get the transformer
    ContentTransformer transformer = contentService.getTransformer(reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
    // is this transformer good enough?
    if (transformer == null)
    {
        // We have a transformer that is fast enough
        ContentWriter writer = contentService.getTempWriter();
        writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);

        try
        {
            transformer.transform(reader, writer);
            // point the reader to the new-written content
            reader = writer.getReader();
            // Check that the reader is a view onto something concrete
            if (!reader.exists())
            {
                throw new ContentIOException("The transformation did not write any content, yet: \n"
                        + "   transformer:     " + transformer + "\n" + "   temp writer:     " + writer);
            }
        }
        catch (ContentIOException e)
        {

        }
    }
}

magno · ‎07-15-2010

Oh, one more thing… I don't know whether I need some special library to use those classes, for example the ContentReader class.

madhuri · ‎08-25-2014

I am new to Alfresco and am trying to read content from a text file. I am trying to use ContentReader for this, but it gives me an error asking me to create an interface for it. Can anyone tell me how to use ContentReader?

Read document content (doc, docx, odt)