Custom metadata extraction from MS Word

col_edinburgh
Champ in-the-making
OK, after a couple of weeks I'm close to giving up. All I want to do is extend the Office metadata extractor so that it collects a custom property called projectID.

I have followed the example in chapter 4 of the book Alfresco Developer Guide (2008) and the wiki page
http://wiki.alfresco.com/wiki/Metadata_Extraction

but I can't do it.

Following the example in the book I can successfully map the 'keywords' property, but I am lost on the 'digging into the extractor class' example. I can't seem to customise the class using the steps laid out.

I fear I am going to have to abandon this project, as I just can't make it work.

Regards
28 REPLIES

col_edinburgh
Champ in-the-making
If anyone would like to PM me a price to resolve this, please do so. I can send you the code I have used.

tsenn
Champ in-the-making
Has this issue been solved?

I am having some trouble extracting custom properties from a .docx Word 2007 or 2010 file.

So far I have tried various configurations with no success. Extraction works fine for an OpenDocument .odt file, but I cannot get the same mapping to work for an Office file. I have not implemented a custom extractor (and would rather not); I am trying to achieve this with pure configuration.

But is it at all possible for Office 2007 documents? Or do I have to use a class file to bridge the custom properties from both worlds anyway?

Here is the bean I use in custom-metadata-extrators-context.xml:

    <bean id="extracter.Office" class="org.alfresco.repo.content.metadata.OfficeMetadataExtracter" parent="baseMetadataExtracter" >
        <property name="inheritDefaultMapping">
            <value>false</value>
        </property>
        <property name="mappingProperties">
            <props>
                <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
                <prop key="smfg1">cm:description</prop>
                <prop key="smfg2">cm:title</prop>
            </props>
        </property>
    </bean>

The functioning OpenDocument bean configuration looks exactly the same except for the first line of course.

Any advice or hint greatly appreciated.

Regards

col_edinburgh
Champ in-the-making
Sadly no, I haven't been able to resolve this.

col_edinburgh
Champ in-the-making
Post 10 resolved the HTTP error.

The remaining problem is that I cannot get the custom extractor to work with MS Office documents. I can, however, get it to extract custom properties from OpenOffice documents.

tsenn
Champ in-the-making
So to summarize, it works with the OpenOffice and OpenDocument Format extractors, but not with Microsoft formats…

Has anyone managed to do this, and if so how (pure configuration of the existing extractor, or a custom extractor)?

Any piece of information would be welcome.

Regards

stevegreenbaum
Champ in-the-making
This is something I was just starting to research. The one thing I found is that Tika v0.9 doesn't directly support pulling non-standard properties from a Word document. I see a note in the roadmap that this will be available in v1.0. I assume, then, that pulling custom properties will get much easier in a future version of Alfresco.
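
For context on why the two formats behave differently: a .docx is a ZIP package, and its custom properties sit in their own part, docProps/custom.xml, separate from the standard ones in docProps/core.xml. Below is a rough sketch of reading them with nothing but the JDK. The class name DocxCustomProps is made up for illustration; it is not part of Alfresco or Tika, and it does not cover the older binary .doc format.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not an Alfresco or Tika class.
public class DocxCustomProps {

    /** Returns custom property name -> text value; empty if the file has none. */
    public static Map<String, String> read(String docxPath) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(docxPath)) {
            // Custom properties live in this package part (OOXML convention).
            ZipEntry entry = zip.getEntry("docProps/custom.xml");
            if (entry == null) {
                return props; // document has no custom properties part
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document doc = factory.newDocumentBuilder().parse(in);
                // Each <property name="..."> wraps one typed value element,
                // e.g. <vt:lpwstr>text</vt:lpwstr>; we take its text content.
                NodeList nodes = doc.getElementsByTagNameNS("*", "property");
                for (int i = 0; i < nodes.getLength(); i++) {
                    Element property = (Element) nodes.item(i);
                    props.put(property.getAttribute("name"),
                              property.getTextContent().trim());
                }
            }
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxCustomProps <file.docx>");
            return;
        }
        for (Map.Entry<String, String> e : read(args[0]).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Run it against a .docx that has a custom property such as projectID set in Word and it should list that property by name. Feeding the result into an Alfresco extractor is a separate step that this sketch does not attempt.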

col_edinburgh
Champ in-the-making
Here is my workaround:

Create three folders: [dropbox], [openoffice] and [word].

In [dropbox], create a rule that transforms the Word document to an OpenOffice text document and copies it to [transform].

In [transform], create three rules:
add new aspect
extract common metadata
transform back from OpenOffice to Word and copy to [word]

You now have a Word document sitting in the [word] folder with its custom metadata extracted.

col_edinburgh
Champ in-the-making
The workaround doesn't work for Word 2007 etc. For some reason the copy-and-transform-to-OpenDocument rule drops the custom properties from the ODT file.

acurs
Champ in-the-making
Hi,
Have you tried metadatawriter from Carl Nordenfelt? Maybe it could help you a little… It is on Alfresco Forge:

http://forge.alfresco.com/projects/metadatawriter/

cheers