Metadata Extraction MS Word

johnpelquingua
Champ in-the-making
Hi All,

How can I customize my Alfresco installation to extract the following metadata from an MS Word document?


/**
* Office file format Metadata Extracter.  This extracter uses the POI library to extract
* the following:
* <pre>
*   <b>author:</b>             –      cm:author
*   <b>title:</b>              –      cm:title
*   <b>subject:</b>            –      cm:description
*   <b>createDateTime:</b>     –      cm:created
*   <b>lastSaveDateTime:</b>   –      cm:modified
*   <b>comments:</b>
*   <b>editTime:</b>
*   <b>format:</b>
*   <b>keywords:</b>
*   <b>lastAuthor:</b>
*   <b>lastPrinted:</b>
*   <b>osVersion:</b>
*   <b>thumbnail:</b>
*   <b>pageCount:</b>
*   <b>wordCount:</b>


For example, if I want to extract just the keywords of an MS Word document, what steps should I take to accomplish that?

I have followed the steps in this tutorial (http://wiki.alfresco.com/wiki/Metadata_Extraction), but it doesn't seem to get me anywhere.

Can you please advise?

Your help is very much appreciated.


Best Regards,
JP
3 REPLIES

mitpatoliya
Star Collaborator
First of all, you need to understand how metadata extraction is implemented in Alfresco.
You can then extend it based on your requirements.

1) Override the extractor.
2) In your custom extractor class, extract the properties from the content of the NodeRef using libraries like Tika.
3) Map the extracted values to your model properties through the extractor's properties file, which you also override.
4) Deploy everything and you will get the results.
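
Step 3 can be sketched roughly as below. This is a stand-alone illustration in plain Java, not the Alfresco API: the class KeywordMappingSketch and the method mapToModel are made up for the example, and the "my" prefix with its namespace URI is a hypothetical custom model standing in for wherever you want the keywords stored. The mapping text follows the style of an extractor properties file, where each raw key such as keywords points at a model property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Stand-alone illustration of the mapping step; not an Alfresco class. */
final class KeywordMappingSketch {

    // Mapping in the style of an extractor .properties file.
    // ASSUMPTION: "my" and its namespace URI are placeholders for a custom model.
    static final String MAPPING =
            "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
          + "namespace.prefix.my=http://example.org/model/custom/1.0\n"
          + "author=cm:author\n"
          + "title=cm:title\n"
          + "keywords=my:keywords\n";

    /** Keeps only the raw values that have a mapping and renames them to model properties. */
    static Map<String, String> mapToModel(Map<String, String> raw, String mappingText) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = mapping.getProperty(entry.getKey());
            if (target != null) {
                // Raw keys without a mapping line (e.g. wordCount) are dropped.
                result.put(target, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("author", "JP");
        raw.put("keywords", "alfresco, metadata");
        raw.put("wordCount", "1200"); // no mapping line, so it is not stored
        System.out.println(mapToModel(raw, MAPPING));
    }
}
```

The point of the sketch is that a value the extractor reads but that has no line in the mapping never reaches the node, so extracting keywords needs both the raw value and a mapping line for it.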

Hi mitpatoliya,

Thank you for your response. I was able to map everything properly, but now I am encountering this error:



[ERROR] Failed startup of context org.mortbay.jetty.webapp.WebAppContext@2586117a{/share,/home/johnpelquingua/armis/runner/../share/target/share.war}
org.springframework.beans.factory.CannotLoadBeanClassException: Cannot find class [org.alfresco.repo.content.metadata.OfficeMetadataExtracter] for bean with name 'extracter.Office' defined in file [/tmp/Jetty_0_0_0_0_8080_share.war__share__.dayk2h_3989603446849538071/webapp/WEB-INF/classes/alfresco/web-extension/content-services-context.xml]; nested exception is java.lang.ClassNotFoundException: org.alfresco.repo.content.metadata.OfficeMetadataExtracter



All I did was put content-services-context.xml and custom-metadata-extractors-context.xml under src/main/amp/config/alfresco/web-extension/ and my OfficeMetadataExtracter at this path: alfresco/src/main/java/org/alfresco/repo/content/metadata/OfficeMetadataExtracter.java

Am I doing it right?

Any advice?

Your help is much appreciated.


Best Regards,
JP




mitpatoliya
Star Collaborator
The content-services-context.xml file should go under <tomcat>/webapps/alfresco/WEB-INF/classes/alfresco/extension, not under Share's web-extension directory. That is the reason you are getting the error.
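
To see why the location matters: the ClassNotFoundException in the log comes from the Share webapp trying to load a class that lives in the repository webapp. A minimal sketch of that kind of lookup, in plain Java, is below. ClasspathCheckSketch and isOnClasspath are names made up for this example and are not part of Alfresco or Spring; the check simply asks the current class loader whether it can find a class by name, which is roughly what fails when the bean definition is read from the wrong webapp.

```java
/** Stand-alone illustration; not part of Alfresco or Spring. */
final class ClasspathCheckSketch {

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initialisers
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String extracter = "org.alfresco.repo.content.metadata.OfficeMetadataExtracter";
        // Run outside the repository webapp, this is expected to report false.
        System.out.println(extracter + " loadable: " + isOnClasspath(extracter));
    }
}
```

Running a check like this from code deployed in each webapp would show which one can actually see the extracter class.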