Getting metadata from the content of a document

jlabuelo
Champ on-the-rise
Hi there

I am trying to assign a custom type to a document dropped in a Space using a business rule, and I would then like to fill in the properties of that type automatically, taking the metadata from the "content" of the document.

Let me explain what I intend to achieve:

a) Move a Word, Excel or PDF document to a Drop Space in my application using CIFS – DONE
b) As soon as the file arrives in the Drop Space, it gets a new type assigned, which I have defined, using a business rule – DONE

c) The third step would be to read the document (the content of the document) and find in it the values for the metadata of the type defined in step b), so I can assign them to the document.

For example, I have defined the type "CompanyInfomation" which contains two properties:
1) CompanyName
2) CompanyId.

I will load documents into a drop space, and each document will contain, somewhere in its text, "Name of the Company: NAME" and "Id of the Company: Id". I need to identify those labels somehow and extract the values NAME and ID, so I can assign them to the document's properties "CompanyName" and "CompanyId".
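
To make the extraction step concrete, here is a minimal sketch of picking the two labelled values out of text with regular expressions. It assumes the document text is already available as a plain string (a binary Word, Excel or PDF file would first have to be converted to text), and the function name extractCompanyInfo and the sample text are hypothetical, used only for illustration.

```javascript
// Hypothetical helper: extracts the labelled values from plain text.
// The labels are the ones described above; everything else is an assumption.
function extractCompanyInfo(text) {
  // The name may contain spaces, so capture up to the end of the line.
  var nameMatch = /Name of the Company:[ \t]*(.+)/.exec(text);
  // The id is assumed to be a single token without spaces.
  var idMatch = /Id of the Company:[ \t]*(\S+)/.exec(text);
  return {
    companyName: nameMatch ? nameMatch[1].trim() : null,
    companyId: idMatch ? idMatch[1] : null
  };
}

// Sample text standing in for the content of a dropped document.
var sample = "Annual report\n" +
             "Name of the Company: Acme Industries\n" +
             "Id of the Company: B12345678\n";

var info = extractCompanyInfo(sample);
console.log(info.companyName); // Acme Industries
console.log(info.companyId);   // B12345678
```

A missing label gives null instead of an error, so the calling script can decide whether to leave the property empty.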

I am able to assign "fixed values" by hardcoding them in a JavaScript script, but now I would like to get the values from the documents and don't know exactly how to do it.

I have been reading the wiki page about the MetaDataExtractor, but I did not find (maybe I did not search well, even though I tried hard!!) how to extract metadata from inside the document. I know how to extract the author, the creation date of the document and so on, but nothing from inside it.

I have tried this JavaScript, but of course it did not work.

Any ideas about how I should approach this?

Thanks a lot in advance guys!!

(JS code I tried to use)
Code:

// First we read the document to find the values.
// Assumption: document.content holds plain text; binary Word, Excel or PDF
// content would have to be converted to text before this can work.
var Sociedad = "";
var CIF = "";

var FileContent = String(document.content);
var FileLines = FileContent.split("\n");
var Lines = 0;

var words;
var foundSociedad = false;
var foundtradoCIF = false;

// Scan line by line until both values are found or the lines run out.
while ((!foundSociedad || !foundtradoCIF) && (Lines < FileLines.length))
{
   var Word_Count = 0;
   // Split on any whitespace so a trailing "\r" does not stick to a word.
   words = FileLines[Lines].split(/\s+/);
   // Stop one word early so that words[Word_Count + 1] always exists.
   while ((!foundSociedad || !foundtradoCIF) && (Word_Count < words.length - 1))
   {
      if (words[Word_Count] == "Sociedad:")
      {
        foundSociedad = true;
        Sociedad = words[Word_Count + 1];
      }

      if (words[Word_Count] == "CIF:")
      {
        foundtradoCIF = true;
        CIF = words[Word_Count + 1];
      }
      Word_Count = Word_Count + 1;
   }
   Lines = Lines + 1;
}

// Now that we have the values we apply them to the properties of the custom type.

document.properties.name = "Doc_" + Sociedad + "_" + CIF;
document.properties["custom:CompanyName"] = Sociedad;
document.properties["custom:CompanyCIF"] = CIF;
document.save();
6 REPLIES

vicsaego
Champ in-the-making
Hi,

I'm interested in doing the same, but I can't find any information about how to do it.
Did you find a solution?
Is it possible to do it?

Thanks very much in advance

jlabuelo
Champ on-the-rise
Hi there

It seems that it is possible; however, you would need to design the process yourself. Nobody from the Alfresco Community has replied to this post, and we are investigating how to create a script interface in Java to upload a file to Alfresco and, once it is uploaded, change its metadata using information from the file.

At the moment this project is on hold, but it is one of our aims for the near future. We have found that this is quite difficult to implement unless the document you want to scan or upload is a form with a static structure, so that the code can always check the same "positions" in the document.

For your information, there is an interface that lets you upload a scanned document with OCR directly from the scanner to Alfresco, but you need to buy it (KOFAX is the company that designed it) and it is quite expensive. If you search Google for Alfresco-Kofax you will find the link.

Sorry I cannot help you more at this time. If you find any information, I would appreciate it if you could share it with me.

Regards

loftux
Star Contributor
I think you will have to write a custom extractor to do that.
The challenge is not only to dissect the MS Word format, but also to know where to look in the document. You can try to force users to put the values in a certain table or field, but users have a tendency to move or remove things :)

Maybe you are better off writing some VBA code: a dialog that asks for the value, puts it in the right place in the document AND writes it to the properties of the file. Then you can use the Alfresco metadata extractor to transfer the values, with only configuration needed.

//Peter Löfgren

jlabuelo
Champ on-the-rise
Hi Peter

What you are suggesting sounds quite interesting. Have you already implemented something like this? Sorry, but I don't understand exactly what you mean by writing to the "properties of the file" before uploading it to Alfresco.

Also, could you point me to some documentation on configuring a metadata extractor for our Alfresco installation?

Thanks a lot!!

loftux
Star Contributor
I've not done it yet in Alfresco, although in my previous life as a consultant on proprietary systems I've done things like this.

Do a search on http://www.google.com/search?q=word+custom+properties and you will get some links; this one looks promising: http://msdn.microsoft.com/en-us/library/aa537163.aspx. Write either to custom properties or to the built-in MS Office properties, and then let Alfresco extract them and overwrite on each save by setting the overwrite policy to eager.

I do not know how to do it for PDFs, but the above method will work for Office documents.
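
To picture what the overwrite policy changes, here is a simplified model in plain JavaScript. It is not Alfresco code: the function applyExtracted and the property names are hypothetical, and it covers only two policies as I understand them (eager replaces an existing value whenever a non-null value is extracted, cautious only fills properties that are not set yet).

```javascript
// Simplified illustration only, not the Alfresco implementation.
// current:   properties already stored on the node
// extracted: properties read from the file by an extractor
// policy:    "EAGER" or "CAUTIOUS" (assumed semantics, see the text above)
function applyExtracted(current, extracted, policy) {
  var result = {};
  var key;
  // Work on a copy so the original property map is left untouched.
  for (key in current) {
    if (Object.prototype.hasOwnProperty.call(current, key)) {
      result[key] = current[key];
    }
  }
  for (key in extracted) {
    if (!Object.prototype.hasOwnProperty.call(extracted, key)) continue;
    var value = extracted[key];
    // Values that could not be extracted never overwrite anything.
    if (value === null || value === undefined) continue;
    if (policy === "EAGER") {
      result[key] = value;
    } else if (policy === "CAUTIOUS" && !(key in result)) {
      result[key] = value;
    }
  }
  return result;
}

var stored = { CompanyName: "Old Name" };
var fromFile = { CompanyName: "New Name", CompanyId: "B12345678" };

console.log(applyExtracted(stored, fromFile, "EAGER").CompanyName);    // New Name
console.log(applyExtracted(stored, fromFile, "CAUTIOUS").CompanyName); // Old Name
```

With eager, every save of the Office file pushes the current file properties into the repository, which is the behaviour you want if the file is the master copy of the values.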

Here are some links:
http://wiki.alfresco.com/wiki/Metadata_Extraction
http://forums.alfresco.com/en/viewtopic.php?=&p=22670 (on the topic of how to configure extraction of custom properties, not resolved)

jlabuelo
Champ on-the-rise
Thanks a lot Peter

I will take a look at these links next week and will let you know what I find 😄

Regards