Hyland Connect

jzaruba · ‎10-11-2010

I have a PDF containing a (multi-page) scanned text document and an XML file containing its OCR output. Obviously I need the text data to be searchable. With my near-zero experience with Alfresco, it would be much easier to store the text data in a property of an aspect on that PDF (and to throw the XML away completely). My concern, though, is whether there is any limit on such properties.
I expect the documents to be several tens of pages long at most, which IMO fits very safely into 1 MB of text in one property. (The PDF will also carry several other tiny properties.) That should not pose any problem for the underlying DB (MySQL in my case). Is there anything in Alfresco that should prevent me from choosing this approach?

ethan · ‎10-11-2010

Hi

I tried the same approach, but there actually is a limit for text properties: one cannot contain more than ~65,000 characters. The best approach for searching a document's text is to put it into the cm:content property as a stream.

jzaruba · ‎10-11-2010

Thank you for the reply…
…I will look more into the cm:content type, I did not know about it, or that it can provide a stream.

Thanks!

jzaruba · ‎10-11-2010

May I ask what the proper way of populating such a property in Java is?
I've defined the property in my Aspect…
–
<property name="com:textContent">
<type>cm:content</type>
</property>
–
Should I store (and possibly hide) a text file/document somewhere and then pass its NodeRef as the property value?

ethan · ‎10-11-2010

I'm not sure you have to add a new property to your model. If your model inherits from Alfresco's out-of-the-box cm:content type, there is already a cm:content property.

How do you add your files to the repository? If you use the "add content" action, then the content of your PDF file is automatically added to the cm:content property. If you use a Java class to add new content programmatically, please see the Introduction to the Alfresco Java Content Repository API.

jzaruba · ‎10-11-2010

Thanks for your time (and patience), ethan.

> I'm not sure you have to add a new property to your model. If your model inherits from Alfresco's out-of-the-box cm:content type, there is already a cm:content property.
>
> How do you add your files to the repository? If you use the "add content" action,

That's the case at the moment. I'm adding the PDF files via the Web Client UI.

> then the content of your PDF file is automatically added to the cm:content property.

Wouldn't assigning the text content to the cm:content property then result in the loss of the PDF binary data? (Or is the data in the cm:content property a mere, and quite useless, copy of the PDF file that Alfresco keeps in its filesystem?)

> If you use a Java class to add new content programmatically, please see the Introduction to the Alfresco Java Content Repository API.

As far as I can see, the examples show two ways of assigning a value to the cm:content property: either by passing a string value (which I guess carries the 65k limitation) or by passing a stream, as you mentioned earlier. If my understanding is correct, I somehow need to obtain a stream for an existing NodeRef (the OCR output, which I'd have to extract from the XML) and assign it to cm:content…

ethan · ‎10-11-2010

The binary of the PDF file is inside the cm:content property and Alfresco can search within it. So if you just add your file with the "add content" action, you should be able to search for the text which is inside the XML in your PDF file. Did you try it?

With the Java Content Repository API, you can indeed add content with the method Node.setProperty("cm:content", "my new content"), but I think you also need to specify a MIME type. I'm not sure you can put a simple string inside the cm:content property and then search for it with Alfresco.

jzaruba · ‎10-12-2010

> The binary of the PDF file is inside the cm:content property and Alfresco can search within it. So if you just add your file with the "add content" action, you should be able to search for the text which is inside the XML in your PDF file. Did you try it?

But how do I link the XML to my PDF? (I also took a look at the actions that are available for the already uploaded PDF file, but I don't see anything that would let me create such a link.)
I need to be able to do this using the API anyway, but I guess I must be missing something important here…
Just to be sure which "add content" action you mean, this is where I upload the PDF:
[img]http://dl.dropbox.com/u/219075/Alfresco2.PNG[/img]

> With the Java Content Repository API, you can indeed add content with the method Node.setProperty("cm:content", "my new content"), but I think you also need to specify a MIME type. I'm not sure you can put a simple string inside the cm:content property and then search for it with Alfresco.

My understanding was that you were actually assigning a stream to cm:content. Weren't you?
Anyway, I guess you're right about the MIME type; this is what I suppose should go into cm:content:
http://wiki.alfresco.com/wiki/Data_Dictionary_Guide#Data_Types
ContentData(java.lang.String contentUrl, java.lang.String mimetype, long size, java.lang.String encoding)

Oh and BTW, this attempt of mine did not pass anyways

–
<property name="com:textContent">
<type>cm:content</type>
</property>
–

ethan · ‎10-12-2010

–
<property name="com:textContent">
<type>cm:content</type>
</property>
–

This is because the property's type must be the data type d:content, not cm:content (which is a content type, not a data type).

(Look at the /alfresco/WEB-INF/classes/alfresco/model/contentModel.xml file).

Sorry, my mistake, I thought your XML was already inside the PDF file u_u. You could take a look at the metadata extractors, which are called after a file is uploaded to Alfresco. Maybe you could implement your own extractor to parse the XML file associated with your PDF file and modify the cm:content property of your PDF file's node.
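
For the XML-parsing step, a minimal sketch in plain Java (JDK only, no Alfresco classes) might look like the following. The element name "line", the class name OcrTextExtractor, and both method names are assumptions made for illustration, since the actual OCR schema was not posted in this thread. The 65,000 figure is the approximate limit reported earlier in the thread, not a value taken from Alfresco's documentation.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper: pulls the recognized text out of an OCR XML file. */
final class OcrTextExtractor {

    // Approximate text-property limit mentioned in this thread (an assumption, not an official value).
    static final int APPROX_TEXT_PROPERTY_LIMIT = 65_000;

    /**
     * Joins the trimmed text of every <line> element, in document order,
     * separated by newlines. The "line" element name is assumed.
     */
    static String extractText(InputStream ocrXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(ocrXml);
            NodeList lines = doc.getElementsByTagName("line");
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < lines.getLength(); i++) {
                if (i > 0) {
                    text.append('\n');
                }
                text.append(lines.item(i).getTextContent().trim());
            }
            return text.toString();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers do not need checked exceptions.
            throw new IllegalArgumentException("Could not parse OCR XML", e);
        }
    }

    /** Tells whether the text would stay under the approximate limit for a plain text property. */
    static boolean fitsInTextProperty(String text) {
        return text.length() <= APPROX_TEXT_PROPERTY_LIMIT;
    }
}
```

The returned string could then be written to a node's content property as a stream rather than stored in a plain text property; how that is wired into Alfresco is outside this sketch.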

You could also check Content transformation and OCR integration (here and there).

As I'm not skilled enough with this part of Alfresco, I can't provide more precise information =( Hope it helps though.

jzaruba · ‎10-12-2010

> As I'm not skilled enough with this part of Alfresco, I can't provide more precise information =( Hope it helps though.

Thanks for your time & effort, I'm gonna look into it.
Cheers
JZ

Is there any limit to text properties?