
Is there any limit to text properties?

jzaruba
Champ in-the-making
I have a PDF containing a (multi-page) scanned text document and an XML file containing its OCR output. Obviously I need the text data to be searchable. With my near-zero experience with Alfresco, it would be much easier to store the text data in a property of that PDF's aspect (and to throw the XML away completely). My concern, though, is whether there's any limit to such properties.
I expect the documents to be several tens of pages long at most, which IMO very safely fits into 1 MB of text in one property. (The PDF will also carry several other tiny properties.) That should not pose any problem for the underlying DB (MySQL in my case). Is there anything in Alfresco that should prevent me from choosing this approach?
9 REPLIES

ethan
Champ in-the-making
Hi :)

I tried the same approach, but there is actually a limit on text properties: one cannot contain more than ~65,000 characters. The best approach for making a document's text searchable is to put it into the cm:content property as a stream.
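
To put that figure next to the "several tens of pages" estimate from the question, here is a minimal sketch of the arithmetic. The 65,000 constant is only the approximate limit reported in this thread, not a value verified against any Alfresco release, and the per-page character count is purely an assumption for illustration.

```java
class TextPropertyLimit {

    // Approximate limit reported in this thread; not verified against any Alfresco release.
    static final int REPORTED_LIMIT_CHARS = 65_000;

    /** Returns true if the text would fit under the reported limit. */
    static boolean fitsInTextProperty(String text) {
        return text.length() <= REPORTED_LIMIT_CHARS;
    }

    /** Rough size estimate: pages multiplied by an assumed number of characters per page. */
    static long estimateChars(int pages, int charsPerPage) {
        return (long) pages * charsPerPage;
    }

    public static void main(String[] args) {
        // Assumption for illustration only: about 3,000 characters per scanned page.
        long estimate = estimateChars(30, 3000);
        System.out.println(estimate + " chars, over the reported limit: "
                + (estimate > REPORTED_LIMIT_CHARS));
    }
}
```

Under those assumed numbers, a 30-page document already exceeds the reported limit, which is why storing the text as content rather than as a plain text property looks like the safer choice.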

jzaruba
Champ in-the-making
Thank you for the reply…
…I will look more into the cm:content type, I did not know about it, or that it can provide a stream.

Thanks!

jzaruba
Champ in-the-making
May I ask what the proper way of populating such a property in Java is?
I've defined the property in my Aspect…

<property name="com:textContent">
   <type>cm:content</type>
</property>

Should I store (and possibly hide) a text file/document somewhere and then pass its NodeRef as the property value?

ethan
Champ in-the-making
I'm not sure you have to add a new property to your model. If your model inherits from the Alfresco out-of-the-box cm:content model, there is already a cm:content property.

How do you add your files to the repository? If you use the "add content" action, then the content of your PDF file is automatically added to the cm:content property. If you use a Java class to programmatically add new content, please see the Introduction to the Alfresco Java Content Repository API.

jzaruba
Champ in-the-making
Thanks for your time (and patience), ethan.

I'm not sure you have to add a new property to your model. If your model inherits from the Alfresco out-of-the-box cm:content model, there is already a cm:content property.

How do you add your files to the repository? If you use the "add content" action,

That's the case at the moment. I'm adding the PDF-files via Web Client UI.

then the content of your PDF file is automatically added to the cm:content property.

Wouldn't assigning the text content to the cm:content property then result in loss of the PDF binary data? (Or is the data in the cm:content property a mere (quite useless) copy of the PDF file that Alfresco keeps in its filesystem?)

If you use a Java class to programmatically add new content, please see the Introduction to the Alfresco Java Content Repository API.

As far as I can see, there are two ways of assigning a value to the cm:content property in the examples: either by passing a string value (which I guess bears the 65k limitation) or by passing a stream, as you mentioned earlier. If my understanding is correct, I need to somehow obtain a stream for an existing NodeRef (the OCR output, which I'd have to extract out of the XML) and assign it to cm:content…
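
The "extract out of the XML" step could be sketched roughly as follows, using only the JDK's DOM parser. The OCR XML schema is not described in this thread, so the sketch deliberately ignores element names and just collects every text node; the class and method names (`OcrTextExtractor`, `extractText`, `asUtf8Stream`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class OcrTextExtractor {

    /** Parses the OCR XML and returns all of its text nodes joined by single spaces. */
    static String extractText(InputStream ocrXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(ocrXml);
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse OCR XML", e);
        }
    }

    /** Depth-first walk that appends each non-blank text node, whitespace-normalized. */
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    /** Wraps the extracted text as a UTF-8 stream, the form a content store would consume. */
    static InputStream asUtf8Stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

The resulting stream is plain text in UTF-8, so whatever writes it into the repository would presumably need to declare the MIME type text/plain and the encoding UTF-8 alongside it.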

ethan
Champ in-the-making
The binary of the PDF file is stored in the cm:content property, and Alfresco can search within it. So if you just add your file with the "add content" action, you should be able to search for the text that is inside the XML in your PDF file. Did you try it?

With the Java Content Repository API, you can indeed add content with the method Node.setProperty("cm:content", "my new content"), but I think you also need to specify a MIME type. I'm not sure you can put a simple string inside the cm:content property and then search for it with Alfresco.

jzaruba
Champ in-the-making
The binary of the PDF file is stored in the cm:content property, and Alfresco can search within it. So if you just add your file with the "add content" action, you should be able to search for the text that is inside the XML in your PDF file. Did you try it?

But how do I link the XML into my PDF? (I also took a look at actions that are available for the already uploaded PDF file, but I don't see anything that would let me create such link.)
I need to be able to do this stuff using API anyways, but I guess I must be missing something important here…
Just to be sure which 'action "Add content"' you mean, this is where I upload the PDF:
[img]http://dl.dropbox.com/u/219075/Alfresco2.PNG[/img]

With the Java Content Repository API, you can indeed add content with the method Node.setProperty("cm:content", "my new content"), but I think you also need to specify a MIME type. I'm not sure you can put a simple string inside the cm:content property and then search for it with Alfresco.

My understanding was you were actually assigning the stream into cm:content. Weren't you?
Anyways, I guess you're right about the mime-type, this is what I suppose should go into cm:content:
http://wiki.alfresco.com/wiki/Data_Dictionary_Guide#Data_Types
ContentData(java.lang.String contentUrl, java.lang.String mimetype, long size, java.lang.String encoding)

Oh, and BTW, this attempt of mine did not pass anyways :)

<property name="com:textContent">
   <type>cm:content</type>
</property>

ethan
Champ in-the-making

<property name="com:textContent">
   <type>cm:content</type>
</property>

This is because the type must be d:content, not cm:content ;) (Look at the /alfresco/WEB-INF/classes/alfresco/model/contentModel.xml file.)

Sorry, my mistake, I thought your XML was already inside the PDF file. You could take a look at the metadata extractors, which are called after a file is uploaded to Alfresco. Maybe you could implement your own extractor to parse the XML file associated with your PDF file and modify the cm:content property of your PDF file node.

You could also check Content transformation and OCR integration (here and there).

As I'm not skilled enough with this part of the Alfresco process, I can't provide more precise information =( Hope it helps, though.

jzaruba
Champ in-the-making
As I'm not skilled enough with this part of the Alfresco process, I can't provide more precise information =( Hope it helps, though.

Thanks for your time & effort, I'm gonna look into it.
Cheers
  JZ