Building an XML Repository

ctanis
Champ in-the-making
We are interested in using Alfresco as an XML repository. 

In our system, incoming documents would be validated against a schema, and meta-data would be extracted from the XML.  Documents would then be sorted into spaces or categories based on this meta-data. Presumably these could be implemented as custom actions?
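As a rough illustration of the validate-then-extract step described above, here is a minimal sketch using only the JDK's XML APIs. The class name `XmlIntake`, the method `validateAndExtract`, and the idea of passing a map of property names to XPath expressions are assumptions made for the example; none of this is Alfresco API, and wiring it into a custom action is left out.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Alfresco.
class XmlIntake {
    /**
     * Validates the document against the schema, then evaluates each XPath
     * expression and returns property name -> extracted string value.
     * Throws org.xml.sax.SAXException if the document is not schema-valid.
     */
    static Map<String, String> validateAndExtract(
            String xml, String xsd, Map<String, String> xpathsByName) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for schema validation of a DOM
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
        schema.newValidator().validate(new DOMSource(doc));

        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : xpathsByName.entrySet()) {
            metadata.put(e.getKey(), xpath.evaluate(e.getValue(), doc));
        }
        return metadata;
    }
}
```

The returned map could then drive whatever sorting rule decides the target space or category for the document.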

We expect massive load on our search application, so it would be good to mirror the Lucene index and contentstore (though that could be as simple as a periodic rsync). Is meta-data stored in the Lucene index, or only via Hibernate?

I imagine that a lot of our functionality would be custom-implemented Java code, and the Alfresco user interface would be used as an admin tool for editing XML and managing rules for sorting.

My immediate goal is to implement the fundamentals of indexing XML documents.  How would I go about adding a custom indexing component for XML documents? Specifically, which configuration files are related and what interfaces need to be implemented? Or is there some other way to do this?

I know you are planning real XML functionality at some point, and I was curious how you envision such a system working. Is it worth implementing something like this with the early Alfresco releases, or will it be made obsolete by the planned XML functionality? Is Alfresco appropriate for something like this?

Thanks,
craig
11 REPLIES

rdanner
Champ in-the-making
We really have to start a conversation on Alfresco plugin frameworks. We need some way to build custom config and code and deploy it so that it automatically meshes with the "out of the box" software.

This would put Alfresco miles ahead of any other open source system, and probably out in front, period. This kind of functionality is critical. When people modify the core source base they put their entire system at risk in future upgrades. That is not a good way to work.

cheers
-R

andy
Champ on-the-rise
Hi

Is meta-data stored in the Lucene index or only via hibernate?

Meta data is indexed in Lucene. It would be possible to store it in Lucene as well, but we currently go to Hibernate to get the meta data. Basically, if any property were not stored in the index we would have to get all properties (at once) from Hibernate anyway, so that is what we do.

How would I go about adding a custom indexing component for XML documents?

At the moment tokenisation is controlled by the property type.
If we added an XML property type, you could plug in whatever Lucene tokeniser you fancy for your XML doc.

The intention is to allow the tokeniser to be specified in the property definition (or use the default for the property type). This is not available yet. The task is in JIRA as AR-177.

Content is indexed after conversion from XML to text form. You could control what conversion takes place and therefore how XML content is indexed.

Dave may have something to say about adding a pluggable content pipeline for processing content as it is added. This could do the sort of stuff you are talking about: extracting meta data, building sub structure etc.

Cheers

Andy

ctanis
Champ in-the-making
I agree that discussing a plugin framework is a good idea, but I was starting to get the impression that a lot of what I wanted to do was possible without touching the Java source.
At this point my goal is to appraise the capabilities of Alfresco, not necessarily to build a final working system.

Thanks for the tips, Andy. What about adding new fields to the Lucene index, for example to index each XML element as a different field? One option is to have the desired fields as meta-data in a new content type, but I can imagine a scenario in which more flexibility is desired. This is why I thought I might need a new indexing component, as opposed to just a new tokenizer.

More importantly, will Alfresco allow arbitrary fields in the index? Will Lucene performance suffer if this is too arbitrary?

Thanks,
Craig

andy
Champ on-the-rise
Hi

The indexer takes care of indexing whatever node is given to it, including all of its properties. At the moment you can have multi-valued properties and they will be indexed together. All other properties will be indexed in their own right. Indexing, tokenising and the tokeniser can be controlled via the data dictionary.

I would suggest mapping XML elements to nodes and attributes to properties as a first go. The Lucene index supports an internal attribute, PATH, which allows you to do the name step bit of XPath on elements. I am guessing you will want something like this. The XPath selectNodes and selectAttributes on the searcher will also do what you expect. I would put the text() in a content node of some sort. You should then have path, content and meta data available for searching.
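The element-to-node, attribute-to-property mapping suggested above might look roughly like the following. This is only a sketch over the JDK's DOM API: `XmlNodeMapper` and `MappedNode` are hypothetical stand-ins for repository nodes, not Alfresco classes, and the `path` field simply mimics the name-step path that the PATH attribute would let you query.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical mapper, not part of Alfresco.
class XmlNodeMapper {
    /** Stand-in for a repository node created from one XML element. */
    static final class MappedNode {
        final String path;                                            // e.g. /book/chapter
        final Map<String, String> properties = new LinkedHashMap<>(); // from attributes
        final StringBuilder text = new StringBuilder();               // direct text() content
        final List<MappedNode> children = new ArrayList<>();          // from child elements
        MappedNode(String path) { this.path = path; }
    }

    static MappedNode map(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        return map(root, "");
    }

    private static MappedNode map(Element el, String parentPath) {
        MappedNode node = new MappedNode(parentPath + "/" + el.getTagName());
        // Each attribute becomes a property on the node.
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            node.properties.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
        }
        // Child elements become child nodes; text is collected as content.
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                node.children.add(map((Element) c, node.path));
            } else if (c.getNodeType() == Node.TEXT_NODE
                    || c.getNodeType() == Node.CDATA_SECTION_NODE) {
                node.text.append(c.getNodeValue());
            }
        }
        return node;
    }
}
```

A real import action would create repository nodes instead of `MappedNode` objects, but the traversal would have the same shape.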

This way you would not have to change the indexing.

You could map the XML in any other way: maybe keep the document as one node and just add meta data. Put the key attributes as properties, extract the text into a content property and keep the XML as another content property.

I have not looked at any performance here, particularly in generating so many small documents.

I would suggest something like our import/export would do the trick.
For import we upload a file and then perform an action on it.
You could do the same for your XML content, and it could be automated as you put stuff in a given space. It could get moved and expanded into some other area. Please note that we have not set out to be an XML store with XQuery etc.

Categorisation could be done at the same time. It would then be available as part of the Lucene query language. If you split the document up you could categorise elements as well as the document. The category is queryable as part of PATH, as categories are treated as virtual folders.

The indexer uses a field for each attribute, mainly to make searching direct against the index sensible using lucene. This and PATH/TYPE probably gives 80% of what you want to do in a search.

It would be interesting to know a bit more about your problem.

Cheers

Andy

henry
Champ in-the-making
Andy:

I'm also interested in using Alfresco as an XML repository. But we'd like to integrate an XML database into Alfresco so as to process a massive XML repository easily. The XML database already provides search functionality.
All the XML content would be stored in this separate XML database. When the system does a search, it would search both the original Lucene index and the XML database and return the results from both. Do you think there are any technical obstacles to this approach?


thanks in advance

henry

ctanis
Champ in-the-making
I have written a new action that will set a bunch of properties on a node based on the XML in the node's content.

What benefit do I get from having properties on a node without having any corresponding aspect to manage or use them?

Is there a reason not to do that?

I would like to be able to search based on these properties. Apparently, the UI search does not allow this, but presumably I can search on them using the API?

Also, it seems like the currently available actions, specifically those for moving or categorizing nodes, can't use properties to make decisions, even when the properties are part of installed aspects. Would it be appropriate/doable to make a custom action that did such a thing?

Is dealing with properties without aspects philosophically compatible with Alfresco?

Thanks,
Craig

andy
Champ on-the-rise
Hi Henry

You will have to do quite a lot of work to get an XML repository linked in.

I am assuming that the database is not compatible with Hibernate, and you would not want Hibernate in any case, so you will require a new NodeService to support the persistence of meta data. You will also need a ContentService against an XML database.

The indexing into Lucene should work given the node service and content services above. However, I am not sure if you would do all the search against Lucene or the XML database. You could certainly combine the two using the NodeRef for each node as the key. The search can be configured to do what you want and you could have a SearchService that just goes to the XML database(s) or that also includes Lucene in some way.

You would have to think about how scores from each result set are used to produce an overall result set. There are ideas in Lucene for this.
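One simple way to combine the two result sets, offered purely as an illustration, is to normalise each set's scores by its best hit and then take a weighted sum per node reference. The class `ResultMerger`, the use of plain strings in place of NodeRef, and the `luceneWeight` parameter are all assumptions for this sketch; it is not how Alfresco or Lucene merge results.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Alfresco or Lucene.
class ResultMerger {
    /** Scales scores so that the best hit in the set scores 1.0. */
    static Map<String, Double> normalise(Map<String, Double> scores) {
        double max = 0.0;
        for (double s : scores.values()) {
            max = Math.max(max, s);
        }
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            out.put(e.getKey(), max > 0 ? e.getValue() / max : 0.0);
        }
        return out;
    }

    /**
     * Merges two score maps keyed by node reference (a String stands in for
     * NodeRef here) and returns the references ordered best first.
     */
    static List<String> merge(Map<String, Double> lucene,
                              Map<String, Double> xmlDb,
                              double luceneWeight) {
        Map<String, Double> combined = new HashMap<>();
        normalise(lucene).forEach(
                (ref, s) -> combined.merge(ref, luceneWeight * s, Double::sum));
        normalise(xmlDb).forEach(
                (ref, s) -> combined.merge(ref, (1 - luceneWeight) * s, Double::sum));

        List<String> refs = new ArrayList<>(combined.keySet());
        refs.sort((a, b) -> {
            int byScore = Double.compare(combined.get(b), combined.get(a));
            return byScore != 0 ? byScore : a.compareTo(b); // stable tie-break
        });
        return refs;
    }
}
```

Nodes found by both back ends naturally rise to the top with this scheme; whether that is the right behaviour depends on how comparable the two scoring models really are.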

Regards

Andy

rdanner
Champ in-the-making
Hibernate is working on an XML-based implementation. I doubt what they are doing is directly related to what you are thinking of. Anyway, check the site; you can download the beta code.

podz
Champ in-the-making
The natural backend for this would be the Berkeley XML-DB, which has native XML support, extensible search matching at the key:value level, replication, support for 4GB files and 256TB databases.

It's GPL if you use it in a GPL product, otherwise you need to license it.

Using a non-XML backend keeps exactly the same problems that Documentum has: inability to do intelligent searches, inability to do version diffs, and inability to escape from Microsoft once you have stored 100k MS Word documents.

Simply reinventing the wheel with a better user interface will only make low-end users happy. If you really want to sell this thing, you need to enable businesses to move away from Microsoft.