Alfresco + XML + Lucene
‎08-10-2006 05:00 AM
Hello!
Alfresco is a great product, but it loses many of its advantages when it indexes XML as plain text only.
We need to store XML documents and find all documents in which the search terms appear inside a given XML node. We don't need full XPath, just the parent node. E.g. search toto@footer, where footer is an XML node of our document.
The complete XML structure is not known in advance and will certainly evolve with time.
We think this could easily be done at indexing time by dynamically adding a Lucene field for each XML tag encountered, as Alfresco does for metadata indexing.
The question: what is the cleanest way to implement this? Which Alfresco classes should be overridden? It is quite difficult at the moment to understand how Alfresco interacts with Lucene.
Thanks for any advice!
P.S. The problem is quite similar to http://forums.alfresco.com/viewtopic.php?t=277, but simpler in our case. Is Alfresco going to do something for XML document indexing?
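
To make the idea concrete, here is a minimal sketch of "one field per XML tag", written in plain Python rather than against Alfresco or Lucene. The helper names (`direct_text`, `fields_from_xml`, `matches`) and the sample document are made up for illustration; a real implementation would add Lucene fields inside the indexer instead of building a dictionary, and namespaced tags would need extra handling.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def direct_text(elem):
    """Text that sits directly inside `elem`, excluding text of child elements."""
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    # Collapse runs of whitespace left over from pretty-printed XML.
    return " ".join(" ".join(parts).split())


def fields_from_xml(xml_text):
    """Build one 'field' per tag name: tag -> list of direct text values."""
    fields = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        text = direct_text(elem)
        if text:
            fields[elem.tag].append(text)
    return dict(fields)


def matches(fields, term, tag):
    """True if `term` occurs as a whole word in any value stored under `tag`."""
    term = term.lower()
    return any(term in value.lower().split() for value in fields.get(tag, []))


doc = "<report><body>hello world</body><footer>contact toto for details</footer></report>"
fields = fields_from_xml(doc)
print(matches(fields, "toto", "footer"))  # True: the term is inside <footer>
print(matches(fields, "toto", "body"))    # False: the term is not inside <body>
```

Because the fields are derived from whatever tags occur, nothing has to be known about the XML structure in advance, which is the property asked for above.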
5 Replies

‎08-29-2006 11:30 AM
I would like to do something very similar. Does anyone know the plans?

‎08-30-2006 10:28 AM
Same problem!

‎09-05-2006 05:01 AM
Hi
This is something we plan to address in the future. At the moment, content, including XML, is just converted into text and indexed. You could write your own action to extract metadata and populate some predefined properties, just as there is for Word documents and PDFs.
Properties need to be defined in the model. There is no support for arbitrary properties defined by whatever tags appear in the XML.
At the moment, you can extract information from elements in your XML documents into a defined property and then use that for search. An action would be best for this.
We do not support internal queries into XML documents. This does sound possible, but not with the current search API.
There is no reason why you cannot add additional fields to the Lucene index when you find an XML document. You would have to alter LuceneIndexerImpl to do this when it encounters an XML type. You may also need to add support in LuceneAnalyser for determining the type of each field.
Regards
Andy
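
As a rough illustration of the "extract into predefined properties" route, the sketch below fills only properties that are declared up front and ignores every other element. It is plain Python, not an Alfresco action: the `MODEL` dictionary, the property names `my:footer` and `my:pageCount`, and the element names are all hypothetical stand-ins for a real content model.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: property name -> (source element, converter).
# Only properties declared here can be populated.
MODEL = {
    "my:footer": ("footer", str),
    "my:pageCount": ("pages", int),
}


def extract_properties(xml_text, model=MODEL):
    """Populate the declared properties from the first matching descendant element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop, (tag, convert) in model.items():
        elem = root.find(f".//{tag}")
        if elem is not None and elem.text and elem.text.strip():
            props[prop] = convert(elem.text.strip())
    return props


doc = "<report><pages>12</pages><note>ignored</note><footer>contact toto</footer></report>"
props = extract_properties(doc)
print(props)
```

The `<note>` element has no declared property, so it is dropped. That is the trade-off of this approach compared with adding index fields dynamically: new tags stay invisible to search until the model is extended.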

‎09-19-2006 05:43 AM
Perhaps it would be useful to look at the Compass project. Compass addresses this kind of problem by means of an Object to Search Engine Mapping or an XML to Search Engine Mapping. The bottom line is quite simple: a Hibernate-like mapping is defined that maps object properties or XML fields to Lucene fields. In other words, an object or XML file with a certain layout is mapped to a Lucene document, which is then inserted into the index. Compass is open source, so you can easily take a deeper look at how this is achieved…
kind regards

‎09-17-2007 05:36 AM
Hi
There is now support for XPath extraction from XML documents into metadata. You can then search the metadata rather than the whole document as transformed into text.
Andy
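
For readers who want to see the shape of this, here is a small sketch of XPath-driven extraction followed by a search restricted to one metadata field. It does not use Alfresco's own extracter or its configuration format. It relies on the limited XPath subset in Python's standard `xml.etree.ElementTree`, and the mapping, the function names, and the sample documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: metadata name -> XPath expression (ElementTree subset).
XPATH_MAPPING = {
    "title": "./header/title",
    "footer": ".//footer",
}


def extract_metadata(xml_text, mapping=XPATH_MAPPING):
    """Evaluate each XPath and keep the non-empty text of the matched elements."""
    root = ET.fromstring(xml_text)
    metadata = {}
    for name, xpath in mapping.items():
        values = [e.text.strip() for e in root.findall(xpath)
                  if e.text and e.text.strip()]
        if values:
            metadata[name] = values
    return metadata


def search_metadata(docs, name, term):
    """Return the ids of documents whose metadata field `name` contains `term`."""
    term = term.lower()
    return sorted(
        doc_id for doc_id, xml_text in docs.items()
        if any(term in value.lower()
               for value in extract_metadata(xml_text).get(name, []))
    )


docs = {
    "a.xml": "<doc><header><title>toto report</title></header><footer>page 1</footer></doc>",
    "b.xml": "<doc><header><title>summary</title></header><footer>written by toto</footer></doc>",
}
print(search_metadata(docs, "footer", "toto"))  # ['b.xml']
```

Both sample documents contain "toto" somewhere, so a plain full-text search would return both. Restricting the search to the `footer` field returns only the second one, which is the toto@footer behaviour requested at the top of the thread.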
