
Solr Does Not Index Xml Elements / Attributes

larophel
Champ in-the-making
Hello,

I have noticed that Solr sometimes does not index the names of Xml elements and Xml attribute values.
Element content, on the other hand, is always indexed.
I am using Alfresco 4.2.c Community, on Linux. The installation was performed using the default options.

I put together a list of steps, which can be used to reproduce this issue:

1. Start Alfresco (./alfresco.sh start) and log in to Alfresco Share.
2. Click on "Repository".
3. Click on "Upload" and upload a simple Xml file. You can use the attached Xml file as an example (remove the "_.txt" extension).
4. Wait some time to ensure that Solr finishes processing / indexing the new file. One minute should be enough.
5. Restart Alfresco (./alfresco.sh restart).
6. Refresh the browser page.
7. Click on "Copy to" in the menu for the file from step 3. You can use the same folder as the copy target.
8. Wait some time to ensure that Solr finishes processing the new file. Then search for an attribute value that appears in the file, e.g. "myvalueb".

Expected result: the search result lists both files, the one from step 3 and the copy from step 7.
Actual result: the search result only shows the file from step 3.
Searching for element content (e.g. "myvaluea") will list both files.

Please note: I have observed that sometimes the restart of Alfresco is not needed to reproduce the problem. However, the only consistent way I have found to reproduce this issue is with the restart.

I am attaching two log files from runs of the test described above: one with the default logging level, and one with DEBUG output enabled for Solr.
In the "catalina_solr_debug.txt" file, you can observe that the file from step 3 gets indexed with Xml elements and attributes:


SOLR DEBUG 2013-08-23 12:52:30:707 content.wire:70 - << "<?xml version="1.0" encoding="UTF-8"?>[\n]"
SOLR DEBUG 2013-08-23 12:52:30:707 content.wire:70 - << "<mytaga>[\n]"
SOLR DEBUG 2013-08-23 12:52:30:707 content.wire:70 - << "<mytagb>myvaluea</mytagb>[\n]"
SOLR DEBUG 2013-08-23 12:52:30:707 content.wire:70 - << "<mytagc myattr="myvalueb"/>[\n]"
SOLR DEBUG 2013-08-23 12:52:30:708 content.wire:70 - << "</mytaga>[\n]"


The file from step 7 is indexed without them; only the element text content remains:


SOLR DEBUG 2013-08-23 12:56:30:626 content.wire:70 - << "[\n]"
SOLR DEBUG 2013-08-23 12:56:30:626 content.wire:70 - << "myvaluea[\n]"
SOLR DEBUG 2013-08-23 12:56:30:626 content.wire:70 - << "[\n]"
SOLR DEBUG 2013-08-23 12:56:30:626 content.wire:70 - << "[\n]"


Any help in solving this issue would be greatly appreciated.

Regards,
2 REPLIES

larophel
Champ in-the-making
I forgot to mention: the Xml file must have the "<?xml version="1.0" encoding="UTF-8"?>" header, otherwise the problem cannot be reproduced.

larophel
Champ in-the-making
I have found the cause of the problem.
Alfresco uses a "Transformer" to obtain the to-be-indexed text from a node.
For the mimetype "text/xml", there are two possible transformers: "TikaAuto" and "StringExtracter".

The "TikaAuto" transformer only returns Xml element text content.
The "StringExtracter" transformer returns the full Xml file content.
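
To illustrate the difference between the two outputs, here is a minimal Python sketch. It is not Alfresco's transformer code: the function names are hypothetical, and the sample document is reconstructed from the DEBUG log lines in the first post. It only mimics the two behaviours described above (element text only versus the full file content).

```python
import xml.etree.ElementTree as ET

# Sample file, reconstructed from the DEBUG log in the first post.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<mytaga>\n'
    '<mytagb>myvaluea</mytagb>\n'
    '<mytagc myattr="myvalueb"/>\n'
    '</mytaga>\n'
)


def extract_element_text_only(xml_text: str) -> str:
    """Hypothetical stand-in for the "TikaAuto" behaviour:
    keep text nodes only, dropping element names and attributes."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return "".join(root.itertext())


def extract_full_content(xml_text: str) -> str:
    """Hypothetical stand-in for the "StringExtracter" behaviour:
    hand the raw file content to the indexer unchanged."""
    return xml_text


text_only = extract_element_text_only(SAMPLE_XML)
full = extract_full_content(SAMPLE_XML)

# The attribute value survives only in the full-content variant.
print("myvalueb" in text_only)
print("myvalueb" in full)
```

With the text-only output, a search for "myvaluea" can still match, but "myvalueb" and the tag names are simply not present in what gets indexed.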

Alfresco measures how long each transformation takes.
Which transformer is chosen depends on the average time of its past transformations.
This is why the full file content is indexed in some cases and only the Xml element text content in others.
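
The selection behaviour can be sketched roughly as follows. This is a simplified Python model, not Alfresco's implementation: the class and function names are hypothetical, the timings are made up, and the rule "lowest average time wins" is an assumption on my part (all I have confirmed is that the choice depends on the average time).

```python
from dataclasses import dataclass


@dataclass
class TransformerStats:
    """Running timing statistics for one transformer (hypothetical model)."""
    name: str
    total_ms: float = 0.0
    count: int = 0

    def record(self, elapsed_ms: float) -> None:
        # Accumulate the duration of one finished transformation.
        self.total_ms += elapsed_ms
        self.count += 1

    @property
    def average_ms(self) -> float:
        # Assumption: a transformer with no history counts as 0 ms.
        return self.total_ms / self.count if self.count else 0.0


def pick_transformer(candidates):
    """Assumed rule: choose the candidate with the lowest average time."""
    return min(candidates, key=lambda stats: stats.average_ms)


tika = TransformerStats("TikaAuto")
string = TransformerStats("StringExtracter")

tika.record(40.0)    # made-up timings
string.record(5.0)
first_choice = pick_transformer([tika, string]).name

string.record(200.0)  # one slow run shifts the average to 102.5 ms
second_choice = pick_transformer([tika, string]).name

print(first_choice, second_choice)
```

Under this model, a single slow run is enough to flip the choice from one transformer to the other, which would explain why two identical files can end up indexed differently.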