<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Solr Does Not Index Xml Elements / Attributes in Alfresco Archive</title>
    <link>https://connect.hyland.com/t5/alfresco-archive/solr-does-not-index-xml-elements-attributes/m-p/280411#M233541</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hello,&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I have noticed that Solr sometimes does not index XML element names and XML attribute values.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Element content, on the other hand, is always indexed. &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;I am using Alfresco Community 4.2.c on Linux. The installation was performed using the default options.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The following steps reproduce the issue:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;1. Start Alfresco (./alfresco.sh start) and log in to Alfresco Share. &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;2. Click on "Repository". &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3. Click on "Upload" and upload a simple XML file. You can use the attached XML file as an example (remove the "_.txt" extension).&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;4. Wait for Solr to finish indexing the new file. One minute should be enough.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;5. Restart Alfresco (./alfresco.sh restart). &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;6. Refresh the browser page. &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;7. Click on "Copy to" in the menu for the file from step 3. You can use the same folder as the copy target. &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;8. Wait for Solr to finish indexing the copy, then search for an attribute value that appears in the file, e.g. "myvalueb". &lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Expected result: the search results list both the file from step 3 and the copy from step 7.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Actual result: the search results show only the file from step 3. &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Searching for element content (e.g. "myvaluea") lists both files. 
&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Please note: the problem sometimes occurs without restarting Alfresco, but the restart is the only way I have found to reproduce it consistently. &lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I am attaching two log files from runs of the test above: one with the default logging level and one with DEBUG output enabled for Solr.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;In the "catalina_solr_debug.txt" file, you can see that the file from step 3 is indexed with its XML elements and attributes:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:52:30:707 content.wire:70 - &amp;lt;&amp;lt; "&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;[\n]"&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:52:30:707 content.wire:70 - &amp;lt;&amp;lt; "&amp;lt;mytaga&amp;gt;[\n]"&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:52:30:707 content.wire:70 - &amp;lt;&amp;lt; "&amp;lt;mytagb&amp;gt;myvaluea&amp;lt;/mytagb&amp;gt;[\n]"&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:52:30:707 content.wire:70 - &amp;lt;&amp;lt; "&amp;lt;mytagc myattr="myvalueb"/&amp;gt;[\n]"&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:52:30:708 content.wire:70 - &amp;lt;&amp;lt; "&amp;lt;/mytaga&amp;gt;[\n]"&lt;BR /&gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The copy from step 7 is indexed without them:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;PRE class="language-none line-numbers"&gt;&lt;CODE&gt;&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:56:30:626 content.wire:70 - &amp;lt;&amp;lt; "[\n]"&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:56:30:626 content.wire:70 - &amp;lt;&amp;lt; "myvaluea[\n]"&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:56:30:626 content.wire:70 - &amp;lt;&amp;lt; "[\n]"&lt;BR /&gt;SOLR DEBUG 2013-08-23 12:56:30:626 content.wire:70 - &amp;lt;&amp;lt; "[\n]"&lt;BR /&gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Any help in solving this issue would be greatly appreciated.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Regards,&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Fri, 23 Aug 2013 18:51:33 GMT</pubDate>
    <dc:creator>larophel</dc:creator>
    <dc:date>2013-08-23T18:51:33Z</dc:date>
    <item>
      <title>Solr Does Not Index Xml Elements / Attributes</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/solr-does-not-index-xml-elements-attributes/m-p/280411#M233541</link>
      <description>Hello, I have noticed that Solr sometimes does not index the names of Xml elements and Xml attribute values. Element content, on the other hand, is always indexed. I am using Alfresco 4.2.c Community, on Linux. The installation was performed using the default options. I put together a list of steps, wh</description>
      <pubDate>Fri, 23 Aug 2013 18:51:33 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/solr-does-not-index-xml-elements-attributes/m-p/280411#M233541</guid>
      <dc:creator>larophel</dc:creator>
      <dc:date>2013-08-23T18:51:33Z</dc:date>
    </item>
    <item>
      <title>Re: Solr Does Not Index Xml Elements / Attributes</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/solr-does-not-index-xml-elements-attributes/m-p/280412#M233542</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;I forgot to mention: the XML file must start with the "&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;" declaration; otherwise the problem cannot be reproduced. &lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 23 Aug 2013 19:28:11 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/solr-does-not-index-xml-elements-attributes/m-p/280412#M233542</guid>
      <dc:creator>larophel</dc:creator>
      <dc:date>2013-08-23T19:28:11Z</dc:date>
    </item>
    <item>
      <title>Re: Solr Does Not Index Xml Elements / Attributes</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/solr-does-not-index-xml-elements-attributes/m-p/280413#M233543</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;I have found the cause of the problem.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Alfresco uses a "Transformer" to extract the text to be indexed from a node.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;For the mimetype "text/xml", there are two possible transformers: "TikaAuto" and "StringExtracter".&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The "TikaAuto" transformer returns only the text content of XML elements.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;The "StringExtracter" transformer returns the full content of the XML file.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Alfresco measures how long each transformation takes.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Which transformer is chosen depends on the average time of past transformations.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;This is why sometimes the full text content is indexed and sometimes only the text content of XML elements. &lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 26 Aug 2013 14:54:20 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/solr-does-not-index-xml-elements-attributes/m-p/280413#M233543</guid>
      <dc:creator>larophel</dc:creator>
      <dc:date>2013-08-26T14:54:20Z</dc:date>
    </item>
  </channel>
</rss>

