07-30-2018 10:37 AM
Hi,
I am using Alfresco 5.1 and I have XML files to index. My XML contains tags such as
<paragraph eId="id-00000967-2e30-ecab-ad49-685fecd94436">
<content>
<p>Some text</p>
</content>
</paragraph>
I would like to discard XML attributes such as eId during indexing. For now, if I search for eca (a substring of the eId value), I get results.
I've seen that I could use <charFilter class="solr.HTMLStripCharFilterFactory"/> in the SOLR schema.xml, but so far this does not seem to have any effect.
Does anyone know how to achieve this?
Thanks!
09-07-2018 06:00 AM
Answering my own question.
The issue actually comes from the extraction, not from the indexing. It seems that the text/xml mimetype is handled by a String extractor that outputs its input unchanged. Therefore, the whole XML, tags and attributes included, goes to the index.
The solution was to create a custom extractor that strips out the XML syntax (similar to HTML extraction) and to use a custom application/xml mimetype to trigger it.
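For anyone looking for a starting point, here is a minimal sketch of the stripping step only, not the actual extractor. It uses the JDK's StAX parser to keep element text and drop tags and attributes. The class and method names (XmlTextExtractor, extractText) are hypothetical, and the Alfresco side (registering the extractor and mapping the custom mimetype) is not shown, since that depends on your setup.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: returns only the text content of an XML document,
// so attribute values such as eId never reach the index.
final class XmlTextExtractor {

    static String extractText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or resolve external entities in untrusted content.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        // Merge adjacent text and CDATA sections into single CHARACTERS events.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                // Only character data is kept; start tags (and their attributes),
                // end tags, comments and processing instructions are skipped.
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    String chunk = reader.getText().trim();
                    if (!chunk.isEmpty()) {
                        // Separate chunks with a space so that words from
                        // neighbouring elements are not glued together.
                        if (text.length() > 0) {
                            text.append(' ');
                        }
                        text.append(chunk);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }
}
```

Called on the paragraph from my question, extractText returns just Some text, so a search for eca no longer matches. Note that each text chunk is trimmed and joined with a single space, which is fine for indexing but does not preserve the original whitespace.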