Hyland Connect

pcuvecle2 · 07-30-2018

Hi,

I am using Alfresco 5.1 and I have XML files to index. My XML contains elements such as:

<paragraph eId="id-00000967-2e30-ecab-ad49-685fecd94436">
   <content>
      <p>Some text</p>
   </content>
</paragraph>

I would like to discard XML attributes such as eId during indexing. For now, if I search for eca (a substring of the eId value), I get results.

I've seen that I could use <charFilter class="solr.HTMLStripCharFilterFactory"/> in the Solr schema.xml, but so far this does not seem to have any effect.

Does anyone know how to achieve this?

Thanks!

pcuvecle2 · 09-07-2018

Answering my own question.

The issue does not actually come from indexing but from text extraction. It seems that the text/xml mimetype is handled by a String extractor that outputs its input unchanged, so the whole XML, markup and attributes included, is sent to the index.

The solution was to create a custom extractor that strips out the XML syntax (similar to HTML extraction) and to use a custom application/xml mimetype to trigger it.
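
For readers who want a starting point, here is a minimal sketch of the "strip out the XML syntax" step using only the JDK's SAX parser. It is an illustration, not the extractor actually used in this thread: the class name XmlTextStripper and the method extractText are hypothetical, and how such code is registered in Alfresco and bound to the custom mimetype is not shown. It keeps only character data, so attribute values such as eId never reach the output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the name and structure are assumptions, not Alfresco API.
final class XmlTextStripper {

    /** Returns only the character data of an XML document; tags and attributes are dropped. */
    static String extractText(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Assumption: the files carry no DOCTYPE. Rejecting it avoids external entity loading.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser parser = factory.newSAXParser();

            StringBuilder text = new StringBuilder();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // Element text only; attributes arrive in startElement, which is not overridden.
                    text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    // Separate the text of adjacent elements so words do not run together.
                    text.append(' ');
                }
            });
            // Collapse indentation and line breaks into single spaces.
            return text.toString().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<paragraph eId=\"id-00000967-2e30-ecab-ad49-685fecd94436\">\n"
                   + "   <content>\n"
                   + "      <p>Some text</p>\n"
                   + "   </content>\n"
                   + "</paragraph>";
        System.out.println(extractText(xml)); // prints "Some text"
    }
}
```

With output like this handed to the indexer instead of the raw file, a search for eca should no longer match on the eId value.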

Indexing XML on Alfresco 5.1.x