asirika
Employee

Overview 

More and more customers are adopting Alfresco Enterprise Search, and their requirements grow day by day. This blog post addresses the need to index content and metadata conditionally in a large repository.

Use Case:

Conditionally perform content indexing for a subset of documents, covering both future uploads and documents that already exist in the repository. Simply put, the customer wants to exclude content indexing for the doctype stt:statements; however, its metadata should still be indexed.

Discussion: Traditionally, the IndexControl aspect has been used to restrict content or metadata indexing (see Control-Indexes for more details). However, this approach may not suit every customer environment, particularly existing repositories that already hold millions of documents, because the aspect has to be applied to each node. As a result, there is a growing need to explore more efficient and viable alternatives.

Thanks to the Configuring Blacklist Sets feature in Alfresco Enterprise Search, it is now possible to define specific doctypes that should be excluded from indexing. Refer to Configuring-Blacklist-Sets for more details. These blacklists are defined in a file whose location is set with the alfresco.mediation.filter-file attribute. The default file is called mediation-filter.yml and must be on the module classpath. Sample content of that file:

mediation:
  nodeTypes:
  contentNodeTypes:
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
    - alfcmis:nodeRef
    - cmis:isImmutable
    - cmis:isLatestVersion
    - cmis:isMajorVersion
    - cmis:isLatestMajorVersion
    - cmis:isVersionSeriesCheckedOut
    - cmis:versionSeriesCheckedOutBy
    - cmis:versionSeriesCheckedOutId
    - cmis:checkinComment
    - cmis:contentStreamId
    - cmis:isPrivateWorkingCopy
    - cmis:allowedChildObjectTypeIds
    - cmis:sourceId
    - cmis:targetId
    - cmis:policyText
    - trx:password
    - pub:publishingEventPayload

Where:

  • nodeTypes: if the node wrapped in the incoming event has a type which is included in this set, the node processing is skipped.
  • contentNodeTypes: if the node wrapped in the incoming event has a content change associated with it and it has a type which is included in this set, then the corresponding content processing will not be executed. This means nodes belonging to one of the node types in this set won’t have any content indexed in Elasticsearch, while their metadata is still processed.
  • nodeAspects: if the node wrapped in the incoming event has an aspect which is included in this set, the node processing is skipped.
  • fields: fields listed in this set are removed from the incoming node’s metadata. This means fields in this set will not be sent to Elasticsearch for indexing, and therefore they won’t be searchable.
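
The four sets can be easier to reason about as a small decision function. The following is only an illustrative sketch of the semantics described above, not the mediation component's actual code; the names MediationFilter and plan_indexing are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MediationFilter:
    """Hypothetical in-memory model of the four sets in mediation-filter.yml."""
    node_types: set = field(default_factory=set)
    content_node_types: set = field(default_factory=set)
    node_aspects: set = field(default_factory=set)
    fields: set = field(default_factory=set)


def plan_indexing(flt, node_type, aspects, properties, content_changed):
    """Return what would be indexed for one incoming node event."""
    # nodeTypes / nodeAspects: the whole node is skipped.
    if node_type in flt.node_types or flt.node_aspects & set(aspects):
        return {"metadata": None, "content": False}
    # fields: listed properties are dropped from the metadata.
    metadata = {k: v for k, v in properties.items() if k not in flt.fields}
    # contentNodeTypes: metadata is kept, content processing is skipped.
    index_content = content_changed and node_type not in flt.content_node_types
    return {"metadata": metadata, "content": index_content}


flt = MediationFilter(
    content_node_types={"stt:statements"},
    node_aspects={"sys:hidden"},
    fields={"trx:password"},
)
statement = plan_indexing(
    flt, "stt:statements", [], {"stt:statementId": "S-1", "trx:password": "x"}, True
)
report = plan_indexing(flt, "cr:financialReport", [], {"cr:vendorCode": "V-9"}, True)
hidden = plan_indexing(flt, "cr:financialReport", ["sys:hidden"], {}, True)
```

In this sketch the stt:statements node keeps its metadata but has no content indexed, the cr:financialReport node gets both, and the hidden node is skipped entirely.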

In our case, we need to add the blacklisted doctypes to the contentNodeTypes set in the yml file.

Solution Implementation

  1. Create two content models.
    1. Model 1: financialReportModel.xml
<?xml version="1.0" encoding="UTF-8"?>

<!-- Custom Model -->

<!-- Note: This model is pre-configured to load at startup of the Repository.  So, all custom -->
<!--       types and aspects added here will automatically be registered -->

<model name="cr:financialReportModel" xmlns="http://www.alfresco.org/model/dictionary/1.0">

   <!-- Optional meta-data about the model -->   
   <description>Custom Model2</description>
   <author></author>
   <version>1.0</version>

   <imports>
   	  <!-- Import Alfresco Dictionary Definitions -->
      <import uri="http://www.alfresco.org/model/dictionary/1.0" prefix="d"/>
      <!-- Import Alfresco Content Domain Model Definitions -->
      <import uri="http://www.alfresco.org/model/content/1.0" prefix="cm"/>
   </imports>

   <namespaces>
      <namespace uri="cr.custom.model" prefix="cr"/>
   </namespaces>
   
   <types>
      <type name="cr:financialReport">
         <title>Financial Reports</title>
         <parent>cm:content</parent>
         <properties>
            <property name="cr:vendorCode">
               <title>vendorCode</title>
               <description></description>
               <type>d:text</type>
               <mandatory>false</mandatory>
               <multiple>false</multiple>
               <index enabled="true">
                  <tokenised>both</tokenised>
               </index>
            </property>
        </properties>
      </type>
      
    </types>
      
</model>
    2. Model 2: financialStatements.xml
<?xml version="1.0" encoding="UTF-8"?>

<!-- Custom Model -->

<!-- Note: This model is pre-configured to load at startup of the Repository.  So, all custom -->
<!--       types and aspects added here will automatically be registered -->

<model name="stt:statementsModel" xmlns="http://www.alfresco.org/model/dictionary/1.0">

   <!-- Optional meta-data about the model -->   
   <description>Custom Model</description>
   <author></author>
   <version>1.0</version>

   <imports>
   	  <!-- Import Alfresco Dictionary Definitions -->
      <import uri="http://www.alfresco.org/model/dictionary/1.0" prefix="d"/>
      <!-- Import Alfresco Content Domain Model Definitions -->
      <import uri="http://www.alfresco.org/model/content/1.0" prefix="cm"/>
   </imports>

   <!-- Introduction of new namespaces defined by this model -->
   <!-- NOTE: The following namespace should be changed to reflect your own namespace -->
   <namespaces>
      <namespace uri="stt.custtom.model" prefix="stt"/>
   </namespaces>
   
   <types>
      <type name="stt:statements">
         <title>statements</title>
         <parent>cm:content</parent>
         <properties>
            <property name="stt:statementId">
               <title>statementId</title>
               <description></description>
               <type>d:text</type>
               <mandatory>false</mandatory>
               <multiple>false</multiple>
               <index enabled="true">
                  <tokenised>both</tokenised>
               </index>
            </property>
        </properties>
      </type>     
    </types>    
</model>

2.  Bootstrap the created models

3. Start the Transform Service. Refer to the Alfresco-Transform-Service official documentation for setup.

4. Create or update mediation-filter.yml and place it in the directory where you have the alfresco-elasticsearch jar files.

In our case, the stt:statements docType, for which content indexing is not required, goes under the contentNodeTypes key in the yml file.

mediation:
  nodeTypes:
  contentNodeTypes:
    - stt:statements
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
    - alfcmis:nodeRef
    - cmis:isImmutable
    - cmis:isLatestVersion
    - cmis:isMajorVersion
    - cmis:isLatestMajorVersion
    - cmis:isVersionSeriesCheckedOut
    - cmis:versionSeriesCheckedOutBy
    - cmis:versionSeriesCheckedOutId
    - cmis:checkinComment
    - cmis:contentStreamId
    - cmis:isPrivateWorkingCopy
    - cmis:allowedChildObjectTypeIds
    - cmis:sourceId
    - cmis:targetId
    - cmis:policyText
    - trx:password
    - pub:publishingEventPayload
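
When the list of excluded doctypes is long or maintained elsewhere, the file could also be generated rather than edited by hand. The following is a minimal sketch using only the Python standard library; the function name render_mediation_filter is hypothetical, and empty sets are written as bare keys to mirror the sample file.

```python
def render_mediation_filter(node_types=(), content_node_types=(),
                            node_aspects=(), fields=()):
    """Render the four blacklist sets in the layout of mediation-filter.yml."""
    sections = [
        ("nodeTypes", node_types),
        ("contentNodeTypes", content_node_types),
        ("nodeAspects", node_aspects),
        ("fields", fields),
    ]
    lines = ["mediation:"]
    for key, values in sections:
        lines.append(f"  {key}:")
        # An empty set leaves the key bare, as in the sample file.
        lines.extend(f"    - {value}" for value in values)
    return "\n".join(lines) + "\n"


yaml_text = render_mediation_filter(
    content_node_types=["stt:statements"],
    node_aspects=["sys:hidden"],
    fields=["cmis:changeToken", "trx:password"],
)
print(yaml_text)
```

The output can be written to mediation-filter.yml and passed to the indexing components in the same way as a hand-written file.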

Live-indexing 

Mediation:

When starting the mediation component, we need to pass the location of the updated mediation-filter.yml in the alfresco.mediation.filter-file attribute.

Content indexing and metadata indexing are enabled by default. Refer to Alfresco-Live-Indexing-app for more details.

java -jar alfresco-elasticsearch-live-indexing-mediation-x.x.x-app.jar \
  --server.port=8081 --spring.activemq.broker-url=tcp://localhost:61616 \
  --spring.activemq.user=admin --spring.activemq.password=admin \
  --alfresco.path-indexing-component.enabled=false \
  --alfresco.accepted-content-media-types-cache.base-url=http://localhost:8090/transform/config \
  --alfresco.mediation.filter-file=file:mediation-filter.yml

Content Indexer

java -jar alfresco-elasticsearch-live-indexing-content-x.x.x-app.jar \
  --server.port=8083 --spring.activemq.broker-url=tcp://localhost:61616 \
  --spring.activemq.user=admin --spring.activemq.password=admin \
  --spring.elasticsearch.rest.uris=http://localhost:9200 

Metadata Indexer

java -jar alfresco-elasticsearch-live-indexing-metadata-x.x.x-app.jar \
  --server.port=8082 \
  --spring.activemq.broker-url=tcp://localhost:61616 \
  --spring.activemq.user=admin --spring.activemq.password=admin \
  --spring.elasticsearch.rest.uris=http://localhost:9200

Reindexing 

When starting the reindexing component, we likewise need to pass the location of the updated mediation-filter.yml in the alfresco.mediation.filter-file attribute.

java -jar alfresco-elasticsearch-reindexing-x.x.x-app.jar \
  --alfresco.reindex.jobName=reindexByIds \
  --spring.elasticsearch.rest.uris=http://localhost:9200 \
  --spring.datasource.url=jdbc:postgresql://localhost:5432/alfresco_25.1_0_ES \
  --spring.datasource.username=username \
  --spring.datasource.password=Password \
  --alfresco.reindex.prefixes-file=file:reindex.prefixes-file.json \
  --spring.activemq.broker-url=nio://localhost:61616 \
  --server.port=9194  --alfresco.reindex.pathIndexingEnabled=false  \
  --alfresco.mediation.filter-file=file:mediation-filter.yml
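
After reindexing, one way to check the outcome is to count stt:statements documents that still carry indexed content. The sketch below only builds the request body with standard Elasticsearch query clauses. The field names TYPE and cm%3Acontent are assumptions about how the index names its fields and should be verified against your own index mapping; the helper name build_content_check_query is hypothetical.

```python
import json


def build_content_check_query(doc_type, type_field="TYPE",
                              content_field="cm%3Acontent"):
    """Build a count query for documents of doc_type that have indexed content.

    type_field and content_field are assumed field names; check the mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {type_field: doc_type}},
                    {"exists": {"field": content_field}},
                ]
            }
        }
    }


body = build_content_check_query("stt:statements")
print(json.dumps(body, indent=2))
```

The body could be sent to the _count endpoint of the index on http://localhost:9200 (the index name depends on your setup). If the assumptions hold, a count of zero would suggest that no content was indexed for the blacklisted doctype.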

Conclusion

By default, metadata and content indexing are enabled across the entire repository during live indexing and reindexing unless explicitly restricted. With the approach described here, we can add as many docType and field entries as needed under the relevant key in mediation-filter.yml and blacklist the docTypes and/or fields we do not want to index in Elasticsearch/OpenSearch.