Alfresco Search Enterprise is the search engine for Alfresco Enterprise deployments. It relies on an external Elasticsearch 7.x or OpenSearch 1.x service. This blog post covers implementation details for the Indexing and Reindexing components.
Indexing
The Repo Event Channel is a topic populated by the Repository that delivers a copy of every incoming message to each subscriber. Since a message represents a node event, the following scenarios need to be addressed:
If we define the indexing components (metadata, content and path) as direct subscribers of the event channel, it won't be possible to scale them up. With multiple instances of the content indexing component, each instance receives a copy of the same node event for a given node id, so each instance triggers the same indexing workflow for the same node and the work is needlessly duplicated.
Note that permissions (ACLs) are indexed by metadata indexing, as they are part of the incoming message from the Repository. Documents in the search index hold metadata, permissions, content and path together. Once metadata indexing has created the document in the index, content and path can be updated.
The Live Indexing Mediator is in charge to
The mediator consumes events from the event2 topic in ActiveMQ. The default value can be customized using the following property:
alfresco.event.topic = activemq:topic:alfresco.repo.event2
The mediator places new messages on three different queues, one each for metadata, content, and path. The Live Indexing component consumes these messages to perform the required action, such as indexing new metadata or requesting a transformation to text so that the content can be indexed. The default values can be customized using the following properties:
alfresco.metadata.event.channel = activemq:queue:org.alfresco.search.metadata.event
alfresco.content.event.channel = activemq:queue:org.alfresco.search.content.event
alfresco.path.event.channel = activemq:queue:org.alfresco.search.path.event
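The difference between subscribing indexers directly to the topic and placing a queue between the topic and the indexers can be illustrated with a small in-memory simulation. This is only a sketch of topic versus queue delivery semantics: the `Topic` and `Queue` classes, the round-robin dispatch, and the event names are hypothetical and do not use ActiveMQ or any Alfresco code.

```python
from collections import deque


class Topic:
    """Publish/subscribe channel: every subscriber gets its own copy."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


class Queue:
    """Point-to-point channel: each message goes to exactly one consumer."""

    def __init__(self):
        self.messages = deque()

    def put(self, message):
        self.messages.append(message)

    def drain(self, consumers):
        # Hand messages out round-robin to simulate competing consumers.
        i = 0
        while self.messages:
            consumers[i % len(consumers)](self.messages.popleft())
            i += 1


def run_direct(events, instances=2):
    """Indexer instances subscribe directly to the topic."""
    topic = Topic()
    handled = [[] for _ in range(instances)]
    for log in handled:
        topic.subscribe(log.append)
    for event in events:
        topic.publish(event)
    return handled


def run_mediated(events, instances=2):
    """A single mediator subscribes to the topic and forwards to a queue."""
    topic, queue = Topic(), Queue()
    topic.subscribe(queue.put)
    handled = [[] for _ in range(instances)]
    for event in events:
        topic.publish(event)
    queue.drain([log.append for log in handled])
    return handled


events = ["node-1", "node-2", "node-3", "node-4"]
direct = run_direct(events)      # every instance handles all four events
mediated = run_mediated(events)  # the four events are split between instances
print(direct)
print(mediated)
```

In the direct case each instance repeats the same work, while in the mediated case adding instances spreads the events across them.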
Blacklisted attributes
The Mediation component relies on a configuration file which acts as a blacklist containing
The blacklist file path / reference can be specified through usual Spring configuration capabilities. That means:
The default value of that property is classpath:mediation-filter.yml, which points to a file included in the bundle that provides the following rules:
mediation:
  nodeTypes:
  contentNodeTypes:
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
    - alfcmis:nodeRef
    - cmis:isImmutable
    - cmis:isLatestVersion
    - cmis:isMajorVersion
    - cmis:isLatestMajorVersion
    - cmis:isVersionSeriesCheckedOut
    - cmis:versionSeriesCheckedOutBy
    - cmis:versionSeriesCheckedOutId
    - cmis:checkinComment
    - cmis:contentStreamId
    - cmis:isPrivateWorkingCopy
    - cmis:allowedChildObjectTypeIds
    - cmis:sourceId
    - cmis:targetId
    - cmis:policyText
    - trx:password
    - pub:publishingEventPayload
Regular expressions are not supported when specifying values for the different categories. Every excluded type, aspect, or field must be listed individually.
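The exact-match behaviour of the field blacklist can be sketched as a simple set lookup. This is an illustration only, not the mediator's actual code: the `filter_fields` helper and the sample node payload are hypothetical, and the blacklist is a small subset of the default rules.

```python
# Subset of the default blacklisted fields; names are matched literally.
BLACKLISTED_FIELDS = {
    "cmis:changeToken",
    "alfcmis:nodeRef",
    "trx:password",
}


def filter_fields(properties, blacklist=BLACKLISTED_FIELDS):
    """Return a copy of properties without the blacklisted field names."""
    return {
        name: value
        for name, value in properties.items()
        if name not in blacklist
    }


# Hypothetical node payload, used only for this example.
node_properties = {
    "cm:name": "report.pdf",
    "cmis:changeToken": "abc",
    "trx:password": "secret",
}

indexable = filter_fields(node_properties)
print(indexable)

# A wildcard entry is treated as a literal field name, so it removes nothing.
unchanged = filter_fields(node_properties, blacklist={"cmis:*"})
print(unchanged == node_properties)
```

Because matching is literal, an entry such as `cmis:*` does not exclude the `cmis:` fields; each one has to be named.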
There is no filtering property for path indexing, but the cm:indexControl aspect can be set to prevent a folder hierarchy from being indexed.
Note that, in addition to this filtering on the Search Enterprise side, the Repository configuration ignores a set of types, aspects and associations defined using the following properties:
repo.event2.filter.nodeTypes=sys:*, fm:*, cm:thumbnail, cm:failedThumbnail, cm:rating, rma:rmsite include_subtypes
repo.event2.filter.nodeAspects=sys:*
repo.event2.filter.childAssocTypes=rn:rendition
Instead of pointing the property at a different blacklist file, you can keep the default value (classpath:mediation-filter.yml) and provide your own mediation-filter.yml by prepending it to the application classpath.
By means of that file, the admin can define a list of fields that won't be sent to Elasticsearch or OpenSearch.
A field having a match in such blacklist could be:
Reindexing
The Reindexing component is responsible for reindexing the full repository or a subset of Alfresco nodes.
This may be useful when:
Reindexing app is built on top of Spring Batch framework:
Before running the Reindexing app, you must generate a JSON map of namespaces to prefixes. The project Alfresco Model Namespace-Prefix Mapping can be used to create this reindex.prefixes-file.json file. This external file can be specified using the following environment variable:
alfresco.reindex.prefixes-file=file:reindex.prefixes-file.json
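To give an idea of what a namespace-to-prefix map enables, the sketch below converts a fully qualified name of the form `{namespace}localName` into its short `prefix:localName` form. It assumes a flat JSON object keyed by namespace URI; the layout of the file actually generated by the mapping project may differ, and the `to_prefixed` helper is hypothetical.

```python
import json

# Hypothetical excerpt. The real file is generated by the mapping project
# and its exact layout may differ from this flat object.
PREFIXES_JSON = """
{
  "http://www.alfresco.org/model/content/1.0": "cm",
  "http://www.alfresco.org/model/system/1.0": "sys"
}
"""


def to_prefixed(qname, namespace_to_prefix):
    """Convert '{namespace}localName' into 'prefix:localName'."""
    if not qname.startswith("{") or "}" not in qname:
        raise ValueError(f"not a fully qualified name: {qname!r}")
    namespace, local_name = qname[1:].split("}", 1)
    return f"{namespace_to_prefix[namespace]}:{local_name}"


namespace_to_prefix = json.loads(PREFIXES_JSON)
short_name = to_prefixed(
    "{http://www.alfresco.org/model/content/1.0}name", namespace_to_prefix
)
print(short_name)
```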
Database settings are set using default Spring properties:
spring.datasource.url=jdbc:postgresql://localhost:5432/alfresco
spring.datasource.username=alfresco
spring.datasource.password=alfresco
spring.datasource.hikari.maximumPoolSize=20
# Based on the DataSource configuration an implementation for accessing the repo database is created.
# Sometimes it might happen that it is not possible to autodetect the correct database type.
# This optional property allows you to disable the auto-detection and to specify the database type directly.
# Supported values: postgresql, mysql, mariadb, sqlserver, oracle
alfresco.dbType=
Reindexing values can be specified using the following properties:
alfresco.reindex.jobName=reindexByIds
alfresco.reindex.batchSize=100
alfresco.reindex.pagesize=100
alfresco.reindex.concurrentProcessors=10
alfresco.reindex.fromId=0
alfresco.reindex.toId=20000000000
alfresco.reindex.fromTime=190001010000
alfresco.reindex.toTime=203012312359
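The 12-digit time values above look like a yyyyMMddHHmm timestamp, and the id range combined with a page size suggests paged processing. The sketch below shows how such values could be interpreted. Both the time format and the inclusive paging scheme are assumptions inferred from the default values, not a description of how the Reindexing app works internally, and the helper names are hypothetical.

```python
from datetime import datetime

# Assumed from the 12-digit default values (yyyyMMddHHmm).
TIME_FORMAT = "%Y%m%d%H%M"


def parse_reindex_time(value):
    """Parse a fromTime/toTime style value into a datetime."""
    return datetime.strptime(value, TIME_FORMAT)


def id_pages(from_id, to_id, page_size):
    """Yield inclusive (start, end) id ranges of at most page_size ids."""
    start = from_id
    while start <= to_id:
        end = min(start + page_size - 1, to_id)
        yield start, end
        start = end + 1


from_time = parse_reindex_time("190001010000")
to_time = parse_reindex_time("203012312359")
print(from_time, to_time)

# A small range keeps the example readable; the default toId is far larger.
pages = list(id_pages(0, 250, 100))
print(pages)
```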
Features can also be enabled or disabled using properties:
alfresco.reindex.metadataIndexingEnabled = true
alfresco.reindex.contentIndexingEnabled = true
alfresco.reindex.pathIndexingEnabled = true
Additional instructions for scaling up the reindexing process are available in the official documentation.