asirika
Employee

1. Overview

Alfresco Search Enterprise 3.2 consists of Alfresco Content Services, Elasticsearch Server and the Elasticsearch connectors. According to the official documentation, there are a number of prerequisites, such as ActiveMQ, a PostgreSQL database and Transform Service. Note that Transform Service does not have to be running to extract general metadata.

In this post I will cover how to scale ES during re-indexing and live indexing, and when to use the different ES connector jars.

2. Alfresco Search Enterprise (ASE)

Alfresco Content Services supports the Elasticsearch platform for searching within the repository using Alfresco Search Enterprise 3.2. The Alfresco Search Enterprise module consists of 6 jar files.

[Image: ASE Jar List]

2.1. Re-Indexing

alfresco-elasticsearch-reindexing-3.2.0-app.jar: This is an all-in-one jar file that indexes content, metadata and path for an existing content store.

[Image]

However, this particular jar comes with 3 parameters that can be configured according to business requirements.

# Reindexing services execution

alfresco.reindex.metadataIndexingEnabled = true

alfresco.reindex.contentIndexingEnabled = true

alfresco.reindex.pathIndexingEnabled = true

Therefore, to reindex metadata only, pass the parameters to the above command accordingly, as below.

[Image]
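As a text version of such a command, the sketch below prints a metadata-only invocation built from the three properties listed above. The helper name reindex_command is hypothetical, and the database and Elasticsearch connection settings are left out on the assumption that they are supplied the same way as for a full reindex.

```shell
# Print a reindexing command with each indexing type switched on or off.
# Arguments: metadata, content, path (each "true" or "false").
reindex_command() {
  echo "java -jar alfresco-elasticsearch-reindexing-3.2.0-app.jar" \
    "--alfresco.reindex.metadataIndexingEnabled=$1" \
    "--alfresco.reindex.contentIndexingEnabled=$2" \
    "--alfresco.reindex.pathIndexingEnabled=$3"
}

# Metadata only: content and path indexing are disabled.
reindex_command true false false
```

The function only prints the command, so it can be reviewed before it is run against a real repository.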

Sample search queries to try out:

For Metadata Search: cm:name:'test', cm:author:admin, cm:title:'test'

For Path Search: PATH:"/app:company_home/st:sites/cm:test/cm:documentLibrary/*"

For Content Search: cm:content:'test'
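To run these queries outside the UI, one option is the repository's public Search REST API. The sketch below only builds the JSON request body. The helper name search_body is hypothetical, and the host, port and admin credentials in the commented curl call are assumptions for a local test install.

```shell
# Build a JSON body for the Search REST API from an AFTS query string.
search_body() {
  # Escape backslashes and double quotes so the query is valid inside a JSON string.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"query":{"query":"%s","language":"afts"}}\n' "$escaped"
}

# Print the request body for one of the sample metadata queries.
search_body "cm:name:'test'"

# Assumed local endpoint and credentials; adjust before use:
# curl -u admin:admin -H 'Content-Type: application/json' \
#   -d "$(search_body "cm:name:'test'")" \
#   http://localhost:8080/alfresco/api/-default-/public/search/versions/1/search
```

Escaping matters mainly for the PATH query, which contains double quotes of its own.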

2.2. Live-Indexing

There are 5 live indexing jars available in the ES connector distribution zip.

alfresco-elasticsearch-live-indexing-3.2.0-app.jar: This is an all-in-one jar file that indexes content, metadata and path for real-time data. It bundles the 4 live-indexing jar files specific to mediation, metadata, content, and path. Unlike the all-in-one reindexing jar, it gives us no control over what is indexed.

[Image]

When to use other live indexing jars? 

Use them when the business does not require full-text indexing (content indexing), and when deploying at scale.

To start alfresco-elasticsearch-live-indexing-mediation-3.2.0-app.jar, run the command below.

[Image]

alfresco-elasticsearch-live-indexing-metadata-3.2.0-app.jar: Indexes metadata only. To start it, run the command below.

[Image]

alfresco-elasticsearch-live-indexing-path-3.2.0-app.jar: Indexes path only.

[Image]

alfresco-elasticsearch-live-indexing-content-3.2.0-app.jar: Indexes content only.
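As a rough illustration of choosing between the individual jars, the sketch below prints one start command per live-indexing app and leaves out the content app unless asked for it. The helper name live_indexing_commands is hypothetical, the ActiveMQ and Elasticsearch connection properties are omitted, and it assumes the mediation app is started alongside whichever indexers are chosen.

```shell
# Print the start commands for the individual live-indexing apps.
# Argument: "true" to include content (full-text) indexing, anything else to skip it.
live_indexing_commands() {
  for part in mediation metadata path; do
    echo "java -jar alfresco-elasticsearch-live-indexing-$part-3.2.0-app.jar"
  done
  if [ "$1" = "true" ]; then
    echo "java -jar alfresco-elasticsearch-live-indexing-content-3.2.0-app.jar"
  fi
}

# Metadata and path only, no full-text indexing.
live_indexing_commands false
```

Each printed command would be run as its own process, which is what allows the apps to be placed on separate hosts.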

3. Deploying at Scale

3.1. Live-Indexing

When designing highly available systems, deploying at scale is essential. The diagram below shows the most optimized way of designing a highly available architecture.

[Image: Live-Indexing: Deploying at Scale]

The mediation component is a single point of failure, as it cannot be scaled up. We must therefore monitor the mediation component and, in case of a failure, run the reindexing app for the affected period.
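One way to cover the failure window is a date-bounded reindex. The sketch below assumes the reindexing app offers a reindexByDate job with alfresco.reindex.fromTime and alfresco.reindex.toTime values in yyyyMMddHHmm form. These names do not come from this post and should be checked against the Alfresco documentation for your version. The helper name recovery_command and the timestamps are hypothetical.

```shell
# Print a date-bounded reindexing command for an outage window.
# Arguments: start and end of the window, assumed to be in yyyyMMddHHmm form.
# ASSUMPTION: reindexByDate, fromTime and toTime are not confirmed by this post.
recovery_command() {
  echo "java -jar alfresco-elasticsearch-reindexing-3.2.0-app.jar" \
    "--alfresco.reindex.jobName=reindexByDate" \
    "--alfresco.reindex.fromTime=$1" \
    "--alfresco.reindex.toTime=$2"
}

# Example window: mediation was down between 09:00 and 11:00 on 1 January 2023.
recovery_command 202301010900 202301011100
```

Choosing a window slightly wider than the observed outage is a cautious default, since re-indexing a node that is already indexed should simply overwrite its entry.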

3.2. Re-Indexing

Re-indexing a large repository with a single re-index process can take a long time. With the two approaches below, you can scale the reindexing process vertically as well as horizontally.

3.2.1. Approach 1

In this approach we use multiple EC2 instances for horizontal scaling, and inside each instance we run multiple reindexing threads.

[Image: Re-Indexing: Approach 1]

Setting Up Re-Indexer Instance

  • Copy alfresco-elasticsearch-connector-distribution-3.2 onto each instance.
  • We ran 6 threads on one instance and 5 threads on the second instance. This can be changed as needed.
  • Run the command below once per thread, each time with a unique port number and its own reindex.fromId and reindex.toId range, for as many threads as the instance needs.
  • To fetch by IDs, set alfresco.reindex.jobName=reindexByIds, which indexes nodes in an interval of the database ALF_NODE.id column.

[Image]
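The per-thread setup can be scripted rather than typed by hand. The sketch below splits an ID interval into one range per thread and prints a command for each, with its own port. The helper name reindex_commands, the example ID range and the base port are hypothetical. Neighbouring ranges share their boundary ID, because this post does not say whether fromId and toId are inclusive, and connection settings are omitted.

```shell
# Print one reindexByIds command per thread, each with its own ID range and port.
# Arguments: MIN_ID MAX_ID THREADS BASE_PORT
reindex_commands() {
  min_id=$1; max_id=$2; threads=$3; base_port=$4
  # Ceiling division so the ranges together cover the whole interval.
  chunk=$(( (max_id - min_id + threads - 1) / threads ))
  i=0
  while [ "$i" -lt "$threads" ]; do
    from=$(( min_id + i * chunk ))
    if [ "$from" -ge "$max_id" ]; then break; fi
    to=$(( from + chunk ))
    if [ "$to" -gt "$max_id" ]; then to=$max_id; fi
    # Neighbouring ranges share one boundary ID (inclusive/exclusive is an assumption).
    echo "java -jar alfresco-elasticsearch-reindexing-3.2.0-app.jar" \
      "--alfresco.reindex.jobName=reindexByIds" \
      "--alfresco.reindex.fromId=$from" \
      "--alfresco.reindex.toId=$to" \
      "--server.port=$(( base_port + i ))"
    i=$(( i + 1 ))
  done
}

# Six threads on one instance, covering node IDs 0 to 600000, ports 9090 to 9095.
reindex_commands 0 600000 6 9090
```

On a second instance the same function can be called with the next ID interval, for example the maximum ALF_NODE.id of the first instance as the new minimum.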

3.2.2. Approach 2

This approach re-indexes using remote partitioning. More details can be found in the Alfresco Docs: https://docs.alfresco.com/search-enterprise/latest/admin/#alfresco-elasticsearch-connector

[Image]

To start the manager, execute the command below.

java -jar alfresco-elasticsearch-reindexing-3.2.0-app.jar \
  --alfresco.reindex.jobName=reindexByIds \
  --alfresco.reindex.partitioning.type=manager \
  --alfresco.reindex.pagesize=100 \
  --alfresco.reindex.batchSize=100 \
  --alfresco.reindex.fromId=0 \
  --alfresco.reindex.toId=10000 \
  --spring.batch.datasource.url=jdbc:postgresql://localhost:5432/alfresco \
  --spring.batch.datasource.username=alfresco \
  --spring.batch.datasource.password=alfresco \
  --spring.batch.datasource.driver-class-name=org.postgresql.Driver \
  --spring.datasource.url=jdbc:postgresql://localhost:5432/alfresco \
  --spring.datasource.username=alfresco \
  --spring.datasource.password=alfresco \
  --alfresco.reindex.partitioning.grid-size=20 \
  --spring.batch.drop.script=classpath:/org/springframework/batch/core/schema-drop-postgresql.sql \
  --spring.batch.schema.script=classpath:/org/springframework/batch/core/schema-postgresql.sql

To start a worker, execute the command below.

java -jar alfresco-elasticsearch-reindexing-3.2.0-app.jar \
  --alfresco.reindex.partitioning.type=worker \
  --alfresco.reindex.pagesize=100 \
  --alfresco.reindex.batchSize=100 \
  --alfresco.reindex.concurrentProcessors=2 \
  --spring.batch.datasource.url=jdbc:postgresql://localhost:5432/alfresco \
  --spring.batch.datasource.username=alfresco \
  --spring.batch.datasource.password=alfresco \
  --spring.batch.datasource.driver-class-name=org.postgresql.Driver \
  --spring.datasource.url=jdbc:postgresql://localhost:5432/alfresco \
  --spring.datasource.username=alfresco \
  --spring.datasource.password=alfresco \
  --spring.batch.drop.script=classpath:/org/springframework/batch/core/schema-drop-postgresql.sql \
  --spring.batch.schema.script=classpath:/org/springframework/batch/core/schema-postgresql.sql \
  --server.port=9091

Note: If you are re-indexing only metadata and/or path with the remote partitioning approach, make sure to set the related properties when executing the worker command.

4. Comparison of re-indexing approaches

 

Approach 1: Multi-threading

  • Pros: Less time-consuming; best suited to customers with larger repositories.
  • Cons: Considerable manual work is involved in setting up the threads. However, as re-indexing is a one-time process, this can largely be disregarded.

Approach 2: Remote Partitioning

  • Pros: Easy to manage. The number of workers/partitions can be easily managed by setting alfresco.reindex.partitioning.grid-size. The manager automatically assigns fromId and toId values to the worker nodes.
  • Cons: Slower, and therefore suited to customers with smaller repositories.

5. Reference: