angelborroy
Community Manager

Every document uploaded to an Alfresco repository that is configured for full-text search goes through a transformation pipeline: the binary content is converted to plain text, and that text is then pushed into the search index. This means that every content-indexed node generates work for the Transform Service. In enterprise deployments, this involves remote and asynchronous transforms via Transform Core workers and Transform Router. In community deployments, transforms run locally and synchronously within the repository process. When this is multiplied by thousands or millions of documents, the performance implications become significant.

This post provides a technical walkthrough of how content indexing works across the three Alfresco search subsystems (Search Services, Search and Insight Engine, and Search Enterprise) and how it can be tuned or disabled to reduce the load on the Transform infrastructure.

Architecture Overview

Before diving into configuration, it helps to understand the data flow from content upload to searchable index entry.

The Indexing Pipeline

Regardless of which search subsystem is used, the fundamental pipeline is the same:

  1. A document with binary content is created or updated in the Alfresco Repository.
  2. A text rendition is requested: the binary is sent to the Transform Service, which converts it to text/plain.
  3. The plain-text output is sent to the search engine, either Solr or Elasticsearch/OpenSearch, for indexing.
  4. The search engine stores that text together with metadata, enabling full-text queries.

The expensive step is the transformation to plain text. Transforming a large PDF or a complex Office document is CPU and memory intensive. Transforming thousands of them concurrently creates sustained load on Transform workers, queue backlogs, and memory pressure. Disabling content indexing for nodes where full-text search is unnecessary removes that transform work entirely.
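The effect of skipping content indexing can be illustrated with a toy model of the four steps. This is only a sketch: the Node class, the index_nodes function, and the placeholder text are hypothetical and do not correspond to any Alfresco API. It simply counts how many transform requests a batch of nodes would generate.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a repository node."""
    name: str
    content_indexed: bool = True


def index_nodes(nodes):
    """Toy model of the pipeline: returns (index_entries, transform_calls)."""
    index = []
    transform_calls = 0
    for node in nodes:
        # Metadata is always indexed in this model.
        entry = {"name": node.name, "content": None}
        if node.content_indexed:
            # Stands in for the expensive binary -> text/plain transform.
            transform_calls += 1
            entry["content"] = f"<text extracted from {node.name}>"
        index.append(entry)
    return index, transform_calls


if __name__ == "__main__":
    batch = [Node("contract.pdf"), Node("scan-0001.tiff", content_indexed=False)]
    entries, calls = index_nodes(batch)
    print(len(entries), "index entries,", calls, "transform request(s)")
```

Both nodes end up in the index, but only the node with content indexing enabled costs a transform.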

Search Services and Search and Insight Engine (Solr-based)

Search Services, available for both Community and Enterprise, and Search and Insight Engine, available for Enterprise, are both built on a customized Apache Solr architecture tightly coupled with the repository:

  • The Tracker Subsystem runs inside each Solr instance and periodically polls the repository for changes.
  • The subsystem includes several tracker types: MetadataTracker, ContentTracker, AclTracker, CascadeTracker, CommitTracker, and ModelTracker.
  • The ContentTracker is responsible for detecting nodes with new or changed content and triggering the text extraction pipeline.
  • Each Solr core or shard has a singleton instance of every tracker type, registered and scheduled at core startup.

The ContentTracker queries the repository for documents marked as dirty or new and requests text renditions for them. Those renditions are produced by the Transform Service in enterprise deployments or by local transforms in community deployments. Once the plain text is returned, Solr indexes it in the content field.

In newer versions, the traditional ContentTracker has been complemented by an AsyncContentTracker that integrates with the Transform Service via a message queue. This async model decouples content tracking from the synchronous repository polling loop, allowing text renditions to be produced asynchronously and consumed when ready. The repository can also be configured to store text renditions permanently, which avoids redundant transforms during re-indexing operations.

Search Enterprise (Elasticsearch/OpenSearch)

Search Enterprise decouples indexing from the repository using an event-driven, message-based architecture:

  • The Alfresco Repository emits node events to a durable ActiveMQ topic.
  • The repository itself applies a first level of event filtering through repo.event2.filter.* properties.
  • The Elasticsearch Connector subscribes to that topic.
  • Inside the connector, a component called the Mediator processes incoming events.
  • The Mediator applies the mediation filter and then dispatches messages to separate queues for metadata, content, and path indexing.

For the purpose of Transform Service load, the important point is that content indexing events are sent to a dedicated content event channel. If a node is excluded from content indexing by the mediation filter, no content message is produced and no transform request is generated.

Configuration Options to Control Content Indexing

There are several places where content indexing can be controlled depending on the search subsystem and the desired granularity.

Level 1: Repository Model Control with cm:indexControl

At repository level, Alfresco provides the cm:indexControl aspect, which allows indexing behaviour to be controlled on a per-node basis.

Two properties are especially relevant:

  • cm:isIndexed: controls whether the node is indexed at all.
  • cm:isContentIndexed: controls whether the binary content is transformed and indexed for full-text search.

If cm:isContentIndexed=false, the node metadata remains searchable, but the binary content is not transformed and no full-text content is added to the search index.
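For an individual node, one way to apply these properties is through the public REST API node update call. The following is a minimal sketch, assuming the v1 endpoint PUT /nodes/{nodeId} accepts aspectNames and properties in its body. The host, the node id, and the helper names are hypothetical, and the code only builds the request so that it can be inspected before being sent with an HTTP client and valid credentials.

```python
import json

# Assumed base path of the Alfresco public REST API (v1) for nodes.
NODES_API = "/alfresco/api/-default-/public/alfresco/versions/1/nodes"


def build_index_control_update(existing_aspects, content_indexed=False, indexed=True):
    """Build the JSON body that adds cm:indexControl and sets its two properties.

    The existing aspect names are kept because sending aspectNames is assumed
    to replace the node's aspect list rather than append to it.
    """
    aspects = list(existing_aspects)
    if "cm:indexControl" not in aspects:
        aspects.append("cm:indexControl")
    return {
        "aspectNames": aspects,
        "properties": {
            "cm:isIndexed": indexed,
            "cm:isContentIndexed": content_indexed,
        },
    }


def build_request(base_url, node_id, body):
    """Return (method, url, payload) for the node update call."""
    url = f"{base_url.rstrip('/')}{NODES_API}/{node_id}"
    return "PUT", url, json.dumps(body)


if __name__ == "__main__":
    body = build_index_control_update(["cm:auditable", "cm:titled"])
    # "example-node-id" and the host are placeholders, not real values.
    print(build_request("http://localhost:8080", "example-node-id", body))
```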

For a more scalable approach, a custom aspect can extend cm:indexControl with defaults already set and be made mandatory on a custom type:

<aspect name="my:noContentIndex">
    <title>Disable Content Indexing</title>
    <parent>cm:indexControl</parent>
    <overrides>
        <property name="cm:isIndexed">
            <default>true</default>
        </property>
        <property name="cm:isContentIndexed">
            <default>false</default>
        </property>
    </overrides>
</aspect>

<type name="my:scannedDocument">
    <title>Scanned Document</title>
    <parent>cm:content</parent>
    <mandatory-aspects>
        <aspect>my:noContentIndex</aspect>
    </mandatory-aspects>
</type>

Every node of type my:scannedDocument will automatically have content indexing disabled. Metadata such as name, title, and custom properties remains fully searchable.

One important limitation is that cm:indexControl is not hierarchical. It must be present on every individual node to be excluded. It does not propagate to child nodes in a folder hierarchy.

Level 2: Solr Core Properties for Search Services and Insight Engine

For a global approach that applies to all content across the repository, Solr core properties can be used.

Disable all content indexing globally

# solrcore.properties
alfresco.index.transformContent=false

When alfresco.index.transformContent is set to false, the ContentTracker does not request text renditions. Only metadata is indexed. This is the most impactful single setting for reducing Transform Service load in Solr-based deployments.

Disable metadata indexing for the content datatype

alfresco.ignore.datatype.1=d:content

This goes further by telling Solr to ignore the d:content datatype during metadata indexing as well. Combined with alfresco.index.transformContent=false, both content and content-related metadata are excluded.

Throttle ContentTracker load

Even when content indexing remains enabled, ContentTracker behaviour can be tuned:

alfresco.content.tracker.maxParallelism=32
alfresco.contentUpdateBatchSize=2000

Reducing alfresco.content.tracker.maxParallelism limits how many concurrent text rendition requests are generated. Reducing alfresco.contentUpdateBatchSize spreads transform work across more tracker cycles. These settings do not eliminate transform work, but they help smooth peaks.
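When these values need to be changed across several cores, a small script can help keep the edits consistent. The sketch below is a hypothetical helper, not an Alfresco tool: it rewrites properties-file text in memory, replaces keys that already exist, and appends missing ones. The values used in the example are illustrative, not recommendations.

```python
def set_properties(text, updates):
    """Return properties-file text with the given keys set.

    Comments, blank lines, and unrelated keys are preserved in place.
    Keys that are not present yet are appended at the end.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith(("#", "!"))
        if stripped and not is_comment and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                out.append(f"{key}={remaining.pop(key)}")
                continue
        out.append(line)
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    original = (
        "# solrcore.properties\n"
        "alfresco.content.tracker.maxParallelism=32\n"
        "alfresco.contentUpdateBatchSize=2000\n"
    )
    # Illustrative values only; suitable numbers depend on the deployment.
    print(set_properties(original, {
        "alfresco.content.tracker.maxParallelism": 8,
        "alfresco.contentUpdateBatchSize": 500,
    }))
```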

Verify tracker status

GET http://localhost:8983/solr/admin/cores?action=SUMMARY

The response includes tracker state such as:

<bool name="ContentTracker Enabled">true</bool>
<bool name="MetadataTracker Enabled">true</bool>
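For automated checks, the tracker flags can be pulled out of that response with a few lines of code. This is a minimal sketch that assumes the response is XML containing bool elements named as shown above. The surrounding envelope in SAMPLE is invented for illustration, and the function returns one flat dictionary, so it suits a response describing a single core.

```python
import xml.etree.ElementTree as ET

# Invented envelope; only the two <bool> elements come from the output above.
SAMPLE = """<response>
  <lst name="Summary">
    <lst name="alfresco">
      <bool name="ContentTracker Enabled">true</bool>
      <bool name="MetadataTracker Enabled">true</bool>
    </lst>
  </lst>
</response>"""


def tracker_flags(summary_xml):
    """Map tracker name -> enabled flag for every '<X> Enabled' bool element."""
    suffix = " Enabled"
    flags = {}
    for element in ET.fromstring(summary_xml).iter("bool"):
        name = element.get("name", "")
        if name.endswith(suffix):
            value = (element.text or "").strip().lower()
            flags[name[: -len(suffix)]] = value == "true"
    return flags


if __name__ == "__main__":
    print(tracker_flags(SAMPLE))
```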

It is also possible to disable all tracking at runtime:

GET http://localhost:8983/solr/admin/cores?action=disable-indexing

And re-enable it:

GET http://localhost:8983/solr/admin/cores?action=enable-indexing

After changing alfresco.index.transformContent, a full re-index is required for the new behaviour to apply to already indexed content.

Level 3: Mediation Filter in Search Enterprise

Search Enterprise provides a mediation filter that can exclude nodes or fields from indexing.

The filter supports four categories:

  • nodeTypes: skip the node entirely. No metadata, content, or path indexing.
  • contentNodeTypes: keep metadata and path indexing, but skip content indexing.
  • nodeAspects: skip any node that has one of these aspects.
  • fields: remove specific metadata fields before indexing.

The key setting for Transform Service load reduction is contentNodeTypes.
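The way the four categories interact can be sketched as a small decision function. This models the behaviour described above and is not the connector's actual code: the class, the function names, and the channel labels are hypothetical, and matching against parent types is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class MediationFilter:
    """Hypothetical in-memory form of the four filter categories."""
    node_types: set = field(default_factory=set)
    content_node_types: set = field(default_factory=set)
    node_aspects: set = field(default_factory=set)
    fields: set = field(default_factory=set)


def channels_for(type_chain, aspects, flt):
    """Return the indexing channels a node would be dispatched to.

    type_chain is the node's type followed by its parent types. Matching
    on parent types is an assumption of this sketch.
    """
    types = set(type_chain)
    if types & flt.node_types or set(aspects) & flt.node_aspects:
        return set()  # node skipped entirely
    channels = {"metadata", "path"}
    if not types & flt.content_node_types:
        channels.add("content")  # only this channel leads to a transform
    return channels


def filter_fields(properties, flt):
    """Drop metadata fields listed under 'fields' before indexing."""
    return {key: value for key, value in properties.items() if key not in flt.fields}


if __name__ == "__main__":
    flt = MediationFilter(
        content_node_types={"my:scannedDocument"},
        node_aspects={"sys:hidden"},
        fields={"cmis:changeToken"},
    )
    print(channels_for(["my:scannedDocument", "cm:content"], [], flt))
    print(channels_for(["cm:content"], ["sys:hidden"], flt))
```

In this model, a node matched by contentNodeTypes still reaches the metadata and path channels, which is why metadata search keeps working while no transform is requested.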

Disable content indexing for specific types

mediation:
  nodeTypes:
  contentNodeTypes:
    - my:scannedDocument
    - my:generatedReport
    - my:logFile
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken

Disable content indexing for all content

mediation:
  nodeTypes:
  contentNodeTypes:
    - cm:content
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken

Because all content types extend cm:content, adding cm:content to contentNodeTypes disables full-text indexing for every document in the repository while preserving metadata search. No content messages are dispatched to the content event channel, so zero transform requests are generated for content indexing.

Configure filter file location

  • Application property: alfresco.mediation.filter-file=classpath:mediation-filter.yml
  • System property: -Dalfresco.mediation.filter-file=/path/to/custom-mediation-filter.yml
  • Environment variable: ALFRESCO_MEDIATION_FILTER_FILE

Both Live Indexing and Re-indexing must point to the same filter configuration. If they use different filters, the index state becomes inconsistent.

Scale the connector when content indexing stays enabled

ACTIVEMQ_POOL_ENABLED=true
ACTIVEMQ_POOL_SIZE=100

This increases the number of consumers processing metadata, content, and path messages in parallel.

How Content Indexing Impacts Transform Service Performance

When content indexing is enabled, every new or updated document triggers a transform request for a text rendition. That sequence usually follows these steps:

  1. A transform request is issued by the search subsystem.
  2. The Transform Router or the local transform subsystem chooses the appropriate transformer.
  3. A Transform Core worker performs the actual conversion.
  4. The resulting plain text is returned for indexing.

The transform execution step is the expensive one in CPU and memory terms.

Resource consumption patterns

Document Type                           | Typical Transform Cost | Notes
Plain text (.txt, .csv, .xml)           | Very low               | Minimal processing, almost passthrough
Office documents (.docx, .xlsx, .pptx)  | Medium                 | Requires LibreOffice-based extraction
PDF documents                           | Medium to high         | Depends on page count and OCR requirements
Images with OCR                         | Very high              | OCR is extremely CPU-intensive
Large binaries (.zip, .iso, video)      | Wasted                 | Usually fail to transform or produce empty text

Quantifying the impact

Consider a repository receiving 10,000 new documents per day. If 40% are scanned TIFF files processed with OCR and each takes about 30 seconds of CPU time, that is 4,000 documents × 30 s = 120,000 CPU-seconds, or approximately 33 hours of CPU time per day just for content indexing transforms. Disabling content indexing for those nodes removes that entire workload.
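The same estimate can be reproduced for other workloads with a short calculation. The function name is hypothetical, and the inputs are the figures from the example, not measurements.

```python
def daily_transform_cpu_hours(docs_per_day, share, cpu_seconds_per_doc):
    """CPU hours per day spent on content indexing transforms for one document class."""
    return docs_per_day * share * cpu_seconds_per_doc / 3600


if __name__ == "__main__":
    # 10,000 documents per day, 40% of them OCR-processed, 30 CPU-seconds each.
    hours = daily_transform_cpu_hours(10_000, 0.40, 30)
    print(f"{hours:.1f} CPU hours per day")  # prints "33.3 CPU hours per day"
```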

The Transform Service will still handle other renditions such as thumbnails and previews, but full-text extraction transforms are typically among the heaviest operations in the pipeline.

Observable symptoms of transform overload

  • Growing ActiveMQ queue depth on transform-related queues
  • High CPU utilization on Transform Core worker containers
  • Increased memory consumption on workers processing large documents
  • Indexing lag between upload and searchability
  • Transform timeouts that leave content unindexed

What to monitor before and after

  • Transform Service CPU usage, memory usage, active worker count, queue depth, average transform duration, and timeout rate
  • Search indexing throughput, index size growth rate, and indexing lag
  • Repository node creation and update rates to ensure the workload is comparable

Conclusion

Content indexing is a primary driver of Transform Service workload in Alfresco deployments. Each content-indexed node generates a transform request to convert binary content to plain text. By understanding the configuration options available at repository level, Solr level, and Search Enterprise level, it is possible to control precisely which documents undergo text extraction and which are indexed by metadata alone.

In many repositories, selectively disabling content indexing can substantially reduce Transform Service CPU consumption while preserving the ability for users to find documents by name, type, title, date, and custom properties. For systems under heavy transform load, this is often one of the most effective optimization levers available.
