Every document uploaded to an Alfresco repository that is configured for full-text search goes through a transformation pipeline: the binary content is converted to plain text, and that text is then pushed into the search index. This means that every content-indexed node generates work for the Transform Service. In enterprise deployments, this involves remote and asynchronous transforms via Transform Core workers and Transform Router. In community deployments, transforms run locally and synchronously within the repository process. When this is multiplied by thousands or millions of documents, the performance implications become significant.
This post provides a technical walkthrough of how content indexing works across the three Alfresco search subsystems (Search Services, Search and Insight Engine, and Search Enterprise) and how it can be tuned or disabled to reduce the load on Transform infrastructure.
Before diving into configuration, it helps to understand the data flow from content upload to searchable index entry.
Regardless of which search subsystem is used, the fundamental pipeline is the same: content is uploaded, the binary content is transformed to text/plain, and the resulting text is written to the search index.
The expensive step is the transformation to plain text. Transforming a large PDF or a complex Office document is CPU and memory intensive. Transforming thousands of them concurrently creates sustained load on Transform workers, queue backlogs, and memory pressure. Disabling content indexing for nodes where full-text search is unnecessary removes that transform work entirely.
Search Services, available for both Community and Enterprise, and Search and Insight Engine, available for Enterprise, are both built on a customized Apache Solr architecture tightly coupled with the repository:
- A Tracker Subsystem runs inside each Solr instance and periodically polls the repository for changes.
- The trackers are MetadataTracker, ContentTracker, AclTracker, CascadeTracker, CommitTracker, and ModelTracker.
- ContentTracker is responsible for detecting nodes with new or changed content and triggering the text extraction pipeline.
The ContentTracker queries the repository for documents marked as dirty or new and requests text renditions for them. Those renditions are produced by the Transform Service in enterprise deployments or by local transforms in community deployments. Once the plain text is returned, Solr indexes it in the content field.
In newer versions, the traditional ContentTracker has been complemented by an AsyncContentTracker that integrates with the Transform Service via a message queue. This async model decouples content tracking from the synchronous repository polling loop, allowing text renditions to be produced asynchronously and consumed when ready. The repository can also be configured to store text renditions permanently, which avoids redundant transforms during re-indexing operations.
Search Enterprise decouples indexing from the repository using an event-driven, message-based architecture:
- Repository events can be filtered with the repo.event2.filter.* properties.
- The Mediator processes incoming events.
For the purpose of Transform Service load, the important point is that content indexing events are sent to a dedicated content event channel. If a node is excluded from content indexing by the mediation filter, no content message is produced and no transform request is generated.
There are several places where content indexing can be controlled depending on the search subsystem and the desired granularity.
At the repository level, Alfresco provides the cm:indexControl aspect, which allows indexing behaviour to be controlled on a per-node basis.
Two properties are especially relevant:
- cm:isIndexed: controls whether the node is indexed at all.
- cm:isContentIndexed: controls whether the binary content is transformed and indexed for full-text search.
If cm:isContentIndexed=false, the node metadata remains searchable, but the binary content is not transformed and no full-text content is added to the search index.
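As a minimal sketch, assuming nodes are updated through the public Alfresco REST API, the request body for switching off content indexing on a single node could be built as shown below. The helper name `build_index_control_update` is hypothetical. The function only builds the JSON body and leaves the HTTP call (a PUT against the node) to the caller. It also assumes that an update which supplies `aspectNames` replaces the node's whole aspect list, so the existing aspects are passed in and preserved.

```python
import json


def build_index_control_update(existing_aspects, index_content=False, index_node=True):
    """Build a node-update body that applies cm:indexControl.

    existing_aspects: aspect names already on the node. They are kept,
    on the assumption that the update replaces the whole aspect list.
    """
    aspects = list(existing_aspects)
    if "cm:indexControl" not in aspects:
        aspects.append("cm:indexControl")
    return {
        "aspectNames": aspects,
        "properties": {
            "cm:isIndexed": index_node,
            "cm:isContentIndexed": index_content,
        },
    }


if __name__ == "__main__":
    # Example aspect list is illustrative only.
    body = build_index_control_update(["cm:auditable", "cm:titled"])
    print(json.dumps(body, indent=2))
```

Keeping the body construction separate from the HTTP call makes it easy to review exactly what would be sent before touching a live repository.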
For a more scalable approach, a custom aspect can extend cm:indexControl with defaults already set and be made mandatory on a custom type:
<!-- Goes inside the <aspects> section of the custom model -->
<aspect name="my:noContentIndex">
    <title>Disable Content Indexing</title>
    <parent>cm:indexControl</parent>
    <overrides>
        <property name="cm:isIndexed">
            <default>true</default>
        </property>
        <property name="cm:isContentIndexed">
            <default>false</default>
        </property>
    </overrides>
</aspect>
<!-- Goes inside the <types> section of the custom model -->
<type name="my:scannedDocument">
    <title>Scanned Document</title>
    <parent>cm:content</parent>
    <mandatory-aspects>
        <aspect>my:noContentIndex</aspect>
    </mandatory-aspects>
</type>
Every node of type my:scannedDocument will automatically have content indexing disabled. Metadata such as name, title, and custom properties remains fully searchable.
One important limitation is that cm:indexControl is not hierarchical. It must be present on every individual node to be excluded. It does not propagate to child nodes in a folder hierarchy.
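Because the aspect does not propagate, excluding a whole folder means visiting every descendant. The sketch below shows only the traversal. It assumes a caller-supplied `list_children` function (hypothetical, for example a thin wrapper around a repository API) that returns `(node_id, is_folder)` pairs for a folder; the function name `iter_descendant_documents` is also introduced here for illustration.

```python
from collections import deque


def iter_descendant_documents(root_id, list_children):
    """Yield the id of every non-folder node below root_id (breadth-first).

    list_children(folder_id) is a caller-supplied, hypothetical function
    returning an iterable of (node_id, is_folder) tuples.
    """
    queue = deque([root_id])
    while queue:
        folder_id = queue.popleft()
        for node_id, is_folder in list_children(folder_id):
            if is_folder:
                queue.append(node_id)
            else:
                yield node_id


if __name__ == "__main__":
    # In-memory stand-in for a folder tree, for illustration only.
    tree = {
        "root": [("a.pdf", False), ("sub", True)],
        "sub": [("b.tif", False)],
    }
    for doc in iter_descendant_documents("root", lambda f: tree.get(f, [])):
        print("would set cm:isContentIndexed=false on", doc)
```

Documents added to the folder later are not covered by a one-off walk like this, which is one reason the type-level approach with a mandatory aspect scales better.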
For a global approach that applies to all content across the repository, Solr core properties can be used.
# solrcore.properties
alfresco.index.transformContent=false
When alfresco.index.transformContent is set to false, the ContentTracker does not request text renditions. Only metadata is indexed. This is the most impactful single setting for reducing Transform Service load in Solr-based deployments.
alfresco.ignore.datatype.1=d:content
This goes further by telling Solr to ignore the d:content datatype during metadata indexing as well. Combined with alfresco.index.transformContent=false, both content and content-related metadata are excluded.
Even when content indexing remains enabled, ContentTracker behaviour can be tuned:
alfresco.content.tracker.maxParallelism=32
alfresco.contentUpdateBatchSize=2000
Reducing maxParallelism limits how many concurrent text rendition requests are generated. Reducing contentUpdateBatchSize spreads transform work across more cycles. These settings do not eliminate transform work, but they help smooth peaks.
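Changes like these are easy to script. A minimal sketch follows, assuming the core's solrcore.properties is a plain key=value file. The helper name `set_properties` is hypothetical, and the function works on the file's text rather than on disk, so the result can be inspected before it is applied to a real core.

```python
def set_properties(text, updates):
    """Return properties text with the given keys set.

    Existing keys are rewritten in place; missing keys are appended.
    Comments and blank lines are preserved.
    """
    seen = set()
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in updates:
                out.append(f"{key}={updates[key]}")
                seen.add(key)
                continue
        out.append(line)
    out.extend(f"{k}={v}" for k, v in updates.items() if k not in seen)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    original = (
        "# solrcore.properties\n"
        "alfresco.content.tracker.maxParallelism=32\n"
        "alfresco.contentUpdateBatchSize=2000\n"
    )
    # The lower values here are illustrative, not recommendations.
    print(set_properties(original, {
        "alfresco.content.tracker.maxParallelism": 8,
        "alfresco.contentUpdateBatchSize": 500,
    }))
```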
Tracker state can be checked through the Solr admin API:
GET http://localhost:8983/solr/admin/cores?action=SUMMARY
The response includes tracker state such as:
<bool name="ContentTracker Enabled">true</bool>
<bool name="MetadataTracker Enabled">true</bool>
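A small script can pull those flags out of the response. The sketch below assumes only what the fragment above shows, namely `<bool>` elements whose `name` ends in `Enabled`. The wrapper elements in the sample are placeholders, and `tracker_flags` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def tracker_flags(summary_xml):
    """Map '<Tracker> Enabled' bool elements to Python booleans."""
    root = ET.fromstring(summary_xml.strip())
    flags = {}
    for elem in root.iter("bool"):
        name = elem.get("name", "")
        if name.endswith(" Enabled"):
            tracker = name[: -len(" Enabled")]
            flags[tracker] = (elem.text or "").strip().lower() == "true"
    return flags


if __name__ == "__main__":
    # The wrapper structure is an assumption; only the two <bool>
    # lines come from the response fragment shown in the text.
    sample = """
    <response>
      <lst name="Summary">
        <bool name="ContentTracker Enabled">true</bool>
        <bool name="MetadataTracker Enabled">true</bool>
      </lst>
    </response>
    """
    print(tracker_flags(sample))
```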
It is also possible to disable all tracking at runtime:
GET http://localhost:8983/solr/admin/cores?action=disable-indexing
And re-enable it:
GET http://localhost:8983/solr/admin/cores?action=enable-indexing
After changing alfresco.index.transformContent, a full re-index is required for the new behaviour to apply to already indexed content.
Search Enterprise provides a mediation filter that can exclude nodes or fields from indexing.
The filter supports four categories:
- nodeTypes: skip the node entirely. No metadata, content, or path indexing.
- contentNodeTypes: keep metadata and path indexing, but skip content indexing.
- nodeAspects: skip any node that has one of these aspects.
- fields: remove specific metadata fields before indexing.
The key setting for Transform Service load reduction is contentNodeTypes.
mediation:
  nodeTypes:
  contentNodeTypes:
    - my:scannedDocument
    - my:generatedReport
    - my:logFile
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
To disable content indexing for every document, list cm:content itself:
mediation:
  nodeTypes:
  contentNodeTypes:
    - cm:content
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
Because all content types extend cm:content, adding cm:content to contentNodeTypes disables full-text indexing for every document in the repository while preserving metadata search. No content messages are dispatched to the content event channel, so zero transform requests are generated for content indexing.
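The filter semantics described above can be modelled in a few lines. This is an illustrative sketch, not Search Enterprise's implementation: the function `indexing_channels`, the `PARENTS` type map, and the assumption that a listed type also matches its subtypes (as the cm:content example implies) are all introduced here for illustration.

```python
# Hypothetical type hierarchy: child type -> parent type.
PARENTS = {
    "my:scannedDocument": "cm:content",
    "my:invoice": "cm:content",
    "cm:content": "cm:cmobject",
}


def type_chain(node_type):
    """Return the type followed by its ancestors."""
    chain = []
    while node_type is not None:
        chain.append(node_type)
        node_type = PARENTS.get(node_type)
    return chain


def indexing_channels(node_type, aspects, mediation):
    """Return the set of channels (metadata, content, path) still indexed."""
    chain = type_chain(node_type)
    # An empty YAML key loads as None, so fall back to an empty list.
    skipped_types = mediation.get("nodeTypes") or []
    skipped_aspects = mediation.get("nodeAspects") or []
    content_skipped = mediation.get("contentNodeTypes") or []

    if any(t in skipped_types for t in chain):
        return set()
    if any(a in skipped_aspects for a in aspects):
        return set()
    channels = {"metadata", "content", "path"}
    if any(t in content_skipped for t in chain):
        channels.discard("content")
    return channels


if __name__ == "__main__":
    mediation = {
        "nodeTypes": None,
        "contentNodeTypes": ["cm:content"],
        "nodeAspects": ["sys:hidden"],
    }
    print(sorted(indexing_channels("my:invoice", [], mediation)))
```

With cm:content listed, any subtype in the map loses only its content channel, which is the case where no transform request is generated.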
The filter file location can be set in any of these ways:
- Application property: alfresco.mediation.filter-file=classpath:mediation-filter.yml
- JVM system property: -Dalfresco.mediation.filter-file=/path/to/custom-mediation-filter.yml
- Environment variable: ALFRESCO_MEDIATION_FILTER_FILE
Both Live Indexing and Re-indexing must point to the same filter configuration. If they use different filters, the index state becomes inconsistent.
ACTIVEMQ_POOL_ENABLED=true
ACTIVEMQ_POOL_SIZE=100
This increases the number of consumers processing metadata, content, and path messages in parallel.
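When content indexing is enabled, every new or updated document triggers a transform request for a text rendition. The sequence is the one described earlier: the search subsystem detects the change, a text rendition is requested, the transform executes, and the resulting text is indexed.
The transform execution step is the expensive one in CPU and memory terms.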
| Document Type | Typical Transform Cost | Notes |
|---|---|---|
| Plain text (.txt, .csv, .xml) | Very low | Minimal processing, almost passthrough |
| Office documents (.docx, .xlsx, .pptx) | Medium | Requires LibreOffice-based extraction |
| PDF documents | Medium to high | Depends on page count and OCR requirements |
| Images with OCR | Very high | OCR is extremely CPU-intensive |
| Large binaries (.zip, .iso, video) | Wasted | Usually fail to transform or produce empty text |
Consider a repository receiving 10,000 new documents per day. If 40% are scanned TIFF files processed with OCR and each takes about 30 seconds of CPU time, that results in approximately 33 hours of CPU time per day just for content indexing transforms. Disabling content indexing for those nodes removes that entire workload.
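The estimate is simple enough to rerun with local figures. The function below is a back-of-the-envelope sketch; its name and parameters are chosen here for illustration, and it assumes every affected document costs the same CPU time.

```python
def daily_transform_cpu_hours(docs_per_day, affected_fraction, cpu_seconds_per_doc):
    """Estimate CPU hours per day spent on content-indexing transforms."""
    affected_docs = docs_per_day * affected_fraction
    return affected_docs * cpu_seconds_per_doc / 3600


if __name__ == "__main__":
    # 10,000 docs * 40% = 4,000 docs; 4,000 * 30 s = 120,000 s of CPU time.
    hours = daily_transform_cpu_hours(10_000, 0.40, 30)
    print(f"{hours:.1f} CPU hours per day")
```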
The Transform Service will still handle other renditions such as thumbnails and previews, but full-text extraction transforms are typically among the heaviest operations in the pipeline.
Content indexing is a primary driver of Transform Service workload in Alfresco deployments. Each content-indexed node generates a transform request to convert binary content to plain text. By understanding the configuration options available at repository level, Solr level, and Search Enterprise level, it is possible to control precisely which documents undergo text extraction and which are indexed by metadata alone.
In many repositories, selectively disabling content indexing can substantially reduce Transform Service CPU consumption while preserving the ability for users to find documents by name, type, title, date, and custom properties. For systems under heavy transform load, this is often one of the most effective optimization levers available.