Speed up Community Alfresco with Lucene repository properties - help needed

joko71 — Tue, 13 Dec 2022 08:21:19 GMT

Hi,

this is an old product, started on Community 2.1, reached Community 4.2.c. We could not advance further because our solution was too much interconnected with the old Alfresco architecture.

We run it on CentOS 5.6 VM (all on single machine) with 4 processors and 32 Gb memory. We have more than 2 million documents and 12 Gb Lucene indexes, servicing around 450 users. System is a bit sluggish, we're trying to speed it up, and one of the ways is thru repository setup.

We left mostly repository properties untouched, but I think we could get some improvement with better Lucene settings. Yes, we don't use Solr. I don't know if that would work better, either putting it on the same or separate machine.

We disabled content indexing in our content model, and I was surprised to find this

lucene.indexer.contentIndexingEnabled=true

in our properties. We can safely put it to false, yes?

Our documents have no more than 150 fields.

I'm interested in stop words and analyzers. That could help.

What about query cache? Our users have some query repetition, but not much. Is it better to have it on or off regarding performance?

Regarding index files, we have one big chunk 10 Gb big. Is there a way to tell Lucene to divide it in smaller files. Maybe I'm wrong, but wouldn't be easier to Lucene to work with smaller files?

And merge parameters. How much they're important while production is used by users, and how much at night when index business happens?

Well, I hope you can help us. You will make us and our users very happy with your help.

Here's our repository properties (some entries were removed due to post length, assume default values if needed):

# Repository configuration

repository.name=Main Repository

# Directory configuration

dir.root=/var/alfresco/data

dir.contentstore=${dir.root}/contentstore
dir.contentstore.deleted=${dir.root}/contentstore.deleted

# The location of cached content
dir.cachedcontent=${dir.root}/cachedcontent

dir.auditcontentstore=${dir.root}/audit.contentstore

# The value for the maximum permitted size in bytes of all content.
# No value (or a negative long) will be taken to mean that no limit should be applied.
# See content-services-context.xml
system.content.maximumFileSizeLimit=

# The location for lucene index files
dir.indexes=${dir.root}/lucene-indexes

# The location for index backups
dir.indexes.backup=${dir.root}/backup-lucene-indexes

# The location for lucene index locks
dir.indexes.lock=${dir.indexes}/locks

#Directory to find external license
dir.license.external=.
# Spring resource location of external license files
location.license.external=file://${dir.license.external}/*.lic
# Spring resource location of embedded license files    
location.license.embedded=/WEB-INF/alfresco/license/*.lic
# Spring resource location of license files on shared classpath
location.license.shared=classpath*:/alfresco/extension/license/*.lic

# WebDAV initialization properties
system.webdav.servlet.enabled=true
system.webdav.url.path.prefix=
system.webdav.storeName=${protocols.storeName}
system.webdav.rootPath=${protocols.rootPath}
system.webdav.activities.enabled=true
# File name patterns that trigger rename shuffle detection
# pattern is used by move - tested against full path after it has been lower cased.
system.webdav.renameShufflePattern=(.*/\\..*)|(.*[a-f0-9]{8}+$)|(.*\\.tmp$)|(.*\\.wbk$)|(.*\\.bak$)|(.*\\~$)


# Is the JBPM Deploy Process Servlet enabled?
# Default is false. Should not be enabled in production environments as the
# servlet allows unauthenticated deployment of new workflows.
system.workflow.deployservlet.enabled=true

# Sets the location for the JBPM Configuration File
system.workflow.jbpm.config.location=classpath:org/alfresco/repo/workflow/jbpm/jbpm.cfg.xml 

# Determines if JBPM workflow definitions are shown.
# Default is false. This controls the visibility of JBPM 
# workflow definitions from the getDefinitions and 
# getAllDefinitions WorkflowService API but still allows 
# any in-flight JBPM workflows to be completed.
system.workflow.engine.jbpm.definitions.visible=true

#Determines if Activiti definitions are visible
system.workflow.engine.activiti.definitions.visible=true

# Determines if the JBPM engine is enabled
system.workflow.engine.jbpm.enabled=true

# Determines if the Activiti engine is enabled
system.workflow.engine.activiti.enabled=true

index.subsystem.name=lucene

# ######################################### #
# Index Recovery and Tracking Configuration #
# ######################################### #
#
# Recovery types are:
#    NONE:     Ignore
#    VALIDATE: Checks that the first and last transaction for each store is represented in the indexes
#    AUTO:     Validates and auto-recovers if validation fails
#    FULL:     Full index rebuild, processing all transactions in order.  The server is temporarily suspended.
index.recovery.mode=AUTO
# FULL recovery continues when encountering errors
index.recovery.stopOnError=false
index.recovery.maximumPoolSize=5
# Set the frequency with which the index tracking is triggered.
# For more information on index tracking in a cluster:
#    http://wiki.alfresco.com/wiki/High_Availability_Configuration_V1.4_to_V2.1#Version_1.4.5.2C_2.1.1_and_later
# By default, this is effectively never, but can be modified as required.
#    Examples:
#       Never:                   * * * * * ? 2099
#       Once every five seconds: 0/5 * * * * ?
#       Once every two seconds : 0/2 * * * * ?
#       See http://www.quartz-scheduler.org/docs/tutorials/crontrigger.html
index.tracking.cronExpression=0/5 * * * * ?
index.tracking.adm.cronExpression=${index.tracking.cronExpression}
index.tracking.avm.cronExpression=${index.tracking.cronExpression}
# Other properties.
index.tracking.maxTxnDurationMinutes=10
index.tracking.reindexLagMs=1000
index.tracking.maxRecordSetSize=1000
index.tracking.maxTransactionsPerLuceneCommit=100
index.tracking.disableInTransactionIndexing=false
# Index tracking information of a certain age is cleaned out by a scheduled job.
# Any clustered system that has been offline for longer than this period will need to be seeded
# with a more recent backup of the Lucene indexes or the indexes will have to be fully rebuilt.
# Use -1 to disable purging.  This can be switched on at any stage.
index.tracking.minRecordPurgeAgeDays=30
# Unused transactions will be purged in chunks determined by commit time boundaries. 'index.tracking.purgeSize' specifies the size
# of the chunk (in ms). Default is a couple of hours.
index.tracking.purgeSize=7200000

# Reindexing of missing content is by default 'never' carried out.
# The cron expression below can be changed to control the timing of this reindexing.
# Users of Enterprise Alfresco can configure this cron expression via JMX without a server restart.
# Note that if alfresco.cluster.name is not set, then reindexing will not occur.
index.reindexMissingContent.cronExpression=* * * * * ? 2099

# Change the failure behaviour of the configuration checker
system.bootstrap.config_check.strict=true

#
# How long should shutdown wait to complete normally before 
# taking stronger action and calling System.exit()
# in ms, 10,000 is 10 seconds
#
shutdown.backstop.timeout=10000
shutdown.backstop.enabled=false

# Server Single User Mode
# note:
#   only allow named user (note: if blank or not set then will allow all users)
#   assuming maxusers is not set to 0
#server.singleuseronly.name=admin

# Server Max Users - limit number of users with non-expired tickets
# note: 
#   -1 allows any number of users, assuming not in single-user mode
#   0 prevents further logins, including the ability to enter single-user mode
server.maxusers=-1

# The Cron expression controlling the frequency with which the OpenOffice connection is tested
openOffice.test.cronExpression=0 * * * * ?

#
# Disable all shared caches (mutable and immutable)
#    These properties are used for diagnostic purposes
system.cache.disableMutableSharedCaches=false
system.cache.disableImmutableSharedCaches=false

# The maximum capacity of the parent assocs cache (the number of nodes whose parents can be cached)
system.cache.parentAssocs.maxSize=130000

# The average number of parents expected per cache entry. This parameter is multiplied by the above
# value to compute a limit on the total number of cached parents, which will be proportional to the
# cache's memory usage. The cache will be pruned when this limit is exceeded to avoid excessive
# memory usage.
system.cache.parentAssocs.limitFactor=8

#
# Properties to limit resources spent on individual searches
#
# The maximum time spent pruning results
system.acl.maxPermissionCheckTimeMillis=10000
# The maximum number of search results to perform permission checks against
system.acl.maxPermissionChecks=1000

# The maximum number of filefolder list results
system.filefolderservice.defaultListMaxResults=5000

# Properties to control read permission evaluation for acegi
system.readpermissions.optimise=true
system.readpermissions.bulkfetchsize=1000

#
# Manually control how the system handles maximum string lengths.
# Any zero or negative value is ignored.
# Only change this after consulting support or reading the appropriate Javadocs for
# org.alfresco.repo.domain.schema.SchemaBootstrap for V2.1.2
system.maximumStringLength=-1

#
# Limit hibernate session size by trying to amalgamate events for the L2 session invalidation
# - hibernate works as is up to this size 
# - after the limit is hit events that can be grouped invalidate the L2 cache by type and not instance
# events may not group if there are post action listener registered (this is not the case with the default distribution)
system.hibernateMaxExecutions=20000

#
# Determine if modification timestamp propagation from child to parent nodes is respected or not.
# Even if 'true', the functionality is only supported for child associations that declare the
# 'propagateTimestamps' element in the dictionary definition.
system.enableTimestampPropagation=true

#
# Decide if content should be removed from the system immediately after being orphaned.
# Do not change this unless you have examined the impact it has on your backup procedures.
system.content.eagerOrphanCleanup=false
# The number of days to keep orphaned content in the content stores.
#    This has no effect on the 'deleted' content stores, which are not automatically emptied.
system.content.orphanProtectDays=14
# The action to take when a store or stores fails to delete orphaned content
#    IGNORE: Just log a warning.  The binary remains and the record is expunged
#    KEEP_URL: Log a warning and create a URL entry with orphan time 0.  It won't be processed or removed.
system.content.deletionFailureAction=IGNORE
# The CRON expression to trigger the deletion of resources associated with orphaned content.
system.content.orphanCleanup.cronExpression=0 0 4 * * ?
# The CRON expression to trigger content URL conversion.  This process is not intesive and can
#    be triggered on a live system.  Similarly, it can be triggered using JMX on a dedicated machine.
system.content.contentUrlConverter.cronExpression=* * * * * ? 2099
system.content.contentUrlConverter.threadCount=2
system.content.contentUrlConverter.batchSize=500
system.content.contentUrlConverter.runAsScheduledJob=false

# #################### #
# Lucene configuration #
# #################### #
#
# Millisecond threshold for text transformations
# Slower transformers will force the text extraction to be asynchronous
#
lucene.maxAtomicTransformationTime=100
#
# The maximum number of clauses that are allowed in a lucene query 
#
lucene.query.maxClauses=10000
#
# The size of the queue of nodes waiting for index
# Events are generated as nodes are changed, this is the maximum size of the queue used to coalesce event
# When this size is reached the lists of nodes will be indexed
#
# http://issues.alfresco.com/browse/AR-1280:  Setting this high is the workaround as of 1.4.3. 
#
lucene.indexer.batchSize=1000000
fts.indexer.batchSize=1000
#
# Index cache sizes
#
lucene.indexer.cacheEnabled=true
lucene.indexer.maxDocIdCacheSize=100000
lucene.indexer.maxDocumentCacheSize=100
lucene.indexer.maxIsCategoryCacheSize=-1
lucene.indexer.maxLinkAspectCacheSize=10000
lucene.indexer.maxParentCacheSize=100000
lucene.indexer.maxPathCacheSize=100000
lucene.indexer.maxTypeCacheSize=10000
#
# Properties for merge (not this does not affect the final index segment which will be optimised) 
# Max merge docs only applies to the merge process not the resulting index which will be optimised.
#
lucene.indexer.mergerMaxMergeDocs=1000000
lucene.indexer.mergerMergeFactor=5
lucene.indexer.mergerMaxBufferedDocs=-1
#lucene.indexer.mergerRamBufferSizeMb=16
lucene.indexer.mergerRamBufferSizeMb=20

#
# Properties for delta indexes (not this does not affect the final index segment which will be optimised) 
# Max merge docs only applies to the index building process not the resulting index which will be optimised.
#
lucene.indexer.writerMaxMergeDocs=1000000
lucene.indexer.writerMergeFactor=5
lucene.indexer.writerMaxBufferedDocs=-1
#lucene.indexer.writerRamBufferSizeMb=16
lucene.indexer.writerRamBufferSizeMb=20

#
# Target number of indexes and deltas in the overall index and what index size to merge in memory
#
lucene.indexer.mergerTargetIndexCount=8
lucene.indexer.mergerTargetOverlayCount=5
lucene.indexer.mergerTargetOverlaysBlockingFactor=2
lucene.indexer.maxDocsForInMemoryMerge=60000
lucene.indexer.maxRamInMbForInMemoryMerge=16
lucene.indexer.maxDocsForInMemoryIndex=60000
#lucene.indexer.maxRamInMbForInMemoryIndex=16
lucene.indexer.maxRamInMbForInMemoryIndex=20

#
# Other lucene properties
#
lucene.indexer.termIndexInterval=128
lucene.indexer.useNioMemoryMapping=true
# over-ride to false for pre 3.0 behaviour
lucene.indexer.postSortDateTime=true
lucene.indexer.defaultMLIndexAnalysisMode=EXACT_LANGUAGE_AND_ALL
lucene.indexer.defaultMLSearchAnalysisMode=EXACT_LANGUAGE_AND_ALL
#
# The number of terms from a document that will be indexed
#
lucene.indexer.maxFieldLength=10000

# Should we use a 'fair' locking policy, giving queue-like access behaviour to
# the indexes and avoiding starvation of waiting writers? Set to false on old
# JVMs where this appears to cause deadlock
lucene.indexer.fairLocking=true

#
# Index locks (mostly deprecated and will be tidied up with the next lucene upgrade)
#
lucene.write.lock.timeout=10000
lucene.commit.lock.timeout=100000
lucene.lock.poll.interval=100

lucene.indexer.useInMemorySort=true
lucene.indexer.maxRawResultSetSizeForInMemorySort=1000
lucene.indexer.contentIndexingEnabled=true

index.backup.cronExpression=0 0 3 * * ?

lucene.defaultAnalyserResourceBundleName=alfresco/model/dataTypeAnalyzers


# When transforming archive files (.zip etc) into text representations (such as
#  for full text indexing), should the files within the archive be processed too?
# If enabled, transformation takes longer, but searches of the files find more.
transformer.Archive.includeContents=false

# Database configuration
db.schema.stopAfterSchemaBootstrap=false
db.schema.update=true
db.schema.update.lockRetryCount=24
db.schema.update.lockRetryWaitSeconds=5
db.driver=org.gjt.mm.mysql.Driver
db.name=alfresco
db.url=jdbc:mysql:///${db.name}
db.username=alfresco
db.password=*
db.pool.initial=10
db.pool.max=40
db.txn.isolation=-1
db.pool.statements.enable=true
db.pool.statements.max=40
db.pool.min=0
db.pool.idle=-1
db.pool.wait.max=-1
db.pool.validate.query=
db.pool.evict.interval=-1
db.pool.evict.idle.min=1800000
db.pool.validate.borrow=true
db.pool.validate.return=false
db.pool.evict.validate=false
#
db.pool.abandoned.detect=false
db.pool.abandoned.time=300
#
# db.pool.abandoned.log=true (logAbandoned) adds overhead (http://commons.apache.org/dbcp/configuration.html)
# and also requires db.pool.abandoned.detect=true (removeAbandoned)
#
db.pool.abandoned.log=false





#
# Caching Content Store
#
system.content.caching.cacheOnInbound=true
system.content.caching.maxDeleteWatchCount=1
# Clean up every day at 3 am
system.content.caching.contentCleanup.cronExpression=0 0 3 * * ?
system.content.caching.minFileAgeMillis=60000
system.content.caching.maxUsageMB=4096
# maxFileSizeMB - 0 means no max file size.
system.content.caching.maxFileSizeMB=0

mybatis.useLocalCaches=false

fileFolderService.checkHidden.enabled=true


ticket.cleanup.cronExpression=0 0 * * * ?

#
# Disable load of sample site
#
sample.site.disabled=false

#
# Download Service Cleanup
#
download.cleaner.startDelayMins=60
download.cleaner.repeatIntervalMins=60
download.cleaner.maxAgeMins=60

# enable QuickShare - if false then the QuickShare-specific REST APIs will return 403 Forbidden
system.quickshare.enabled=true

#
# Cache configuration
#
cache.propertyValueCache.maxItems=10000
cache.contentDataSharedCache.maxItems=130000
cache.immutableEntitySharedCache.maxItems=50000
cache.node.rootNodesSharedCache.maxItems=1000
cache.node.allRootNodesSharedCache.maxItems=1000
cache.node.nodesSharedCache.maxItems=250000
cache.node.aspectsSharedCache.maxItems=130000
cache.node.propertiesSharedCache.maxItems=130000
cache.node.parentAssocsSharedCache.maxItems=130000
cache.node.childByNameSharedCache.maxItems=130000
cache.userToAuthoritySharedCache.maxItems=5000
cache.authenticationSharedCache.maxItems=5000
cache.authoritySharedCache.maxItems=10000
cache.authorityToChildAuthoritySharedCache.maxItems=40000
cache.zoneToAuthoritySharedCache.maxItems=500
cache.permissionsAccessSharedCache.maxItems=50000
cache.readersSharedCache.maxItems=10000
cache.readersDeniedSharedCache.maxItems=10000
cache.nodeOwnerSharedCache.maxItems=40000
cache.personSharedCache.maxItems=1000
cache.ticketsCache.maxItems=1000
cache.avmEntitySharedCache.maxItems=5000
cache.avmVersionRootEntitySharedCache.maxItems=1000
cache.avmNodeSharedCache.maxItems=5000
cache.avmNodeAspectsSharedCache.maxItems=5000
cache.webServicesQuerySessionSharedCache.maxItems=1000
cache.aclSharedCache.maxItems=50000
cache.aclEntitySharedCache.maxItems=50000
cache.resourceBundleBaseNamesSharedCache.maxItems=1000
cache.loadedResourceBundlesSharedCache.maxItems=1000
cache.messagesSharedCache.maxItems=1000
cache.compiledModelsSharedCache.maxItems=1000
cache.prefixesSharedCache.maxItems=1000
cache.webScriptsRegistrySharedCache.maxItems=1000
cache.routingContentStoreSharedCache.maxItems=10000
cache.executingActionsCache.maxItems=1000
cache.tagscopeSummarySharedCache.maxItems=1000
cache.imapMessageSharedCache.maxItems=2000
cache.tenantEntitySharedCache.maxItems=1000
cache.immutableSingletonSharedCache.maxItems=12000
cache.remoteAlfrescoTicketService.ticketsCache.maxItems=1000
cache.contentDiskDriver.fileInfoCache.maxItems=1000
cache.globalConfigSharedCache.maxItems=1000
cache.authorityBridgeTableByTenantSharedCache.maxItems=10

#
# Download Service Limits, in bytes
#
download.maxContentSize=2152852358


#
# Use bridge tables for caching authority evaluation.
#
authority.useBridgeTable=true

Re: Speed up Community Alfresco with Lucene repository properties - help needed

joko71 — Wed, 14 Dec 2022 08:44:11 GMT

Our progress so far:

we found out that our version of Lucene is 2.4.1
we found out that Luke version 3.5, tool for Lucene index analysis, works with Lucene 2.4.1
we found out that we can extend Lucene analyzer for our language, stop words, etc.

Question: how can we tell Alfresco to use our analyzer?

topic Speed up Community Alfresco with Lucene repository properties - help needed in Alfresco Forum

Speed up Community Alfresco with Lucene repository properties - help needed

Re: Speed up Community Alfresco with Lucene repository properties - help needed