09-21-2022 05:45 AM
I have encountered an Alfresco 7.2.1 containerized instance in operation as a service layer for an application. The Alfresco UIs are not used, but instead its API is used by the custom application to push, pull and search documents.
The system holds millions of documents. When a document is pushed to Alfresco, it is pushed as a child of a common parent workspace node. For the sake of illustration, let's say node 27 has millions of direct children defined in ALF_CHILD_ASSOC.
When such a new child document is added, Solr seems to initiate a bulk download of all children of node 27. Eventually this call times out, yielding the following errors:
SolrInformationServer
Bulk indexing failed, do one node at a time. See the stacktrace below for further details.
SolrInformationServer Unable to get nodes metadata from repository using fromNodeId=27, toNodeId=27, nodeIds=null, fromTxId=null, toTxId=null, txIds=null. See the stacktrace below for further details.
At this point Solr seems to resort to pulling every single child document of node 27, one by one, to reconcile them with its index.
The load on the content server keeps increasing for the entire duration of this operation, and in this particular sample memory consumption reaches upwards of 50GB (after the container's allocation had already been massively increased). Memory exhaustion then leads to endless garbage collection, churn, etc.
Is this normal and expected behaviour? I understand that Solr would need to understand a new child in the context of its parent, but the massive load seems suboptimal. Is using Alfresco in this manner (with a single flat level) a problem, and would a folder hierarchy prevent this?
Any assistance on this would be hugely valued.
09-21-2022 10:29 AM
I'm not sure exactly how the system is set up in your environment or how many resources you have configured, but in general you should never keep that many nodes in a single folder. It definitely has an impact. Ideally, no more than 2K-3K nodes are advised per folder at the same level; you should implement bucketing instead.
Take a look at this thread as well: https://hub.alfresco.com/t5/alfresco-content-services-forum/is-there-still-a-limit-on-the-amount-of-...
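To illustrate what bucketing could look like, here is a minimal Python sketch, not an Alfresco API. It assumes each document has some stable key (an application-side ID, for example) and derives a nested folder path from a hash of that key, so that no single folder ends up with millions of direct children. The function name `bucket_path` and the `levels`/`width` parameters are hypothetical; a date-based scheme (year/month/day) would serve the same purpose.

```python
import hashlib


def bucket_path(doc_key: str, levels: int = 2, width: int = 2) -> str:
    """Return a relative folder path such as 'ab/cd' for a document key.

    Hypothetical helper: the key is hashed so documents spread evenly
    across buckets, and the same key always maps to the same folder.
    """
    digest = hashlib.sha256(doc_key.encode("utf-8")).hexdigest()
    # Take `levels` consecutive slices of `width` hex characters each.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts)


def leaf_bucket_count(levels: int = 2, width: int = 2) -> int:
    """Number of leaf folders the scheme can produce (16 hex values per char)."""
    return (16 ** width) ** levels


if __name__ == "__main__":
    # Example: where would this (made-up) document key be stored?
    print(bucket_path("invoice-000123"))
    # With the defaults there are 65536 leaf folders, so 10 million
    # documents average roughly 150 per folder.
    print(10_000_000 // leaf_bucket_count())
```

With the defaults, each intermediate folder has at most 256 children and the documents are spread over 65536 leaf folders. The application would then create each document under the computed path below the common parent rather than directly under it; how that path is passed to the repository depends on the API calls your application already uses.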
When you add a new child node to a folder, both the new node (content, metadata and ACL) and its parent node (the folder's updated timestamp metadata and ACL) are re-indexed.