Solr Indexing Queue In Layman's Terms

motionpotion
Champ in-the-making
Could somebody explain how Solr indexing works in terms of a large queue of documents?

For example, if Solr is stopped for some reason, when it restarts does it go through all newly uploaded or created documents in sequence, or does it go through modified documents? In your answer, could you also explain how transactions that affect millions of nodes are handled when indexing has to catch up?

I'm really looking for a diagram of how it works so I can get it clear in my mind.
2 REPLIES

sujaypillai
Confirmed Champ
Hi,

By default, SOLR checks the Alfresco side for changes every 15 seconds. It asks for changes to content (including newly created documents), changes to the content models, and changes to the ACLs on documents, and then indexes those changes in its cores.

SOLR updates its indexes by looking at the transactions that have been committed since it last talked to Alfresco.
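
To picture the catch-up after a restart, here is a small in-memory simulation. It is not Alfresco or SOLR code: the `Transaction` and `Tracker` classes, the `max_txns` batch size and the sample data are all made up for illustration. It assumes each committed transaction has an increasing ID and that the tracker works through unseen transactions in that order. Since creates and updates both show up as transactions, new and modified documents are picked up the same way.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    txn_id: int      # hypothetical: increasing ID assigned at commit
    node_ids: list   # nodes created, updated or deleted in this commit


@dataclass
class Tracker:
    last_indexed_txn: int = 0  # how far the index got before Solr stopped
    indexed_nodes: list = field(default_factory=list)

    def poll(self, repository, max_txns=2):
        """One tracking cycle: index the next few unseen transactions in order."""
        pending = sorted(
            (t for t in repository if t.txn_id > self.last_indexed_txn),
            key=lambda t: t.txn_id,
        )[:max_txns]
        for txn in pending:
            self.indexed_nodes.extend(txn.node_ids)
            self.last_indexed_txn = txn.txn_id
        return len(pending)

    def catch_up(self, repository):
        """Poll until nothing is pending; return how many cycles found work."""
        cycles = 0
        while self.poll(repository):
            cycles += 1
        return cycles


# Transactions committed on the repository side (made-up data).
repository = [
    Transaction(1, [101]),
    Transaction(2, [102, 103]),
    Transaction(3, [101]),            # a later update to node 101
    Transaction(4, [104, 105, 106]),
]

tracker = Tracker(last_indexed_txn=1)  # transaction 1 was indexed before the stop
cycles = tracker.catch_up(repository)
print(cycles, tracker.last_indexed_txn, tracker.indexed_nodes)
```

In this sketch a very large transaction is simply a `Transaction` with a long `node_ids` list, so the tracker does not move past it until all of its nodes have been handled.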

A basic overview of how this works can be found here -
http://docs.alfresco.com/5.0/concepts/solr-overview.html

As shown in the diagram there, HTTP requests go from SOLR to Alfresco:

1. https://localhost:8443/alfresco/service/api/solr/model
SOLR keeps track of new custom content models that have been deployed and downloads them so that it can index the properties in these models.

2. https://localhost:8443/alfresco/service/api/solr/aclchangesets
SOLR downloads any changes to permission settings so that it can filter results by permission at query time.

3. https://localhost:8443/alfresco/service/api/solr/transactions
Any create, update, delete or other action triggers a transaction, and SOLR picks these up through the above URL.

4. https://localhost:8443/alfresco/service/api/solr/textContent
Any change to a document's content is picked up through the above URL.
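
As a quick summary of the four endpoints listed above, the snippet below builds each URL from a common base and pairs it with what it supplies. The base URL is copied from the examples above; the helper name `tracking_urls` and the short descriptions are my own shorthand, not part of any Alfresco API.

```python
from urllib.parse import urljoin

# Base URL taken from the examples above; adjust host and port for your install.
ALFRESCO_SOLR_API = "https://localhost:8443/alfresco/service/api/solr/"

# What each endpoint supplies, in the order listed above (descriptions are shorthand).
TRACKING_ENDPOINTS = {
    "model": "content model definitions",
    "aclchangesets": "permission (ACL) changes",
    "transactions": "node creates, updates and deletes",
    "textContent": "document text",
}


def tracking_urls(base=ALFRESCO_SOLR_API):
    """Return a mapping of endpoint name to full URL (hypothetical helper)."""
    return {name: urljoin(base, name) for name in TRACKING_ENDPOINTS}


for name, url in tracking_urls().items():
    print(f"{TRACKING_ENDPOINTS[name]:35} {url}")
```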


You can get the status of the SOLR index here -
http://localhost:8080/solr4/admin/cores?action=REPORT&wt=xml
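
If you want to pull that report from a script rather than a browser, something like the following should work. It is only a sketch using the Python standard library: the function names are hypothetical, the host and port are the ones from the URL above, and depending on how your install secures SOLR the request may also need authentication or a client certificate, which is not handled here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def report_url(base="http://localhost:8080/solr4"):
    """Build the REPORT URL shown above for a given Solr base URL."""
    query = urlencode({"action": "REPORT", "wt": "xml"})
    return f"{base}/admin/cores?{query}"


def fetch_report(base="http://localhost:8080/solr4", timeout=10):
    """Return the raw XML report as text; needs a reachable Solr instance."""
    with urlopen(report_url(base), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(report_url())
```

Calling `fetch_report()` returns the XML as a string, which you can then save or compare between runs to see whether the index is catching up.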

So when a SOLR server starts, it first polls the Alfresco server to get a status and, depending on the result, starts indexing using the above 4 URLs.

Read more about SOLR here -
http://docs.alfresco.com/5.0/concepts/solr-home.html

@Sujay thanks for the info. Is it possible to identify and remove any large transactions from the queue before they are indexed, or during indexing? That is, transactions above a certain size should not be indexed. What effect would this have if the transaction were an ACL update, for example?