Potential tracking inefficiency by the Alfresco-Solr4

j_fintora
Champ in-the-making

Hi,

We use alfresco-solr4-5.2.e with Alfresco 5.2.e and have found a potential inefficiency.

Solr tracks changes in the associated Alfresco system by periodically requesting information about committed transactions. Tracking is triggered every 15 seconds (property alfresco.cron=0/15 * * * * ? *), and each run requests transactions going back to the time of the last committed transaction (actually one hour before that, because alfresco.hole.retention=3600000).

A single tracking run sends many requests: one for each hour between the time of the last committed transaction and now.

We also found that every hourly interval since the last committed transaction is queried again on every run, i.e. every 15 seconds. For example, when I upload a file to Alfresco and then wait a few hours, the number of tracking requests per run grows by one for each hour since the upload. The same requests are then fired every 15 seconds until I upload another file.
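To make the growth concrete, here is a minimal Python sketch of the behaviour described above. It is not the actual tracker code: the one-hour window size is an assumption taken from the observed "one request per hour", and the function and constant names are made up for illustration.

```python
from datetime import datetime, timedelta

# Values taken from the configuration quoted in this post.
HOLE_RETENTION = timedelta(milliseconds=3_600_000)  # alfresco.hole.retention
CRON_PERIOD = timedelta(seconds=15)                 # alfresco.cron=0/15 * * * * ? *

# Assumption: one request covers one hour, as observed in the access log.
WINDOW = timedelta(hours=1)

# Number of tracking runs per day at a 15-second cadence.
RUNS_PER_DAY = int(timedelta(days=1) / CRON_PERIOD)


def tracking_windows(last_commit, now):
    """Return the (from, to) intervals a single tracking run would query."""
    windows = []
    start = last_commit - HOLE_RETENTION
    while start < now:
        end = min(start + WINDOW, now)
        windows.append((start, end))
        start = end
    return windows


if __name__ == "__main__":
    last_commit = datetime(2017, 1, 1, 16, 0)  # arbitrary example timestamp
    for idle_hours in (0, 3, 24):
        now = last_commit + timedelta(hours=idle_hours)
        per_run = len(tracking_windows(last_commit, now))
        print(f"idle {idle_hours:>2} h: {per_run} requests per run, "
              f"{per_run * RUNS_PER_DAY} per day if the idle time stayed constant")
```

Under these assumptions the request count per run is the idle time in hours plus one (for the hole retention), and it is multiplied by 5760 runs per day.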

So we wonder why it needs to query the same interval more than once, even when the interval lies entirely in the past. Is this an inefficiency, or is there a reason behind it?

The problem is partially mentioned here, but there is nothing about the repeated querying of the same time interval.

Some insight would be appreciated. Thank you.

1 ACCEPTED ANSWER

afaust
Legendary Innovator

There is no state management of "when" SOLR has last queried for changes. SOLR only checks based on the last transaction it has found in the index and uses that transaction's commit time as the basis for the interval. So in those cases where nothing has been done in the system, that information is simply lacking.

Is it inefficient? Yes. Has something changed or is something going to be changed? No, and it's not very likely. Alfresco has never been designed or optimised to be an idle system without any user load for long durations of time, and you would require an idle system for this to even manifest itself. On the other hand - apart from spamming the access logs - these additional requests should be negligible in effective cost to the system. The DB query simply yields no result and the request is done in a handful of milliseconds.

Feel free to file an issue in the Alfresco JIRA to log this as a bug. Any discussion here on this platform does not automatically lead to such topics being tracked as something to be fixed...

11 REPLIES

j_fintora
Champ in-the-making

Thank you for your answer.

Well, thank you for your explanation, although I still don't fully get your point (see below). So before I file a bug, I would like to ask you (or anyone else) here, in case I have overlooked something. This topic has clearly been around for many years (since 2012 at least, see "SOLR causes high CPU usage on idle repo"), but no one is actually doing anything about it. I personally don't consider answers like "disable your access log" or "just upload something to your Alfresco at least once every day or two" to be real solutions.

So, my question is whether the Solr implementation can really be considered sane, given the following observations:

  1. If one doesn't touch Alfresco, the access log still grows by approx. 140M every day.
  2. There is an evident shortcoming in the querying mechanism, which causes the "over-querying" described above. Maybe it should be reformulated like this: when Alfresco and Solr both know that the last transaction happened at 4pm yesterday, why on earth does Solr query Alfresco again and again for every interval from that moment until now? What would be the benefit of such behavior?

afaust
Legendary Innovator

"No one is actually doing anything about it" - For a long time the contribution process was so cumbersome / ineffective that only Alfresco engineers could have been doing anything about it, and for them it did not end up being a top priority. In most production environments this has not bee a relevant issue / topiic, so customers apparently did not report this sufficiently often enough for it to become a priority. 140 M of highly compressable log file can be dealt with easily with logrotate. And if you really wanted you could separate SOLR tracking requests from others before rolling over and compressing logs.

Maybe ‌ or ‌ could comment on this (Andy also participated in that old forum thread you linked back on the old forum platform).