One of the most common points of contention I have experienced as an ECM consultant is around the upgrade of very large repositories. It is typically a task feared by IT departments. It is made more complicated when some components are not as upgradeable as others. Alfresco is not immune to this; however, it has fared better than other, more outdated technologies.
One of the biggest points of contention with Alfresco upgrades is upgrading the search engine. Alfresco v4.0 introduced a switch from Apache Lucene to Apache Solr v1. Alfresco v5.0 introduced Solr v4 and Alfresco v5.2 introduced Solr v6 as Alfresco Search Services v1. With each of these changes, the whole repository must be re-indexed. In large repositories, the re-index process may take several weeks or longer. To perform such an upgrade with little to no impact on the end user is very difficult. However, there are some solutions to this problem. To better understand these upgrade issues, let's first cover the general upgrade process.
When performing a major upgrade to Alfresco, you must shut down all instances of Alfresco. A major upgrade is a move in the minor version number, like v5.1 to v5.2. A minor upgrade is a change in the service pack or hot fix number, like v5.2.0 to v5.2.2. In these latter cases, you still have to shut down Alfresco, but you could theoretically perform a rolling upgrade in a clustered environment. No end user downtime is a distinct advantage of a rolling upgrade.
With any upgrade, major or minor, you must back up the database. This is especially true with major upgrades, as those upgrades will inevitably change the schema. You cannot just downgrade the version of Alfresco with an already upgraded database; you must also restore the database.
If possible, back up the content store. Any repositories should already be backed up incrementally. Some storage mechanisms provide the ability to create a snapshot. A snapshot in this context is a zero-time operation that creates a rollback point. This can be a quick, cheap, and easy way to prepare for an upgrade.
During a hot/online backup, it is important to perform the database backup to completion prior to the content store backup or snapshot. When relying on incremental backups, the backup should be turned off before the upgrade takes place. It can be turned back on after the upgrade is deemed successful.
With upgrades that switch Apache Lucene/Solr versions, do not perform an index backup. Indexes from embedded Lucene vs Solr1 vs Solr4 vs Alfresco Search Services (Solr6) are not interchangeable, so a backup would be worthless. Any time you make the switch, you must re-index the whole repository. There are multiple strategies to perform these index engine switches and they are outlined below.
Among the easiest strategies is to not change the indexing engine or its major version. In these cases, you will want to backup the index in case a rollback is required. You can follow the instructions provided by the official Alfresco documentation. Alfresco/Apache Solr v1, Alfresco/Apache Solr v4, and Alfresco Search Services (Solr v6) each may have their own nuances. Whether or not you are dealing with shards, you will want to back up the core configurations too. The location of that configuration depends on your environment.
Any upgrade to v4.x will support Lucene and Solr1. Any upgrade to Alfresco v5.x will support Solr1 and Solr4. Any upgrade to Alfresco Content Services v5.2 will also support Alfresco Search Services v1.x (Solr6). Eventually a version will not support Solr1 and then Solr4, etc...
This upgrade strategy is not available for large gap upgrades. This means it is not available on an upgrade from Alfresco v3.4 using Lucene to Alfresco v5.x, as the latter does not support Lucene. It is also likely that this strategy is not available on an upgrade from any Alfresco version using Solr1 to Alfresco CS v6.x. In those situations, you have to perform an intermediary or parallel upgrade as covered in other sections below.
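The support rules above can be sketched as a small lookup. This is only an illustration of the reasoning in this post: the version keys and engine labels are names I made up for the example, not identifiers from Alfresco, and v6.x is left out because its Solr1 support is only described as unlikely.

```python
# Search engines each Alfresco line supports, as described in this post.
# Keys and engine labels are illustrative names, not Alfresco identifiers.
SUPPORTED_ENGINES = {
    "4.x": {"lucene", "solr1"},
    "5.0": {"solr1", "solr4"},
    "5.1": {"solr1", "solr4"},
    "5.2": {"solr1", "solr4", "solr6"},
}


def can_keep_engine(current_engine: str, target_version: str) -> bool:
    """Return True if the target version still supports the engine in use."""
    return current_engine in SUPPORTED_ENGINES.get(target_version, set())


print(can_keep_engine("solr1", "5.2"))   # True
print(can_keep_engine("lucene", "5.2"))  # False
```

If the check fails, the "keep the engine" strategy is off the table and one of the other strategies applies.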
During a hot/online backup, it is important to perform the index backup to completion prior to the database backup. In case of a rollback, the index can simply catch up to the restored database. This is easy even if the index is days or weeks behind the database. However, if the index gets ahead of the database, it will be out of sync and will never become consistent.
The easiest strategy is to just start using the new indexing engine with a blank index. In this case, the end user will receive degraded search results until the full re-index is completed. This may be acceptable in some use cases and not in others. In most repositories, the indexing completes in minutes or a couple of hours. However, large repositories could take days or weeks or more. The degraded search becomes less and less acceptable in those situations.
If you are using a new directory to store the index, there is no need to perform a backup of the existing one. During a hot/online backup, it is important to perform the index backup to completion prior to the database backup. In case of a rollback, the index can simply catch up to the restored database. This is easy even if the index is days or weeks behind the database. However, if the index gets ahead of the database, it will be out of sync and will never become consistent.
When you are deciding to change the index engine, you can perform what I call a 3 stage upgrade. In this case, do not perform a major upgrade on the Solr version. Instead, upgrade the Alfresco platform and install the new indexing engine alongside the existing one. For instance, if you are using Solr1, upgrade to Alfresco CS 5.2, Alfresco Solr1 v5.2, and install Alfresco Search Services v1.1. The new indexing engine will be empty and not referenced by core Alfresco, having little/no impact on the functioning system. That is stage 1 of the upgrade.
Once upgraded, start a full re-index using the new indexing engine. This just involves creating a new search core or shards. The template configuration should be configured to point to the Alfresco instance so it can track it. The indexing could take hours, days, or weeks or more, depending on multiple factors, chiefly the size of your repository. That is stage 2 of the upgrade.
Once the new index is complete, switch the engine from the legacy one to the new one. This can be done in alfresco-global.properties with the index.subsystem.name or through the Admin Console Search services dialog. If you perform the latter, the configuration will be controlled by the database instead of alfresco-global.properties. This can lead to confusion in the future, so updating index.subsystem.name is recommended instead. After the switch is deemed successful, remove Solr1 and the old index. This is the completion of stage 3 of the upgrade.
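As a rough sketch of the properties-file route, the helper below rewrites a single key in the text of a properties file. The function name is hypothetical, and the values "solr" and "solr6" are my assumption of what the legacy and new engines are called in your version; confirm them against the documentation for your release before using them.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with `key` set to `value`.

    Replaces the first active `key=...` line, drops later duplicates,
    and appends the property if it is absent. Comments are left alone.
    """
    out, done = [], False
    for line in text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith(("#", "!"))
        name = stripped.split("=", 1)[0].strip()
        if not is_comment and "=" in stripped and name == key:
            if not done:
                out.append(f"{key}={value}")
                done = True
            continue  # skip the old line (and any duplicates)
        out.append(line)
    if not done:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# Assumed example values; check the names your Alfresco version expects.
before = "db.name=alfresco\nindex.subsystem.name=solr\n"
print(set_property(before, "index.subsystem.name", "solr6"))
```

Keeping the change in alfresco-global.properties, rather than in the Admin Console, means the active engine stays visible in version-controlled configuration.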
If you are upgrading from Lucene to Alfresco v5.x or later, you can do it in 5 or 6 stages. This strategy is just the application of the 3 stage upgrade multiple times. If you are on v3.x or earlier, you will have to upgrade to v4.2 as an intermediary stage. In that case or if you were still using Lucene on v4.x, you will have to switch from Lucene to Solr1 before upgrading further. You can follow the principles outlined in the 3 stage upgrade to accomplish this task.
You may then do a 3 stage upgrade to Alfresco CS v5.x with Alfresco SS v1.x. This process requires 2 full re-indexes of the repository. One of them requires another intermediary version of Alfresco to run in production for a period of time. That time depends on how long it takes to perform the 1st full re-index. That means the intermediary version needs to be tested and verified as much as the final version. This strategy can be very inefficient and time consuming. However, it is a very pragmatic way to proceed.
This strategy is the primary purpose behind this blog post. It is a rather innovative way to avoid all the issues with the intermediary upgrade strategy while remaining transparent to the end user. In this solution, you will be creating new server instances for Alfresco. This is always the case with virtual servers and cloud architectures anyway. If you are going to reuse the existing servers, this strategy is not simple and should not be used.
When performing an upgrade, it is best to restore the production database to a non-production environment to test and verify the schema upgrade among other things. If you use the aggregate store where a read-only mount of the production content store is a secondary store, you don't need to restore the production content store to the non-production environment. In this non-production environment, install and configure the new Alfresco and its new indexing engine. At this point we have a snapshot of the production environment running on the new hardware, but with an empty search index.
In the Solr configuration file called solrcore.properties, change alfresco.lag to some value large enough to cover the maximum amount of time it will take to test and verify the non-production environment and eventually upgrade production. If you intend to backup/restore the production database to the non-production environment weekly, then the lag only needs to be about 8 days. Be conservative here. If you think it will take a maximum of 3 months to upgrade to production and you won't be routinely restoring the production database, set the alfresco.lag value to 1000 x 60 x 60 x 24 x 30 x 3 = 7,776,000,000 ms. Make sure to set this in the core templates and/or any cores already created.
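The arithmetic is easy to get wrong by a factor of 1000, so here is a minimal sketch that converts a number of days into the millisecond value used above. The helper name is mine, and it assumes 30-day months as in the example.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24  # 86,400,000 ms in one day


def lag_ms(days: float) -> int:
    """Convert a lag expressed in days to milliseconds."""
    return round(days * MS_PER_DAY)


print(lag_ms(8))       # weekly restores plus a day of slack: 691200000
print(lag_ms(30 * 3))  # three 30-day months: 7776000000
```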
Now for the waiting game. Let the index build to completion. If you find you are ready to deploy to production well ahead of your prescribed maximum, you can shut down the index server application, change the lag to a smaller value, and start it back up. Just remember that the lag needs to be larger than the interval between the time the production snapshot was taken and the time the production deployment is scheduled to occur. For instance, if the database snapshot was taken on Feb 1 at midnight and your worst case plan for upgrading production is Oct 1 at midnight, then use a time around 86400000 x 30 x 9. If you find that everything is ready in early March and you want to reschedule the production upgrade to Apr 1 at midnight, then change the lag to 86400000 x (28+31+1).
This procedure will then index the repository only up to a certain time before the original snapshot, namely the current time minus the lag. You cannot let that point in time cross over the snapshot time. If it does, the new Solr index will hold nodes created by the non-production startup, which is unacceptable, and you have to start over.
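To make the date math concrete, here is a small sketch that derives the lag from the snapshot and deployment dates and checks whether the threshold has been crossed. The function names and the one-day safety margin are my own choices, and the year 2023 is picked only because it is not a leap year, matching the 28-day February in the example.

```python
from datetime import datetime, timedelta


def required_lag_ms(snapshot: datetime, deployment: datetime,
                    margin: timedelta = timedelta(days=1)) -> int:
    """Lag in ms covering snapshot-to-deployment plus a safety margin."""
    return int((deployment - snapshot + margin) / timedelta(milliseconds=1))


def threshold_crossed(snapshot: datetime, lag_ms: int, now: datetime) -> bool:
    """True once (now - lag) has moved past the snapshot time."""
    return now - timedelta(milliseconds=lag_ms) > snapshot


snapshot = datetime(2023, 2, 1)  # Feb 1 at midnight (assumed year)
deploy = datetime(2023, 4, 1)    # Apr 1 at midnight
lag = required_lag_ms(snapshot, deploy)

print(lag)                                                    # 5184000000
print(threshold_crossed(snapshot, lag, datetime(2023, 3, 15)))  # False
print(threshold_crossed(snapshot, lag, datetime(2023, 4, 3)))   # True
```

The first value equals 86400000 x (28+31+1). Running a check like the second function on a schedule gives an early warning before the index starts picking up non-production data.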
As stated earlier, at any time you can create a new snapshot of the production database and restore it to the parallel non-production environment. In these cases, you can continue to use the existing Solr instances that may still be building the index. If you do this, it effectively resets the lag time starting with the most recent database snapshot. So in the example above, if you perform another snapshot on Mar 1 at midnight, it will effectively push the lag time out to Nov 1 instead of Oct 1. This is a good solution when you underestimate the production deployment schedule.
Now you are ready to upgrade the production environment. Shut down the production and non-production instances. Reclassify the non-production environment as your production environment. Point the new production environment to the production database and mount the content store read/write without the aggregate store. Change the alfresco.lag property in the solrcore.properties files back to 1000. And finally change any DNS entries that need to be changed to point end users to the new production servers. Once ready, start up the Alfresco components.
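With several cores or shards, it is easy to miss one solrcore.properties file. The sketch below walks a directory tree and rewrites the alfresco.lag line in every such file it finds. The function name and the idea of a single root directory holding all cores are assumptions about your layout; run it only while Solr is shut down.

```python
import re
from pathlib import Path

# Matches an active "alfresco.lag=..." line; group 1 keeps the key and "=".
LAG_LINE = re.compile(r"^([ \t]*alfresco\.lag[ \t]*=).*$", re.MULTILINE)


def reset_lag(solr_home: Path, lag_ms: int = 1000) -> list[Path]:
    """Set alfresco.lag in every solrcore.properties under solr_home.

    Returns the files that were actually changed.
    """
    changed = []
    for props in sorted(solr_home.rglob("solrcore.properties")):
        text = props.read_text()
        updated = LAG_LINE.sub(lambda m: f"{m.group(1)}{lag_ms}", text)
        if updated != text:
            props.write_text(updated)
            changed.append(props)
    return changed
```

Calling `reset_lag(Path("/path/to/solr/home"))` with your own path returns the list of files it touched, which is worth reviewing before starting the Alfresco components.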
You are now upgraded with nearly a full index. This new upgrade will only take a few minutes to perform the database schema changes. Once up and running, the index engine will catch up to the latest data in the repository, closing the lag gap much more quickly than a full re-index would. The time required to catch up can be computed based on the speed of the indexing you measured while the environment was considered non-production. Under a good strategy, it should catch up within hours.
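That estimate is a simple division. The sketch below assumes you measure the backlog and the indexing speed in transactions; the function name and all of the figures are hypothetical, and real throughput will vary with content types and hardware.

```python
def catch_up_hours(backlog_txns: int, txns_per_hour: float) -> float:
    """Estimate hours needed to index the backlog at the measured rate."""
    if txns_per_hour <= 0:
        raise ValueError("txns_per_hour must be positive")
    return backlog_txns / txns_per_hour


# Hypothetical figures: 12 million transactions indexed over 10 days
# in non-production, with 150,000 transactions inside the lag window.
rate = 12_000_000 / (10 * 24)        # 50,000 transactions per hour
print(catch_up_hours(150_000, rate))  # 3.0
```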
To make the catch up time as short as possible, create more frequent production snapshots for the non-production environment. Do it often enough to use a low alfresco.lag value. For instance, set alfresco.lag to 86400000 x 1.1 and automatically create and restore snapshots nightly. The index will then only have to catch up on 1 day of transactions.
It is a great idea to create a non-production environment similar to the one used for this upgrade for longer term purposes. It gives you a real-life environment to reproduce and study production issues. It could create more read-only load on the content store, but the content store is typically not a bottleneck of concern.