topic Re: searchService returns same nodeRef twice (duplicate index in solr) in Alfresco Forum

searchService returns same nodeRef twice (duplicate index in solr)

vincent-kali — Wed, 14 Jun 2017 15:40:26 GMT

Dear all,

For some reason, we're using a custom REST API to perform searches on the repo (Alfresco Community 5.1.g). We discovered that "sometimes" this custom API returns the same nodeRef more than once. The search code is below:

ResultSet resultSet = null; List<NodeRef> results = null; try{

Re: searchService returns same nodeRef twice (duplicate index in solr)

mehe — Thu, 15 Jun 2017 16:18:11 GMT

I had a similar problem with 5.0.?; there was also a JIRA for this: [MNT-13767] Using disjunction "OR" in CMIS query returns wrong number of results when SOLR 4 is used - Alfresco JIRA

To get rid of the duplicates, you should reindex your repo (see JIRA) or use the fix option described in Troubleshooting Solr Index | Alfresco Documentation

If your system creates new duplicates, you could try an update to a newer version - if not bound to the older version for some reason.
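
Until the index itself is repaired, a custom API could hide the symptom by dropping repeated entries before returning its results. The following is only a minimal sketch of that workaround: the class and method names are hypothetical, and plain strings stand in for NodeRef so that it runs without Alfresco on the classpath. It assumes the result type has value-based equals/hashCode.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of the Alfresco API.
final class ResultDeduplicator {

    // Returns the items in their original order with repeats removed.
    // Relies on equals/hashCode of T to decide what counts as a repeat.
    static <T> List<T> dedupe(List<T> items) {
        return new ArrayList<>(new LinkedHashSet<>(items));
    }

    public static void main(String[] args) {
        // Strings stand in for node references in this sketch.
        List<String> hits = List.of(
                "workspace://SpacesStore/aaa",
                "workspace://SpacesStore/bbb",
                "workspace://SpacesStore/aaa");
        System.out.println(dedupe(hits));
    }
}
```

This only masks the duplicates on the client side; the index still contains them, so the reindex or fix option above remains the real cure.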

Re: searchService returns same nodeRef twice (duplicate index in solr)

vincent-kali — Fri, 16 Jun 2017 08:04:54 GMT

Thanks for your reply.

The repo contains millions of documents... A complete reindex may take days? Do you have any benchmarks on this?

Is Solr available for queries during the reindex period? Does the Alfresco repo have to re-transform all the content to post it to Solr? (That would be unacceptable in production.)

I also noticed that some duplicates are cleaned up automatically by Solr... Is there a background process in Solr that cleans up this kind of error?

I finally found that the NodeBrowser/standard search API performs a DB query, while my code performs a Solr query (that's why I get some duplicates in my API, but not in NodeBrowser).

This happens even when I set the query consistency to "QueryConsistency.TRANSACTIONAL_IF_POSSIBLE" in my code.

I was expecting a query like =cm\:name:myFileName.txt to go to the DB instead of Solr... Am I wrong?

Thanks for your comments/advice

Vincent

Re: searchService returns same nodeRef twice (duplicate index in solr)

mehe — Fri, 16 Jun 2017 08:38:19 GMT

Solr reindexing is (so far) a "destructive" operation: you have to delete the index and restart Alfresco. (I always rename the index dir instead, so I can switch back to the old index if the planned downtime turns out to be too short for the rebuild.) Alfresco will then rebuild the index and, as you feared, the content will also be transformed and reindexed. In the newer Alfresco versions there are some kind of "content segments" in the Solr data store; maybe there is a way to spare the indexing process from transforming every document again, but I don't know.

Solr is available during the reindex, but you can only "see" the data that has already been processed, and Solr is under heavy load while reindexing. So you should only reindex when nobody is working with Alfresco (weekend, planned downtime).

Auto cleaning? Don't know... But you can use the described "fix" option as a first try to eliminate the duplicates; that won't be so harmful to your users, although on big repos I do that at night.

Reindexing time also depends on the size of your content and on your server hardware. Reindexing took me 2 days for a repo with about 10,000,000 docs at 3 TB of content. You can tune the reindexing process in solrcore.properties (batch size, number of threads and so on). (Maybe store the index on SSD?)

If a reindex on your production system is not possible, you can clone your system from a backup (DB and content), reindex the clone, transfer the new index to your production system and switch Solr to the new index (stop Solr, move the index data dir and start it again). The index tracker will recognize the missing transactions and catch up in a short(er) time. This minimizes your Alfresco downtime.

You can also use the clone for a benchmark of your reindexing process.

Are you using the same tomcat for alfresco and solr?

I think queries like =cm\:name:myFileName.txt are FTS (full-text) queries that always operate on the index, but I haven't tried QueryConsistency.TRANSACTIONAL_IF_POSSIBLE until now. I thought this had to be configured in the repo too, and that some extra DB indexes have to be created when using it...

regards,

Martin

Re: searchService returns same nodeRef twice (duplicate index in solr)

afaust — Fri, 16 Jun 2017 10:15:54 GMT

There is technically no need for a downtime during re-indexing. You can always create a new SOLR core to build a new index while you keep the old core around for continued search availability. Once the new SOLR core is done indexing, you can simply switch out the index.

Vincent, did you check your historical SOLR logs for any indexing errors? Often I find that index inconsistencies are the result of exceptions that people - for some reason - keep ignoring.

The 10 million documents in 2 days that Martin mentions sounds like a reasonable amount for a "standard" (non-optimised) system. There are a lot of factors that affect the duration, e.g. number/size of transactions, ACLs etc. The best I have seen without extreme resources / scaling is about 300,000 to 400,000 documents per hour.
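
To put these figures side by side, here is a rough back-of-the-envelope sketch. The class and method names are made up for illustration, and it assumes a constant indexing rate, which a real rebuild will not have.

```java
// Hypothetical helper for rough reindex estimates; assumes a constant rate.
final class ReindexEstimate {

    // Average documents per hour for a rebuild that took the given number of days.
    static double docsPerHour(long docCount, double days) {
        return docCount / (days * 24.0);
    }

    // Hours needed to index docCount documents at the given hourly rate.
    static double hoursNeeded(long docCount, double docsPerHour) {
        return docCount / docsPerHour;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;
        // Figures quoted in the thread: 2 days for 10 million docs,
        // versus 300,000 to 400,000 docs per hour on a well-tuned system.
        System.out.printf("2-day rebuild rate: %.0f docs/hour%n", docsPerHour(docs, 2));
        System.out.printf("At 300,000/hour: %.1f hours%n", hoursNeeded(docs, 300_000));
        System.out.printf("At 400,000/hour: %.1f hours%n", hoursNeeded(docs, 400_000));
    }
}
```

So the 2-day rebuild corresponds to roughly 208,000 documents per hour, and the best rates mentioned would bring the same repository down to somewhere between 25 and 34 hours.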

Queries like =cm:name:"myFileName.txt" are DB-compatible and by default Alfresco is set to the query consistency TRANSACTIONAL_IF_POSSIBLE. Martin is correct that additional indices have to be created on the DB and unfortunately Alfresco by default does not do this unless you configure:

system.metadata-query-indexes.ignored=false
system.metadata-query-indexes-more.ignored=false

For more information, see the full session I did at BeeCon about transactional metadata queries. (slides)

Re: searchService returns same nodeRef twice (duplicate index in solr)

mehe — Fri, 16 Jun 2017 10:27:12 GMT

Hi Axel,

Have you tried indexing a new core with a big repo and solr4? I had a very slow, nearly inaccessible system when trying this with the users online. Also the libreoffice conversion was a bottleneck (no jod converter on community).

So I decided to use a planned downtime...

Regards,

Martin

Re: searchService returns same nodeRef twice (duplicate index in solr)

cesarista — Fri, 16 Jun 2017 12:19:37 GMT

Hi Martin Ehe,

In these cases, I reindex in parallel on a dedicated barebone SOLR machine (with as many resources as possible: CPU, SSD disks for a shorter reindex time) that also has alfresco.war on it, so the indexing process runs on the local machine and only disturbs database resources (not the other Alfresco nodes and their service). When the indices are ready, I copy them to the original SOLR machine(s), using the barebone machine as the replacement in the SOLR balancer (if any). So downtime is not strictly necessary, but the reindex may take a long time depending on your CPU and disk resources.

It is not exactly auto-cleaning, but reindexing always produces a healthier index, without the "deleted files" that may degrade your searches and inflate the index size.

Regards.

--C.

Re: searchService returns same nodeRef twice (duplicate index in solr)

mehe — Fri, 16 Jun 2017 12:31:47 GMT

Hi Cesar Capillas,

Thank you for this recommendation! So far I have only been able to reindex Enterprise systems without downtime, having a spare Solr/alfresco.war node.

I didn't dare to spin up a second node in the community version.

So the proposed setup would be:

- temporary server with alfresco.war (from prod system, in case there are models applied) and Solr

- connect the temporary server to prod DB and prod filesystem

- reindex on temp system

Do I have to take care of disabling some cleanup jobs or something like this?

Re: searchService returns same nodeRef twice (duplicate index in solr)

cesarista — Fri, 16 Jun 2017 12:42:36 GMT

Well, I did it for Enterprise edition. Not sure if it applies exactly to Community edition.

--C.

Re: searchService returns same nodeRef twice (duplicate index in solr)

mehe — Fri, 16 Jun 2017 13:00:03 GMT

ok, on Enterprise Systems I always use 2 solr nodes with alfresco in the cluster and don't have a downtime when reindexing one of the machines - conversion of documents to text is also better on enterprise versions, because they can scale the libreoffice conversion via jod converter.

The original question was in context of alfresco 5 community, which has no cluster option.

But the scenario could work with community, even if it's not cluster aware, because the index-tracker just asks the db about the transactions and reindexes metadata and reads content... have to test that... would make the clone unnecessary... hmmm...

Axel Faust, have you ever tried something like what Cesar Capillas proposed for Enterprise on the Community edition? Or is there an easier way besides the extra core? (It could be a poor man's Solr cluster.)

Is it possible to set the "second" alfresco in readonly mode and nevertheless do a full solr reindex?

Re: searchService returns same nodeRef twice (duplicate index in solr)

afaust — Fri, 16 Jun 2017 13:49:10 GMT

There is no difference between Enterprise and Community Edition regarding the approach of using a separate core (on same system or a separate SOLR does not matter either). Actually, Community Edition is way more flexible here due to the SOLR licensing for Enterprise.

The conversion via JODConverter is not "better" per se, e.g. it is not faster in any way. The only improvement it brings is that JODConverter can be used to utilise parallel instances of LibreOffice and helps with LibreOffice process health by restarting the processes automatically.

Setting Alfresco in 100% read-only mode is impossible unless you use a DB user with only read-access privileges. There are various code pieces during startup that overrule any read-only setting configured via alfresco-global.properties (e.g. the default transaction mode which you can set). And I assume those functions will fail if you use a database user with read-only access. But it is possible to have a 98% read-only mode Alfresco that is shielded from any user requests that supports only SOLR. A couple of my customers are doing that.

In Community Edition you'd either have to use a 3rd-party clustering module to ensure its caches are consistent or disable the core caches for nodes to make sure that you always read the consistent state from the database.

Re: searchService returns same nodeRef twice (duplicate index in solr)

mehe — Fri, 16 Jun 2017 14:09:40 GMT

I thought the JOD converter is much faster, because you can use parallel instances of LibreOffice for conversions (as long as you have CPU cores; normally I use 4 to 6 instances on different ports if there are many Office conversions to do). Since the indexer is no longer single-threaded, it can also use the parallel instances to convert complex documents to text, so I thought this would be better than the single LibreOffice thread on Community. Am I missing something or did I misunderstand the whole thing?

Re: searchService returns same nodeRef twice (duplicate index in solr)

vincent-kali — Fri, 16 Jun 2017 14:32:40 GMT

Many thanks for all your advice and comments.

Axel, when you say "using a separate core on same system", do you mean running two separate Solr cores in parallel, both connected to a single Alfresco instance? Is that possible? I have no clue how to do that...

The easiest way (but not the shortest one) for me would be to clone the full system as Martin says...

The link to your TMQ session looks very helpful, I'll check that !

Re: searchService returns same nodeRef twice (duplicate index in solr)

afaust — Fri, 16 Jun 2017 15:09:12 GMT

I am just saying that the transformation via JODConverter is not faster when comparing single-process to single-process. If you have the resources to parallelize, JODConverter will of course be more efficient overall.

Re: searchService returns same nodeRef twice (duplicate index in solr)

afaust — Fri, 16 Jun 2017 15:12:24 GMT

Yes, I do mean running separate cores in parallel. Since a core is made up of a configuration folder in solrHome that contains a core.properties file, you can simply duplicate one of the existing folders (e.g. workspace-SpacesStore or alfresco, depending on how they are called in your system), give it a distinct name and also configure its solrcore.properties to use a distinct storage location for its index. Next time you start SOLR, the new core config folder will be picked up and that core will start tracking Alfresco as per its configuration.
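
The folder duplication described above could be scripted roughly as follows. This is a sketch under several assumptions rather than a tested procedure: it assumes the core folder holds core.properties with a name entry and conf/solrcore.properties with a data.dir.root entry for the index location, that the index data itself lives outside the core folder, and that SOLR is stopped while it runs. The class, method and folder names are hypothetical; check the property names against your own installation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; run it only while SOLR is stopped.
final class CoreCloner {

    // Copies solrHome/<sourceCore> to solrHome/<newCore>, renames the copy and
    // points it at its own index location. Returns the new core folder.
    static Path cloneCore(Path solrHome, String sourceCore, String newCore, String newDataRoot)
            throws IOException {
        Path source = solrHome.resolve(sourceCore);
        Path target = solrHome.resolve(newCore);
        if (Files.exists(target)) {
            throw new IOException("Target core folder already exists: " + target);
        }
        // Collect the tree first; parents come before their children.
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(source)) {
            paths = walk.collect(Collectors.toList());
        }
        for (Path p : paths) {
            Path dest = target.resolve(source.relativize(p));
            if (Files.isDirectory(p)) {
                Files.createDirectories(dest);
            } else {
                Files.copy(p, dest);
            }
        }
        // Assumed property names: "name" and "data.dir.root".
        setProperty(target.resolve("core.properties"), "name", newCore);
        setProperty(target.resolve("conf").resolve("solrcore.properties"), "data.dir.root", newDataRoot);
        return target;
    }

    // Replaces (or appends) a plain key=value line, leaving comments and order intact.
    // Only the simple "key=value" form is handled.
    static void setProperty(Path file, String key, String value) throws IOException {
        List<String> lines = new ArrayList<>(Files.readAllLines(file, StandardCharsets.ISO_8859_1));
        boolean replaced = false;
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().startsWith(key + "=")) {
                lines.set(i, key + "=" + value);
                replaced = true;
            }
        }
        if (!replaced) {
            lines.add(key + "=" + value);
        }
        Files.write(file, lines, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 4) {
            System.err.println("usage: CoreCloner <solrHome> <sourceCore> <newCore> <newDataRoot>");
            return;
        }
        System.out.println(cloneCore(Path.of(args[0]), args[1], args[2], args[3]));
    }
}
```

The original core folder is left untouched, so the existing index keeps serving searches while the copy builds its own.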

Re: searchService returns same nodeRef twice (duplicate index in solr)

douglascrp — Fri, 16 Jun 2017 17:14:06 GMT

You can use JOD converter on Community now.

Check this out dgcloud / alfresco-remote-jodconverter — Bitbucket

Re: searchService returns same nodeRef twice (duplicate index in solr)

mehe — Fri, 16 Jun 2017 17:21:45 GMT

Hi Douglas,

Cool project - looks like you are involved 🙂

Thank you for the link, I'll give a try in the nearest future. I was looking for something like that for a long time.

cu, Martin

Re: searchService returns same nodeRef twice (duplicate index in solr)

douglascrp — Fri, 16 Jun 2017 20:08:48 GMT

No, I am not involved in the project.

All I did was to test it, and it works.

Re: searchService returns same nodeRef twice (duplicate index in solr)

vincent-kali — Mon, 19 Jun 2017 10:06:35 GMT

OK, I'll test the method you mentioned, and potentially put Solr on a new server for better performance.

BTW, I confirm that some duplicate index entries in Solr are fixed automatically (a query that returns a duplicate DBID on day X will return a single node a day later). Does that make sense to you? (We're running massive bulk loading on this platform.)

thanks,

vincent

Re: searchService returns same nodeRef twice (duplicate index in solr)

andy1 — Fri, 15 Sep 2017 09:07:18 GMT

The SOLR index can fix itself for many issues without reindexing everything.

localhost:8080/solr/admin/cores?action=FIX&wt=json

It should fix any duplicates, stuff that is missing, etc.
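
For anyone who wants to trigger this from code rather than a browser, a minimal sketch follows. It assumes Java 11 or later and that the SOLR admin endpoint is reachable over plain HTTP without client certificates (a secured setup would need an SSL context on top of this). The class and method names are hypothetical, and the base URL, including the http scheme, is an assumption to adapt to your system.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper around the FIX action URL quoted above.
final class SolrFixCall {

    // Builds the admin URL for the FIX action from a base such as http://localhost:8080/solr.
    static URI fixUri(String solrBaseUrl) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return URI.create(base + "/admin/cores?action=FIX&wt=json");
    }

    // Sends the request and returns the raw response body.
    static String triggerFix(String solrBaseUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(fixUri(solrBaseUrl)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Only prints the URL; call triggerFix(...) against a running SOLR to execute it.
        System.out.println(fixUri("http://localhost:8080/solr"));
    }
}
```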

You can also reindex nodes that match a query - or just do them one at a time.

As has already been said above, there is no reason you cannot have more than one SOLR index built from Alfresco.

With community you can define one index to use. The second one you are building will be ignored - it will add some extra load. Once you are done you just need to flip over the configuration and use the new index. There are no helpful admin screens to do this in community and you will have to stop and restart to pick up the property changes.

If we can nail the root cause of anything like this, it will be at the top of the fix list!

It really helps everyone if you can describe what you think the cause may be and raise it in ALF.

In general the fraction of deleted nodes in the index is not an issue. The background merge operations in Lucene consider this along with other stuff when they decide which segments to merge. Index optimisation is not required as it was years ago, and you will in fact throw away some segment-level caches. Lucene improved support for lots of segments quite some time ago. Yes, a few things scale with doc count, but not enough to worry about.

For index rebuild time it depends what you measure. In SOLR 4 and 6 metadata is indexed ahead of content. SOLR caches the docs it adds to the repository for a number of reasons - one is to avoid content transformation at rebuild. Sharing the content is not good as two indexes may both try to write to the cache - you would have to copy it - I will give this some more thought. It would be easy enough to have one to use the cache read only for example.

Andy