
searchService returns same nodeRef twice (duplicate index in solr)

vincent-kali
Star Contributor

Dear all,

For some reason, we're using a custom REST API to perform searches on the repo (Alfresco Community 5.1.g). We discovered that this custom API sometimes returns the same nodeRef more than once.

Search code is below:

<code>

// Types used below come from org.alfresco.service.cmr.search
// and org.alfresco.service.cmr.repository.
ResultSet resultSet = null;
List<NodeRef> results = null;
try {
    SearchParameters sp = new SearchParameters();
    // LANGUAGE_SOLR_FTS_ALFRESCO is a static constant, so reference it via the interface
    sp.setLanguage(SearchService.LANGUAGE_SOLR_FTS_ALFRESCO);
    sp.addStore(new StoreRef("workspace://SpacesStore"));
    sp.setQuery(query);
    sp.setMaxItems(maxItems);
    SortDefinition sortDefinition = new SortDefinition(
            SearchParameters.SortDefinition.SortType.FIELD, "@" + sortField, sortAscending);
    sp.addSort(sortDefinition);
    logger.debug(" search - query: " + sp.getQuery());
    resultSet = this.services.getSearchService().query(sp);
    logger.debug("Results found: " + resultSet.getNumberFound());
    results = resultSet.getNodeRefs();
} finally {
    // Always release the result set, even if the query throws
    if (resultSet != null) {
        resultSet.close();
    }
}
return results;

</code>

The solr4 report indicates: "Count of duplicate nodes in the index":"100", meaning that there are errors in the solr4 indexes.

1) Does somebody know how to fix this?

2) When running the same query using the "alfresco/service/slingshot/node/search" API, only one result is returned. Does that mean a duplicate check is performed in the Java code? (I did not find anything related to this in the code.)

Thanks

vincent
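
Until the index itself is repaired, a client-side workaround is to drop repeated entries from the returned list. The following is a minimal sketch, not something the search API does for you: `ResultDeduplicator` and `dedupe` are hypothetical names, and the approach assumes the element type (here `NodeRef`) implements `equals()` and `hashCode()` consistently.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical helper, not part of the Alfresco API. */
final class ResultDeduplicator {
    private ResultDeduplicator() {}

    /**
     * Returns a new list without repeated elements. The first occurrence of
     * each element is kept, so the sort order of the search results survives.
     */
    static <T> List<T> dedupe(List<T> results) {
        if (results == null) {
            return null;
        }
        // LinkedHashSet removes duplicates while preserving insertion order
        return new ArrayList<>(new LinkedHashSet<>(results));
    }
}
```

In the search code above this would be applied as `results = ResultDeduplicator.dedupe(resultSet.getNodeRefs());`. Note that a page requested with `setMaxItems` can then contain fewer than `maxItems` entries, and `getNumberFound()` still counts the duplicates.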

21 REPLIES

afaust
Legendary Innovator

There is no difference between Enterprise and Community Edition regarding the approach of using a separate core (on same system or a separate SOLR does not matter either). Actually, Community Edition is way more flexible here due to the SOLR licensing for Enterprise.


The conversion via JODConverter is not "better" per se, e.g. it is not faster in any way. The only improvement it brings is that JODConverter can be used to utilise parallel instances of LibreOffice and helps with LibreOffice process health by restarting the processes automatically.

Setting Alfresco in 100% read-only mode is impossible unless you use a DB user with only read-access privileges. There are various code pieces during startup that overrule any read-only setting configured via alfresco-global.properties (e.g. the default transaction mode which you can set). And I assume those functions will fail if you use a database user with read-only access. But it is possible to run an Alfresco in 98% read-only mode that is shielded from any user requests and only serves SOLR. A couple of my customers are doing that.

In Community Edition you'd either have to use a 3rd-party clustering module to ensure its caches are consistent or disable the core caches for nodes to make sure that you always read the consistent state from the database.

mehe
Elite Collaborator

I thought the JOD converter is much faster, because you can use parallel instances of LibreOffice for conversions (as long as you have CPU cores; normally I use 4 to 6 instances on different ports if there are many Office conversions to do). Since the indexer is no longer single-threaded, it can also use the parallel instances for complex conversions to text, so I thought this would be better than the single LibreOffice thread on Community. Am I missing something or did I misunderstand the whole thing?

afaust
Legendary Innovator

I am just saying that the transformation via JODConverter is not faster when comparing single-process to single-process. If you have the resources to parallelize, JODConverter will of course be more efficient overall.

douglascrp
World-Class Innovator

You can use JOD converter on Community now.

Check this out dgcloud / alfresco-remote-jodconverter — Bitbucket 

mehe
Elite Collaborator

Hi Douglas,

Cool project - looks like you are involved 🙂

Thank you for the link, I'll give it a try in the near future. I have been looking for something like that for a long time.


cu, Martin

douglascrp
World-Class Innovator

No, I am not involved in the project.

All I did was to test it, and it works.

vincent-kali
Star Contributor

Many thanks for all your advice and comments.

Axel, when you say "using a separate core on same system", do you mean running two separate solr cores in parallel, both connected to a single alfresco instance? Is that possible? I have no clue how to do that...

The easiest way (but not the shortest one) for me would be to clone the full system as Martin says...

The link to your TMQ session looks very helpful, I'll check that !

afaust
Legendary Innovator

Yes, I do mean running separate cores in parallel. Since a core is made up of a configuration folder in solrHome that contains a core.properties file, you can simply duplicate one of the existing folders (e.g. workspace-SpacesStore or alfresco, depending on what they are called in your system), give it a distinct name and also configure its solrcore.properties to use a distinct storage location for its index. The next time you start SOLR, the new core config folder will be picked up and that core will start tracking Alfresco as per its configuration.
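
The folder duplication described above could be scripted roughly as follows. This is only a sketch under assumptions that should be checked against your own installation: that `solrcore.properties` sits in the core's `conf` subfolder, that the core name is the `name` key in `core.properties`, and that the index location is controlled by the `data.dir.store` key. `SolrCoreCloner` and its parameters are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that copies a core config folder under a new name. */
final class SolrCoreCloner {
    private SolrCoreCloner() {}

    static Path cloneCore(Path solrHome, String sourceCore, String newCore,
                          String newDataStore) throws IOException {
        Path source = solrHome.resolve(sourceCore);
        Path target = solrHome.resolve(newCore);
        if (Files.exists(target)) {
            throw new IOException("Target core folder already exists: " + target);
        }
        // Collect the tree first so the directory stream is closed before copying
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(source)) {
            entries = walk.collect(Collectors.toList());
        }
        // Files.walk lists a directory before its contents, so parents exist in time
        for (Path entry : entries) {
            Path dest = target.resolve(source.relativize(entry).toString());
            if (Files.isDirectory(entry)) {
                Files.createDirectories(dest);
            } else {
                Files.copy(entry, dest);
            }
        }
        // Assumed keys and file locations; verify them in your own solrHome
        setProperty(target.resolve("core.properties"), "name", newCore);
        setProperty(target.resolve("conf").resolve("solrcore.properties"),
                "data.dir.store", newDataStore);
        return target;
    }

    /** Rewrites one key. Properties.store() drops comments and reorders keys. */
    private static void setProperty(Path file, String key, String value) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        props.setProperty(key, value);
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, null);
        }
    }
}
```

A call such as `SolrCoreCloner.cloneCore(solrHome, "alfresco", "alfresco-rebuild", "workspace/SpacesStore-rebuild")` would be run while SOLR is stopped. The core and store names in that call are made up for illustration.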

vincent-kali
Star Contributor

OK, I'll test the method you mentioned, and potentially put solr on a new server for better performance.

BTW, I can confirm that some duplicate entries in the solr index are fixed automatically (a query that returns a duplicate DBID on day X returns a single node a day later). Does that make sense to you? (We're running massive bulk loads on this platform.)

thanks,

vincent

andy1
Star Collaborator

Hi

The SOLR index can fix itself for many issues without reindexing everything.

localhost:8080/solr/admin/cores?action=FIX&wt=json

It should fix any duplicates, stuff that is missing, etc.

You can also reindex nodes that match a query - or just do them one at a time.
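
As a rough sketch, the admin call above can be built and issued from Java like this. `SolrAdminUrls` is a hypothetical helper, the `http://` scheme is an assumption (the URL above has none), and many installations expose SOLR only over HTTPS with client certificates, which this plain-HTTP sketch does not handle. The `extraParams` map is there for actions that take arguments; the `nodeid` parameter mentioned below is quoted from memory and should be verified for your version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.util.Map;

/** Hypothetical helper for calling the SOLR core admin endpoint. */
final class SolrAdminUrls {
    private SolrAdminUrls() {}

    /** Builds <baseUrl>/admin/cores?action=<action>[&key=value...]&wt=json. */
    static URI coresAction(String baseUrl, String action, Map<String, String> extraParams) {
        StringBuilder sb = new StringBuilder(baseUrl);
        if (!baseUrl.endsWith("/")) {
            sb.append('/');
        }
        sb.append("admin/cores?action=").append(encode(action));
        for (Map.Entry<String, String> e : extraParams.entrySet()) {
            sb.append('&').append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        sb.append("&wt=json");
        return URI.create(sb.toString());
    }

    /** Issues a GET request and returns the response body as text. */
    static String get(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `SolrAdminUrls.get(SolrAdminUrls.coresAction("http://localhost:8080/solr", "FIX", java.util.Collections.emptyMap()))` would trigger the fix. Reindexing a single node would, if the parameter name is right, be `coresAction(base, "REINDEX", map)` with `nodeid` set to the node's DBID in the map.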

As Axel has said, there is no reason you cannot have more than one solr index built from alfresco.

With Community you can define one index to use. The second one you are building will be ignored for queries, although building it will add some extra load. Once you are done you just need to flip over the configuration and use the new index. There are no helpful admin screens to do this in Community, and you will have to stop and restart to pick up the property changes.

If we can nail the root cause of anything like this, it will be at the top of the fix list!

It really helps everyone if you can describe what you think the cause may be and raise it in ALF.

In general the fraction of deleted nodes in the index is not an issue. The background merge operations in lucene consider this along with other stuff when they decide which segments to merge. Index optimisation is not required as it was years ago, and you will in fact throw away some segment-level caches. Lucene improved support for lots of segments quite some time ago. Yes, a few things scale with doc count, but not enough to worry about.

For index rebuild time it depends what you measure. In SOLR 4 and 6 metadata is indexed ahead of content. SOLR caches the docs it adds to the repository for a number of reasons; one is to avoid content transformation at rebuild. Sharing the content cache is not good, as two indexes may both try to write to it, so you would have to copy it. I will give this some more thought. It would be easy enough to have one index use the cache read-only, for example.

Andy