Lucene performance/wildcard searches not working

dbevacqua
Champ in-the-making
We have a repo with about 400k nodes and 2.8M properties. A very simple search on the Lucene index for

@cms\:name:research

is taking over half an hour to complete (I have not actually bothered to let it finish). There will be tens of thousands of matches (searches for more obscure words take on the order of 1s). In the wiki and forums there are a number of references to being able to limit the number of results, but I can't find how to do it. Could you tell me?

The other thing is that wildcard searches do not seem to be working, so

@cms\:name:nanotechnology

gives some results, but

@cms\:name:nano
@cms\:name:nano*
@cms\:name:"nano*"
@cms\:name:"nano"*
etc

do not.

Any ideas?

Thanks,

Dominic
11 REPLIES

marcus
Champ in-the-making
I must have been typing my post when you posted this… http://forums.alfresco.com/viewtopic.php?t=3680

I get matches when I use quotes around the search term, but not if I leave the quotes off.

dbevacqua
Champ in-the-making
Hi Marcus

Looked at your posting. What happens if you do

@\{http\://www.lateralminds.com.au/model/name/1.0\}jurisdiction:"Aus*"

?

I ask because for me

@cms\:name:"nano*"

returns nothing, but

@cms\:name:"nanotechnology*"

returns what I would expect.

andy
Champ on-the-rise
Hi

You can limit the query results by using the SearchParameters object to define your search. See setLimitBy and setLimit. There is config to enable this on the client/UI search.

The time is not spent doing the lucene query but evaluating permissions.
If you have 1M matches but can only see one you are likely to have a similar problem. Improving this is on the TODO list.

Lucene supports query styles where you know the tokens and those where you do not ….

See http://wiki.alfresco.com/wiki/Search#Understanding_tokenisation


TEXT:woof

matches the token woof (which could have been generated from Woof etc). It is lowercased by default - but this depends on your tokeniser.

So

TEXT:*oof
TEXT:wo*
TEXT:*oo*
will work.


Phrases (things in "") are tokenised for you and do not support wildcards.
Again, this would be possible and is on the list of stuff that did not make 1.4.


In the examples, with the default config
@cms\:name:nano* should work but it depends on the tokens in the index …
@cms\:name:"nano*" will never work
@cms\:name:"nanotechnology*" should not work ….
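To make the token point concrete, here is a minimal plain-Java sketch. It does not use Lucene or Alfresco at all: `tokenise`, `wildcardMatch` and `anyTokenMatches` are hypothetical helpers that only approximate a lowercasing tokeniser and single-term wildcard matching. It also assumes the wildcard pattern is compared to the tokens as typed (case-sensitive, not analysed). Whether a real query parser lowercases wildcard terms depends on its configuration, and phrase handling is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class TokenWildcardDemo {

    // Rough stand-in for a lowercasing tokeniser (an assumption, not the
    // Alfresco default): split on anything that is not a letter or digit.
    public static List<String> tokenise(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Match one pattern containing * and ? against one token, case-sensitively.
    public static boolean wildcardMatch(String pattern, String token) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return token.matches(regex.toString());
    }

    // A value matches if any one of its index tokens matches the pattern.
    public static boolean anyTokenMatches(String pattern, String value) {
        for (String token : tokenise(value)) {
            if (wildcardMatch(pattern, token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String value = "Australia and New Zealand";
        System.out.println(tokenise(value));                 // [australia, and, new, zealand]
        System.out.println(anyTokenMatches("aust*", value)); // true
        System.out.println(anyTokenMatches("Aust*", value)); // false: tokens are lowercase
        System.out.println(anyTokenMatches("*land", value)); // true
    }
}
```

The point of the sketch is only that a wildcard is matched against the tokens held in the index, not against the original property value, so it is worth checking which tokens are actually there.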

How are you doing these searches?
Are you using the default tokeniser?

Regards

Andy

marcus
Champ in-the-making
Is there a way to browse the Lucene entries stored for a node? I've tried using the Luke Lucene browser, but I don't know which folder in the lucene.indexes directory holds those for my node, or how to show all the information for a noderef even if I did happen to stumble on the node I was interested in.

Edit: The actual property value for the node is "Australia and New Zealand", and I'm not getting any match when doing

@lms\:jurisdiction:australia*
@lms\:jurisdiction:aust*
@lms\:jurisdiction:"Aust*"

but I do get matches with

@lms\:jurisdiction:"Australia and New Zealand"
@lms\:jurisdiction:"Australia*"

If I set this property in my model to not be tokenised, what will be the difference in the generated query from SearchContext.buildQuery? (I'll try this out later, but just in case my result differs from expected)

dbevacqua
Champ in-the-making
Hi Andy

Thanks for your quick response.

I have used limit and limitBy as you suggested, and noted what you said about permission evaluation. It did raise another issue, namely is it possible to delay this evaluation? I ask because I need to be able to do join type searches, where the intermediate ResultSets may be large, but the final one not so - in this case I think it makes sense to evaluate permissions on the final ResultSet. I notice that in PermissionEvaluationMode there is a commented value LAZY which may be connected.


As for the wildcard problem, I am searching using the search service e.g.


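            // Build the search definition for a Lucene query against one store.
            SearchParameters s = new SearchParameters();
            s.setLanguage(SearchService.LANGUAGE_LUCENE);
            // Cap the final result set at 100 rows.
            s.setLimitBy(LimitBy.FINAL_SIZE);
            s.setLimit(100);
            s.setQuery(query);
            s.addStore(storeRef);

            // r is a ResultSet declared earlier.
            r = getSearchService().query( s );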

I am using the default tokeniser, and I still can't get any matches with wildcards.

I also get no matches searching on the TEXT field, e.g.

TEXT:nanotechnology

Have tried rebuilding the indexes but to no avail.

Any thoughts?

Thanks,

Dominic.

andy
Champ on-the-rise
Hi

You can use Luke (with Alfresco up to version 1.3). From 1.4 onwards it is more complicated.

You can query as you would normally - except for path.
@cms\:title:wo* should work (but not prefix wild cards as you will get the default lucene query parser).

You can check what tokens are available for each field using Luke.

To do joins you could use the "searchService" bean. This does not apply permissions. You can do your queries, joins, etc., but you will then have to apply permissions on top. The easiest way is to add your own service and let the security layer filter a ResultSet.

The support for joins via lucene is not as neat as it could be.
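As a rough illustration of the "join first, apply permissions last" idea, here is a minimal plain-Java sketch. It makes no Alfresco calls. The node ids are plain strings, and `canRead` is a hypothetical stand-in for whatever permission check your own service or the security layer would perform.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

public class JoinThenFilter {

    // Intersect two unfiltered intermediate result sets, then run the
    // (assumed expensive) permission check only on the ids that survive the join.
    public static Set<String> joinThenFilter(Set<String> left,
                                             Set<String> right,
                                             Predicate<String> canRead) {
        Set<String> joined = new LinkedHashSet<>(left);
        joined.retainAll(right);                   // the "join": keep ids present in both
        joined.removeIf(id -> !canRead.test(id));  // permissions applied on the final set only
        return joined;
    }

    public static void main(String[] args) {
        Set<String> first = new LinkedHashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> second = new LinkedHashSet<>(Arrays.asList("b", "c", "d"));
        // Hypothetical rule: the current user may read everything except "c".
        Predicate<String> canRead = id -> !id.equals("c");
        System.out.println(joinThenFilter(first, second, canRead)); // [b]
    }
}
```

With large intermediate sets and a small final set, the permission check runs only on the handful of ids left after the join rather than on every intermediate match.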

Regards

Andy

dbevacqua
Champ in-the-making
Hi Andy

Took your advice on using the searchService bean and that has had the desired effect. Joins coming along nicely.

I had a go with Luke and noticed some interesting things:

1. the analyser is slightly different (org.apache.lucene.analysis.SimpleAnalyzer) - won't accept the one in the dictionary model (org.apache.lucene.analysis.standard.StandardAnalyzer), possibly a classpath issue.

2. wildcard searches (equivalent to @cms:name:nano*) work where they didn't using searchService (query parser problem?)

3. none of my fields are marked as indexed, tokenised or stored

4. none of my fields appear to have anything in them (the String value appears as "<not available>")

5. TEXT, ISNODE, ISROOT… have nothing in them

Not really sure where to look next. Any ideas?

Dominic

dbevacqua
Champ in-the-making
Similarly the ANCESTOR field appears not to contain anything, both in Luke and using queries that work using PARENT. All relevant nodes extend sys:container.

dbevacqua
Champ in-the-making
Also, I'm trying to get the contents of the PARENT and ANCESTOR fields from the ResultSetRows. Any pointers?
Dominic