URGENT: Sorting Lucene result on tokenized property possible?

serverok
Champ in-the-making
Hi all,
We have a problem sorting Lucene search results by properties that are indexed and tokenised.

What is the best/fastest way to correctly sort Lucene results by properties that contain multiple tokens (see http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Sort.html)? The properties are indexed like this:
<index enabled="true">
   <atomic>true</atomic>
   <stored>false</stored>
   <tokenised>true</tokenised>
</index>

Is the only choice to sort "manually" in the web script after running the lucene query without sorting?

I tried using the JS sort function:
var result = search.luceneSearch(query);
result.sort(Sorting);
….
function Sorting(a, b) {
   var x = a.properties['cm:title'];
   var y = b.properties['cm:title'];
   // Return 0 for equal titles so the comparator stays consistent.
   if (x == y) return 0;
   return (x > y) ? 1 : -1;
}

But if the Lucene result is a big list of content (100, 200 or more items), then this section
var x = a.properties['cm:title'];
var y = b.properties['cm:title'];
takes a very long time.

Also, each content item has around 50 properties (custom model). Could this be why reading property values takes so long?
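One way to limit how often the slow property read happens is to read each sort key once before sorting, rather than inside the comparator. The following is only a minimal sketch: sortByProperty is a made-up helper name, the case-insensitive comparison is an assumption about the desired order, and the sample objects are plain stand-ins for ScriptNode so the code can be tried outside Alfresco.

```javascript
// Sketch: read each sort key once, then sort on the cached keys.
// sortByProperty is an illustrative name, not part of the Alfresco API.
function sortByProperty(nodes, propName) {
   var keyed = [];
   var i;
   for (i = 0; i < nodes.length; i++) {
      // Exactly one property read per node; this is the expensive call.
      var value = nodes[i].properties[propName];
      keyed.push({
         key: (value == null) ? "" : String(value).toLowerCase(), // assumption: ignore case
         index: i,
         node: nodes[i]
      });
   }
   keyed.sort(function (a, b) {
      if (a.key < b.key) { return -1; }
      if (a.key > b.key) { return 1; }
      return a.index - b.index; // equal keys keep their original order
   });
   var sorted = [];
   for (i = 0; i < keyed.length; i++) {
      sorted.push(keyed[i].node);
   }
   return sorted;
}

// Plain objects standing in for ScriptNode instances.
var sample = [
   { properties: { "cm:title": "Zebra crossing" } },
   { properties: { "cm:title": "alpha release" } },
   { properties: { "cm:title": "Beta notes" } }
];
var sortedSample = sortByProperty(sample, "cm:title");
```

In a web script the call would look like sortByProperty(search.luceneSearch(query), "cm:title"). This does not make a single read any faster: every node is still read once, so it only helps if the comparator was re-reading values that are not cached.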

I would be very grateful for help in solving this problem.

Thanks,
Oleg Koval
17 REPLIES

serverok
Champ in-the-making
Additional information:

This is the simple code of our web script:
// this query produces 458 rows on our fresh alfresco test installation with dummy content:
var query = "TYPE:\"{http://www.alfresco.org/model/content/1.0}content\"";

if( query != null && query.length > 0 ) {
   var result = search.luceneSearch(query); // fast

   // now to sort manually on tokenized values I'll need to look at each attribute:
   // I took out the sorting for simplicity: just saving to a temp variable shows slowness!
   var temp;
   for (var i = 0; i < result.length; i++) {
      temp = result[i].properties['cm:title'];  // very slow!
   }
}

You can see the single place where we hit the bottleneck:
temp = result[i].properties['cm:title'];  // very slow!

I tested the sorting and got these results:
* Fresh Alfresco installation with the default content model (added 1 category and 1 subcategory, created 400  ).
* Timing was measured with the "YSlow" add-on for Firefox.
* Timings on your server may differ slightly because of differences in hardware configuration.
* Each time interval includes the run time of the web script, reading the results (JSON data) by cURL from a PHP script, and generating the web page with the data.

10 nodes  - ~2-3s
100 nodes - ~9-10s
200 nodes - ~19s
270 nodes - ~25s

90% of this time is simply examining one property of each node!

I also have three questions for everyone:
1) Do you sort your search results?
2) How do you sort search results by tokenised fields?
3) If you use the standard JS "sort" function, how long does it take to sort around 200-300 results?

Thanks,
Oleg Koval

dwilson
Champ in-the-making
I also see the same problem.

Interestingly it seems some attributes of the ScriptNode object are retrieved very quickly, but some are retrieved very slowly.  Why is this, and is there no way to programmatically look at individual properties on a node array without waiting almost 100 milliseconds for EACH property examined?  Looking at the results below, to view a node property it takes a full two orders of magnitude longer than viewing say, the node type.

(For testing I've used a sample URL-executable script placed in Data Dictionary/Scripts.)

// In my test case, this query returns 394 nodes.
var query = "TYPE:\"{http://www.alfresco.org/model/content/1.0}content\"";
var total = "";
var out = "";

if( query != null && query.length > 0 ) {
   var result = search.luceneSearch(query); // fast
   total = result.length;
   
   for (var i = 0; i < result.length; i++) {

      // let's take a look at how long the script takes on various ScriptNode object attributes.
      // I uncommented each of these one by one and measured how long it took to print out that
      // individual ScriptNode attribute:

      //out += result[i].id +"<br>";          // ~  0.3 seconds on 394 nodes.
      //out += result[i].nodeRef +"<br>";     // ~  0.3 seconds on 394 nodes.
      //out += result[i].displayPath +"<br>"; // ~  0.6 seconds on 394 nodes.
      //out += result[i].qnamePath +"<br>";   // ~  0.4 seconds on 394 nodes.
      //out += result[i].isLocked() +"<br>";  // ~  0.3 seconds on 394 nodes.
      //out += result[i].type +"<br>";        // ~  0.4 seconds on 394 nodes.
      //out += result[i].parent.id +"<br>";   // ~  0.3 seconds on 394 nodes.
      //out += result[i].isCategory +"<br>";  // ~  0.2 seconds on 394 nodes.      
      //out += result[i].aspects +"<br>";     // ~  0.4 seconds on 394 nodes. !!! fast.
      //out += result[i].size +"<br>";        // ~ 27.4 seconds on 394 nodes.
      //out += result[i].url +"<br>";         // ~ 28.1 seconds on 394 nodes.
      //out += result[i].downloadUrl +"<br>"; // ~ 28.2 seconds on 394 nodes.
      //out += result[i].name +"<br>";        // ~ 27.8 seconds on 394 nodes.
   }
}

out += "<br><br>";
out += "<b>query:</b> "+query+"<br>";
out += "<b>total:</b> "+total+"<br>";
out;

It seems the properties array simply isn't loaded into the list of nodes returned by the Lucene search.
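Rather than uncommenting one line at a time, the same measurement could be wrapped in a small helper. This is a rough sketch only: timeAccess is a hypothetical name, the fake nodes are stand-ins for ScriptNode, and millisecond timestamps are coarse, so short runs will report 0.

```javascript
// Sketch: time how long it takes to apply one accessor to every node.
// timeAccess is a hypothetical helper name, not part of the Alfresco API.
function timeAccess(nodes, accessor) {
   var start = new Date().getTime();
   var count = 0;
   for (var i = 0; i < nodes.length; i++) {
      accessor(nodes[i]); // the attribute read being measured
      count++;
   }
   return { count: count, elapsedMs: new Date().getTime() - start };
}

// Plain objects standing in for ScriptNode instances.
var fakeNodes = [{ id: "a", name: "one" }, { id: "b", name: "two" }];
var timing = timeAccess(fakeNodes, function (node) { return node.name; });
```

In the script above, each commented-out line would become one call, for example timeAccess(result, function (n) { return n.name; }), and the elapsedMs values could be appended to out.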

kevinr
Star Contributor
The Alfresco Lucene API can perform some sorting for you, and we provide access to that through the scripting API:
Array luceneSearch(string query, string sortColumn, boolean asc)
    Returns an array of ScriptNode satisfying the search criteria sorted by the specified sortColumn (the property name to sort on) and asc (true => ascending order, false => descending order). For example var nodes = search.luceneSearch("TEXT:alfresco", "@cm:modified", false);

If you need to sort by more than one column then the Script API does not yet provide this. But I can easily add it for Alfresco 3.2 since it has now been requested :)

The reason some properties take longer than others to retrieve is that some properties are easily resolvable from data already cached on the ScriptNode instance - and some data must be retrieved directly from the repository (and then cached). Accessing individual properties of 1000's of nodes via the ScriptNode API is not going to be as fast as writing some Java code to do it - as the ScriptNode API calls must always pass through all levels of Permissions and Public Service Interceptors etc. for each call (Java code does not always need to do this…)

Thanks,

Kevin

dwilson
Champ in-the-making
Thanks for the reply, Kevin!

The Alfresco Lucene API can perform some sorting for you, and we provide access to that through the scripting API:
Array luceneSearch(string query, string sortColumn, boolean asc)
    Returns an array of ScriptNode satisfying the search criteria sorted by the specified sortColumn (the property name to sort on) and asc (true => ascending order, false => descending order). For example var nodes = search.luceneSearch("TEXT:alfresco", "@cm:modified", false);
Ah, this was used at first but was not returning correct results with tokenized fields, e.g. cm:title because each token was being considered separately.
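To illustrate why a per-token sort key gives a surprising order, here is a toy model in plain JavaScript. It is not Lucene's implementation: it simply assumes that each document ends up with one token as its sort key (here, arbitrarily, the alphabetically last one) and compares that with sorting on the whole title. All function names are made up for the illustration.

```javascript
// Toy model only: compares sorting on the whole value with sorting on a single token.
function wholeValueKey(title) {
   return title.toLowerCase();
}

// Assumption for illustration: the sort key is one token, the alphabetically last.
function singleTokenKey(title) {
   var tokens = title.toLowerCase().split(/\s+/);
   tokens.sort();
   return tokens[tokens.length - 1];
}

function sortTitles(titles, keyFn) {
   var copy = titles.slice(0); // leave the input untouched
   copy.sort(function (a, b) {
      var ka = keyFn(a);
      var kb = keyFn(b);
      return ka < kb ? -1 : (ka > kb ? 1 : 0);
   });
   return copy;
}

var titles = ["Monthly Budget", "Annual Zebra Report"];
var byWhole = sortTitles(titles, wholeValueKey);  // "Annual Zebra Report" first
var byToken = sortTitles(titles, singleTokenKey); // "Monthly Budget" first ("monthly" < "zebra")
```

Which token actually wins in a real index depends on the Lucene version and analyzer, so the sketch only shows the kind of mismatch, not the exact order Alfresco would return.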

If you need to sort by more than one column then the Script API does not yet provide this. But I can easily add it for Alfresco 3.2 since it has now been requested :)
That's great, thanks Kevin. Though while we're prioritizing features for Alfresco 3.2, I'd place these above multi-column sort, as they are likely even more common needs:

  • Sorting correctly on a tokenized field
  • Returning only a specified page worth of all result data (e.g. results 21 through 30)
  • Perhaps with the now small set of data thanks to the above paging, all properties of those 10 nodes can be fetched & cached? (Although 100ms X 10 isn't quite as terrible.)
The reason some properties take longer than others to retrieve is that some properties are easily resolvable from data already cached on the ScriptNode instance - and some data must be retrieved directly from the repository (and then cached). Accessing individual properties of 1000's of nodes via the ScriptNode API is not going to be as fast as writing some Java code to do it - as the ScriptNode API calls must always pass through all levels of Permissions and Public Service Interceptors etc. for each call (Java code does not always need to do this…)
In order to tap into the power of the Java API, from the webscript would we call the Java API like this?  (Or is there another more standard way?)
http://wiki.alfresco.com/wiki/3.0_JavaScript_API#Native_Java_API_Access

Thanks for your help!
Dave

kevinr
Star Contributor
Hi,

I'll talk to our Lucene guy today to see what the issue is around sorting on tokenized fields. If we can fix that or find you a solution then hopefully you won't have to go the Java route. Yes, that is the right link for integrating Java calls into secure WebScripts, but it's not a nice solution; if you can avoid it, I would.

FYI "Returning only a specified page worth of all result data" - we need the repo to support paged resultsets for this to work - but currently it does not, so it just ends up re-querying and walking to page N - which is not much use. Paged ResultSet support is coming soon though…
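Until paged result sets exist, "walking to page N" amounts to producing the whole result list and then cutting one page out of it. A minimal sketch of that slicing step, with pageOf as a made-up helper name and strings standing in for nodes:

```javascript
// Sketch: cut one page out of a fully materialised result list.
// pageOf is an illustrative helper, not part of the Alfresco API.
function pageOf(allResults, skipCount, maxItems) {
   return allResults.slice(skipCount, skipCount + maxItems);
}

// 45 stand-in results named "result 1" .. "result 45".
var all = [];
for (var n = 1; n <= 45; n++) {
   all.push("result " + n);
}

var page3 = pageOf(all, 20, 10);    // results 21 through 30
var lastPage = pageOf(all, 40, 10); // only 5 items remain
```

The query cost is unchanged, since all 45 results are still fetched; the benefit is only that any slow per-node property reads can be limited to the 10 nodes on the page.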

Kev

serverok
Champ in-the-making
Thank you for the reply, Kevin.

I'll talk to our Lucene guy today to see what the issue is around sorting on tokenized fields. If we can fix that or find you a solution then hopefully you won't have to go the Java route. Yes, that is the right link for integrating Java calls into secure WebScripts, but it's not a nice solution; if you can avoid it, I would.

If we use "external" sorting functions, we need to read the value of the property we sort on, and then we are back in the situation above (reading the value of some node properties is very slow). Sorting by both non-tokenised and tokenised fields in Lucene would be a VERY GOOD solution.

Thank you for your help.

Oleg.

kevinr
Star Contributor
FYI a new JavaScript Search API has been added to Alfresco 3.2 (should be in the next nightly build or now in HEAD SVN)

Supports multi-column sorting, paging (once added to the underlying search API - before 3.2 final) and the new alfresco-fts search language:
http://wiki.alfresco.com/wiki/Full_Text_Search_Query_Syntax

- Query, language (lucene, xpath, jcr-path and alfresco-fts etc), store (workspace or avm), multi-column sorting and paging all supported via search definition object

- A query definition object with a number of parameter objects can be as simple to use as:

   var results = search.query({query: "TEXT:alfresco"});

- Or as richly defined as:

   var sort1 =
   {
      column: "@{http://www.alfresco.org/model/content/1.0}modified",
      ascending: false
   };
   var sort2 =
   {
      column: "@{http://www.alfresco.org/model/content/1.0}created",
      ascending: false
   };
   var paging =
   {
      maxItems: 100,
      skipCount: 0
   };
   var def =
   {
      query: "cm:name:test*",
      store: "workspace://SpacesStore",
      language: "fts-alfresco",
      sort: [sort1, sort2],
      page: paging
   };
   var results = search.query(def);
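For page-by-page browsing, a definition like the one above could be built from a page number. The sketch below is only an illustration: buildSearchDef is a hypothetical helper, the field names are copied from the example above, and it assumes paging behaves as described once the underlying search API supports it.

```javascript
// Sketch: build a search definition object for a given 1-based page number.
// buildSearchDef is a hypothetical helper; the field names follow the example above.
function buildSearchDef(query, sortColumns, pageNumber, pageSize) {
   var sorts = [];
   for (var i = 0; i < sortColumns.length; i++) {
      sorts.push({
         column: sortColumns[i].column,
         ascending: sortColumns[i].ascending === true
      });
   }
   return {
      query: query,
      store: "workspace://SpacesStore",
      language: "fts-alfresco",
      sort: sorts,
      page: {
         maxItems: pageSize,
         skipCount: (pageNumber - 1) * pageSize // page 3 of 10 skips 20 items
      }
   };
}

var def3 = buildSearchDef(
   "cm:name:test*",
   [{ column: "@{http://www.alfresco.org/model/content/1.0}modified", ascending: false }],
   3,
   10
);
// Inside a web script, def3 would then be passed to search.query(def3).
```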

Kev

dwilson
Champ in-the-making
Kevin - That sounds perfect, I can't wait!!

samuel_penn
Champ in-the-making
According to the wiki docs, in 3.2 the sort parameter is only available on search.luceneSearch(). If I need to search on an AVM store, then store.luceneSearch() doesn't have this option. Is this an oversight in the docs, or are these new snazzy options missing from AVM searches? If the latter, could they be added?

Thanks,
Sam.