topic Re: Lucene tokenization in Alfresco Archive

Lucene tokenization

morgand — Wed, 23 Sep 2009 15:59:59 GMT

Searches for files with underscores in the file name are currently unpredictable and return no results in some cases. How can I prevent the indexer from splitting file names with underscores into separate tokens?

Re: Lucene tokenization

morgand — Wed, 07 Oct 2009 14:47:30 GMT

I added some handling for underscores in the AlfrescoStandardFilter, but I'm still having problems when the underscore is followed by a single digit, as in testarticle_1.

Does anyone have any insight?

Re: Lucene tokenization

rliu — Wed, 07 Oct 2009 18:22:19 GMT

I encountered the same issue. Once you understand how Lucene indexes text, you will find that characters such as underscores and dashes are not included in the index.

One possible solution is to add a custom property on the node (of the content item) to capture the file name and tell Lucene not to tokenize this field. It would look like this:


<property name="xxx:filename_property">
    <description>Untokenised filename used by Lucene queries</description>
    <type>d:text</type>
    <mandatory>true</mandatory>
    <multiple>false</multiple>
    <index enabled="true">
        <tokenised>false</tokenised>
    </index>
</property>

Please confirm if this works.

Re: Lucene tokenization

morgand — Thu, 08 Oct 2009 13:39:48 GMT

This was actually the first thing I tried, but I found that because the filename wasn't tokenized, searches for partial filenames weren't coming back consistently.

Re: Lucene tokenization

rliu — Thu, 08 Oct 2009 16:01:24 GMT

What does your Lucene query syntax look like?

Re: Lucene tokenization

tdt — Fri, 16 Oct 2009 09:29:41 GMT

Hi,

Did you try this?

<index enabled="true">
    <atomic>true</atomic>
    <stored>false</stored>
    <tokenised>false</tokenised>
</index>

This is applied when you create new content. If you want it applied to existing content as well, you'll have to reindex Alfresco.
That's what I did, and it worked fine.

Regards

Re: Lucene tokenization

morgand — Tue, 20 Oct 2009 12:42:13 GMT

Yes, if I'm not mistaken that is basically the same thing rliu suggested. The issue with this solution is that filenames need to be tokenised. I added tokenisation behaviour to the standard filter so that it splits on underscores, but then we discovered that searches like "test_1" were not returning results properly.

Re: Lucene tokenization

dbachem — Tue, 27 Oct 2009 09:04:05 GMT

So, which Alfresco version are you working with?

In my Labs 3.0 the name field is declared with <tokenised>both</tokenised>, which should mean that both the individual tokens and the complete name are indexed:


<property name="cm:name">
    <title>Name</title>
    <type>d:text</type>
    <mandatory enforced="true">true</mandatory>
    <index enabled="true">
        <atomic>true</atomic>
        <stored>false</stored>
        <tokenised>both</tokenised>
    </index>
    <constraints>
        <constraint ref="cm:filename" />
    </constraints>
</property>

Might <tokenised>both</tokenised> solve your problem?

Besides this, I noticed a serious problem with the indexing of 'd:text' and 'd:content' fields. In some cases (maybe during index merge processes) the indexed content gets truncated, so the creator admin is cropped to "admi" in the index and is no longer searchable with "admin"! I'm currently trying to dig deeper into this.

Re: Lucene tokenization

maqsood — Tue, 27 Oct 2009 09:36:59 GMT

Hi,

Can anyone help me out?
I am using the web service API to search for a file in the Alfresco repository. Here's the code:


RepositoryServiceSoapBindingStub repositoryService = WebServiceFactory.getRepositoryService();

// Build a Lucene PATH query for the item named searchText directly under Company Home
Query query = new Query(Constants.QUERY_LANG_LUCENE, "PATH:\"/app:company_home/cm:" + searchText + "\"");

// Execute the query against the SpacesStore workspace store
final Store STORE = new Store(Constants.WORKSPACE_STORE, "SpacesStore");
QueryResult queryResult = repositoryService.query(STORE, query, false);

// Retrieve the result rows
ResultSet resultSet = queryResult.getResultSet();
ResultSetRow[] rows = resultSet.getRows();

I am passing the file name without its extension as searchText.

For example, suppose I have two files, File1.txt and file1.pdf, and I want to find both of them just by passing file1 as my searchText.

I tried exactly that, and the query returns nothing. When I search for File1.txt instead, the query returns that exact file.
How should I modify the above query to get the result I expect?

Any suggestion appreciated

Thanks in advance

Re: Lucene tokenization

morgand — Tue, 27 Oct 2009 16:01:50 GMT

Post a new thread instead of hijacking this one.

Re: Lucene tokenization

morgand — Tue, 27 Oct 2009 16:34:46 GMT

Hello, thanks for the response! I'm using 2.1.5. Do you know if <tokenised>both</tokenised> is supported there? I added it to the filename attribute in contentModel.xml and then added a new document to the repository, but it didn't solve the problem.

Re: Lucene tokenization

maqsood — Tue, 27 Oct 2009 19:14:36 GMT

Hi morgand,

Sorry, that was posted in your thread by mistake. :cry:
I had already started a new topic for my query when I realized my mistake.

Re: Lucene tokenization

nvsreeram — Tue, 27 Oct 2009 23:05:27 GMT

Morgand,

Regarding:

Searches for files with underscores in the file name are currently unpredictable and returning no results in some cases……

I found the behavior is quite predictable but weird.
If you have not already tried it, you should try this in your custom property "filename_property":

<stored>true</stored>
<tokenised>both</tokenised>

This stores the field values in the index, so you can then inspect the Lucene index with Luke - http://www.getopt.org/luke/
That would give you an idea of how the fields are being tokenized and how you could search.

I did the same and observed the following:

1. test_name => "test", "name"
2. test_my_name => "test", "my", "name"
3. test_name10 => "test_name10"
4. test_my_name10 => "test", "my_name10"
5. test_again_my_name10 => "test", "again", "my_name10"

I haven't tried test_10 yet.
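
The five observations above fit a simple rule: segments separated by an underscore stay glued together as long as every second segment in the run contains a digit. The sketch below is only a toy model of that rule in plain Java, written to reproduce the list above. It is not Lucene's tokenizer, the class and method names are made up for illustration, and it assumes names made of letters, digits and underscores only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the grouping seen in Luke. This is NOT Lucene's tokenizer. */
final class UnderscoreTokenModel {

    /** Returns true if the segment contains at least one ASCII digit. */
    static boolean hasDigit(String segment) {
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c >= '0' && c <= '9') {
                return true;
            }
        }
        return false;
    }

    /**
     * Splits a name on '_' and glues neighbouring segments back together for as
     * long as every second segment of the glued run contains a digit.
     * Assumes the name holds only letters, digits and underscores.
     */
    static List<String> tokenize(String name) {
        String[] segs = name.split("_");
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < segs.length) {
            if (segs[start].isEmpty()) {    // skip gaps left by "__" or a leading '_'
                start++;
                continue;
            }
            boolean evenOk = true;  // all segments at offsets 0, 2, 4... have a digit
            boolean oddOk = true;   // all segments at offsets 1, 3, 5... have a digit
            int end = start;        // last segment that still belongs to this token
            for (int j = start; j < segs.length && !segs[j].isEmpty(); j++) {
                boolean digit = hasDigit(segs[j]);
                if ((j - start) % 2 == 0) {
                    evenOk &= digit;
                } else {
                    oddOk &= digit;
                }
                if (j > start && !evenOk && !oddOk) {
                    break;          // neither alternation pattern holds any more
                }
                end = j;
            }
            tokens.add(String.join("_", Arrays.copyOfRange(segs, start, end + 1)));
            start = end + 1;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String name : new String[] {
                "test_name", "test_my_name", "test_name10",
                "test_my_name10", "test_again_my_name10", "test_10"}) {
            System.out.println(name + " => " + tokenize(name));
        }
    }
}
```

If the model is right, test_10 and testarticle_1 would each stay a single token, which would explain why a search for the part before the underscore misses them. That is a prediction of the model, not something verified in Luke.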

Re: Lucene tokenization

morgand — Wed, 28 Oct 2009 16:23:59 GMT

OK, I'm trying to search with Luke, but I have a few nagging questions.

A) When choosing an index to load into Luke, I look in ..\alfresco_data\alf_data\lucene-indexes\workspace\SpacesStore. Why are there 5-10 different folders in there?

B) When I choose an index and look at the available fields in Luke, I don't see filename. Why not?

Re: Lucene tokenization

nvsreeram — Fri, 30 Oct 2009 18:14:36 GMT

Regarding:

A) When choosing an index to load into Luke, I look in ..\alfresco_data\alf_data\lucene-indexes\workspace\SpacesStore. Why are there 5-10 different folders in there?

I don't know of a concrete reason for this, but I've noticed that whenever you do a full recovery of the index, the folders are replaced by a single one.
I suppose Alfresco spreads the index across multiple folders and, upon optimization (optimizing the indexes or rebuilding the index from scratch), merges them into a single folder.

Regarding:

B) When I choose an index and look at the available fields in Luke, I don't see filename. Why not?

There is no field called filename (unless you create a custom one). By default Alfresco stores the tokenized file name (of the article) in the field @cm:name, or @{http://www.alfresco.org/model/content/1.0}name

That said, I am still trying to understand your actual requirement.

Re: Lucene tokenization

morgand — Fri, 30 Oct 2009 18:36:27 GMT

thanks for the response.

My basic requirement is to split filenames with underscores into separate tokens. The StandardTokenizer seems to handle underscores in strange ways, sometimes splitting on them and sometimes not.

Re: Lucene tokenization

nvsreeram — Fri, 30 Oct 2009 23:57:35 GMT

And I assume that would be to search by filename.
If that's the case, you can try out searching by ID (just to check if that's a fit for your need).

Let's consider an example (I am following an arbitrary path structure):
ID = testsitecom:/www/avm_webapps/ROOT/_content/en_US/testContentType/test_Content.xml
@{http://www.alfresco.org/model/content/1.0}name = test, content, xml (tokenized by underscore and dot)

You can search this way (wild card search):
ID:testsitecom\:"/www/avm_webapps/ROOT/_content/en_US/testContentType/test*xml"

instead of name search:
@cm\:name:test*xml
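
Note the backslash before the colon in the property name in the query above. When the query string is assembled in code, a small helper can add that escape. The following is only a sketch in plain Java. The class and method names are hypothetical and not part of any Alfresco API, and the search pattern is passed through unchanged.

```java
/** Hypothetical helper; not part of the Alfresco API. */
final class NameQuery {

    /**
     * Builds a Lucene field query for a prefixed property such as "cm:name".
     * The ':' inside the property name is escaped so the query parser does not
     * read it as the field/term separator. The pattern is used as given.
     */
    static String forProperty(String prefixedName, String pattern) {
        return "@" + prefixedName.replace(":", "\\:") + ":" + pattern;
    }

    public static void main(String[] args) {
        // Prints the name search shown above.
        System.out.println(forProperty("cm:name", "test*xml"));
    }
}
```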

Re: Lucene tokenization

andy — Fri, 15 Oct 2010 14:35:58 GMT

Hi

The Lucene standard analyser, which we wrap, does indeed do some funny things.
We did not appreciate this in the dim and distant past. It is now a pain to change this default, as everyone would be forced to reindex.

The standard analyser tries to auto-detect dates, computer names, emails, product codes, acronyms, etc., and may end up grouping tokens together when they are separated by characters such as /, - and ., amongst others.
However, it is also good as a general cross-language default.

The way to avoid this is to use another analyzer
OR
tokenize as both, then use Alfresco FTS with "=" to force the use of the untokenised field, together with pattern matching.
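
To illustrate the first option, here is the token stream a replacement analyzer would need to produce for the file names discussed in this thread: every underscore, dot or dash acts as a separator, with no special grouping around digits. This is a plain Java sketch, not a Lucene Analyzer. The class name is made up, the lower-casing is an assumption, and wiring a real analyzer into Alfresco is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only: plain Java, not a Lucene Analyzer. */
final class SeparatorSplitter {

    /** Lower-cases the name and splits it on every run of characters that are neither letters nor digits. */
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        for (String part : name.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {      // a leading separator yields one empty string
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"testarticle_1", "test_my_name10", "test_Content.xml"}) {
            System.out.println(name + " => " + tokenize(name));
        }
    }
}
```

With this kind of splitting, testarticle_1 yields the tokens testarticle and 1, so a search for either part can match.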

Andy