Lucene tokenization

morgand
Champ in-the-making
Searches for files with underscores in the file name are currently unpredictable and return no results in some cases. How can I prevent indexing from tokenizing file names with underscores into separate tokens?

morgand
Champ in-the-making
So, which Alfresco version are you working with?

In my Labs 3.0 the name field is declared with <tokenised>both</tokenised>, which should mean that both the single tokens and the complete name will be stored:


<property name="cm:name">
   <title>Name</title>
   <type>d:text</type>
   <mandatory enforced="true">true</mandatory>
   <index enabled="true">
      <atomic>true</atomic>
      <stored>false</stored>
      <tokenised>both</tokenised>
   </index>
   <constraints>
      <constraint ref="cm:filename" />
   </constraints>
</property>

Might <tokenised>both</tokenised> solve your problem?

Besides this, I noticed a serious problem with the indexing of 'd:text' and 'd:content' fields. In some cases (maybe during index merge processes) the indexed content is truncated, so the creator "admin" is cropped to "admi" in the index and is no longer searchable with "admin". I'm currently digging deeper into this.


Hello, thanks for the response! I'm using 2.1.5. Do you know if <tokenised>both</tokenised> is supported there? I added it to the filename attribute in contentModel.xml and then added a new document to the repository, but it didn't solve the problem.

maqsood
Confirmed Champ
Hi morgand,

Sorry, that was mistakenly posted in your thread.  :cry:
I started a new topic for my query once I realized my mistake.

nvsreeram
Champ in-the-making
Morgand,

Regarding:
Searches for files with underscores in the file name are currently unpredictable and returning no results in some cases……

I found the behavior is quite predictable but weird.
If you have not already tried it, you could set this on your custom property "filename_property":

<stored>true</stored>
<tokenised>both</tokenised>

This would let the field values be stored in the index, so you can then inspect the Lucene index with Luke - http://www.getopt.org/luke/
That would give you an idea of how the fields are being tokenized and how you could search.

I did the same and observed the following:

1. test_name => "test", "name"
2. test_my_name => "test", "my", "name"
3. test_name10 => "test_name10"
4. test_my_name10 => "test", "my_name10"
5. test_again_my_name10 => "test", "again", "my_name10"

I haven't tried test_10 yet.
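
The pattern in those five observations can be written down as a small sketch. This is not Lucene's or Alfresco's code. It is a Python emulation under one assumption: runs of letters and digits joined by "_" stay together only when every second run contains a digit. The names `tokenize`, `A`, `H`, `P` and `PATTERNS` are made up for illustration, only the underscore is modelled as a joining character, and stop words are ignored.

```python
import re

A = r"[A-Za-z0-9]+"                    # any run of letters/digits
H = r"[A-Za-z0-9]*[0-9][A-Za-z0-9]*"   # a run that contains at least one digit
P = r"_"                               # only the underscore is modelled here

# Assumed rule (hypothetical, fitted to the observations above):
# runs joined by "_" form one token only if every second run has a digit.
PATTERNS = [re.compile(p) for p in (
    A,
    rf"{A}{P}{H}(?:{P}{A}{P}{H})*",
    rf"{H}{P}{A}(?:{P}{H}{P}{A})*",
    rf"{A}(?:{P}{H}{P}{A})+",
    rf"{H}(?:{P}{A}{P}{H})+",
)]


def tokenize(name):
    """Split a file name into tokens using the longest match at each position."""
    tokens, pos, text = [], 0, name.lower()
    while pos < len(text):
        matches = [m for m in (p.match(text, pos) for p in PATTERNS) if m]
        if not matches:
            pos += 1  # separator or other character that cannot start a token
            continue
        best = max(matches, key=lambda m: m.end())
        tokens.append(best.group())
        pos = best.end()
    return tokens


for sample in ("test_name", "test_name10", "test_my_name10"):
    print(sample, "=>", tokenize(sample))
```

Under the same assumed rule, test_10 would stay a single token, but that is only what the sketch predicts, not something checked against the index in Luke.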

morgand
Champ in-the-making
OK, I'm trying to search with Luke, but I have a few nagging questions.

A)  When choosing an index to load into Luke, I look in ..\alfresco_data\alf_data\lucene-indexes\workspace\SpacesStore. Why are there 5-10 different folders in there?

B)  When I choose an index and look at the available fields in Luke, I don't see filename. Why not?

nvsreeram
Champ in-the-making
Regarding:
A) When choosing an index to load into Luke, I look in ..\alfresco_data\alf_data\lucene-indexes\workspace\SpacesStore. Why are there 5-10 different folders in there?

I don't know of a concrete reason for this. But I've noticed whenever you do a full recovery of the index, the folders are replaced by a single one.
I suppose Alfresco likes to spread the index into multiple folders and upon optimization (optimizing the indexes or creating a fresh index from scratch) it merges the multi-folder index into a single folder.

Regarding:
B) When I choose an index and look at the available fields in Luke, I don't see filename. Why not?

There is no field called filename (unless you create a custom one). By default, Alfresco stores the tokenized file name (of the article) in this field: @cm:name or @{http://www.alfresco.org/model/content/1.0}name

That said, I am still trying to understand your actual requirement.

morgand
Champ in-the-making
Thanks for the response.

My basic requirement is to split filenames with underscores into separate tokens.  The StandardTokenizer seems to handle underscores in strange ways, sometimes splitting them and sometimes not.

nvsreeram
Champ in-the-making
And I assume that would be to search by filename.
If that's the case, you can try out searching by ID (just to check if that's a fit for your need).

Let's consider an example (I am following an arbitrary path structure):
ID = testsitecom:/www/avm_webapps/ROOT/_content/en_US/testContentType/test_Content.xml
@{http://www.alfresco.org/model/content/1.0}name = test, content, xml (tokenized by underscore and dot)

You can search this way (wildcard search):
ID:testsitecom\:"/www/avm_webapps/ROOT/_content/en_US/testContentType/test*xml"

instead of name search:
@cm\:name:test*xml
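
If you build such queries in code, the backslash before the colon in the field name is easy to forget. Here is a minimal Python sketch of assembling the name query above. The helper names `escape` and `wildcard_name_query` are hypothetical, and the character set is my assumption based on the classic Lucene query parser's special characters, so check it against the Lucene version you run.

```python
# Characters the classic Lucene query parser treats as special (assumed set).
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\')


def escape(text):
    """Prefix every special character with a backslash."""
    return "".join("\\" + c if c in LUCENE_SPECIAL else c for c in text)


def wildcard_name_query(prefix, suffix, field="@cm:name"):
    """Build field:prefix*suffix, escaping everything except the wildcard."""
    return f"{escape(field)}:{escape(prefix)}*{escape(suffix)}"


print(wildcard_name_query("test", "xml"))
```

The wildcard itself is left unescaped on purpose. Only the field name and the literal parts of the file name go through `escape`.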

andy
Champ on-the-rise
Hi

The Lucene standard analyser, which we wrap, does indeed do some funny things.
We did not appreciate this in the dim and distant past. It is now a pain to change this default, as everyone would be forced to reindex.

The standard analyser tries to auto-detect dates, computer names, emails, product codes, acronyms, etc., and may end up grouping tokens together when they are separated by characters such as / - . among others.
However, it is also good as a general cross-language default.

The way to avoid this is to use another analyzer
OR
tokenize as both and use Alfresco FTS with "=" to force the use of the untokenised field, combined with pattern matching.

Andy