Lucene tokenization

morgand
Champ in-the-making
Searches for files with underscores in the file name are currently unpredictable and, in some cases, return no results. How can I prevent the indexer from splitting file names with underscores into separate tokens?
17 REPLIES

morgand
Champ in-the-making
I added some handling for underscores in the alfrescostandardfilter, but I am still having problems when the underscore is followed by a single digit, as in testarticle_1.

Does anyone have any insight?

rliu
Champ in-the-making
I encountered the same issue. Once you understand how Lucene tokenizes text for indexing, you find that characters such as underscores and dashes are treated as separators and are not included in the indexed tokens.
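To make the effect concrete, here is a minimal Java sketch of a tokenizer that treats every non-letter, non-digit character as a separator. It is a simplified stand-in written for illustration, not Lucene's or Alfresco's actual analyzer; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a simplified stand-in for an analyzer that
// treats every non-letter, non-digit character as a token separator.
class SeparatorTokenizerSketch {

    // Splits the input into lowercase runs of letters and digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the current token; the separator itself is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("testarticle_1"));    // [testarticle, 1]
        System.out.println(tokenize("my-report_v2.pdf")); // [my, report, v2, pdf]
    }
}
```

With a tokenizer like this, the underscore never reaches the index, so a search for the literal string testarticle_1 has to be matched against the separate tokens testarticle and 1.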

One possible solution is to add a custom property to the node (the content item) that captures the file name, and to tell Lucene not to tokenize this field. It would look like this:


<property name="xxx:filename_property">
        <description>Untokenised filename used by Lucene queries</description>
        <type>d:text</type>
        <mandatory>true</mandatory>
        <multiple>false</multiple>
        <index enabled="true">
                <tokenised>false</tokenised>
        </index>
</property>

Please confirm if this works.

morgand
Champ in-the-making
This was actually the first thing I tried, but I found that because the filename wasn't tokenized, searches on partial filenames did not return results consistently.
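As a rough illustration of why an untokenised field behaves this way: the whole file name is indexed as a single term, so a plain term lookup only succeeds on the complete value, and partial matches need a prefix or wildcard style comparison. The Java sketch below models that with plain strings. The class and method names are hypothetical, and how Alfresco's query parser actually handles such queries is not covered here.

```java
// Hypothetical illustration only: models an untokenised field as one indexed
// term and compares exact term matching with prefix matching.
class UntokenizedMatchSketch {

    // Exact term match: succeeds only when the query equals the whole indexed value.
    static boolean termMatches(String indexedTerm, String query) {
        return indexedTerm.equals(query);
    }

    // Prefix match: roughly what a trailing-wildcard query such as testarticle* asks for.
    static boolean prefixMatches(String indexedTerm, String prefix) {
        return indexedTerm.startsWith(prefix);
    }

    public static void main(String[] args) {
        String indexed = "testarticle_1";
        System.out.println(termMatches(indexed, "testarticle_1")); // true
        System.out.println(termMatches(indexed, "testarticle"));   // false
        System.out.println(prefixMatches(indexed, "testarticle")); // true
    }
}
```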

rliu
Champ in-the-making
What does your Lucene query syntax look like?

tdt
Champ in-the-making
Hi,

Did you try this?

<index enabled="true">
        <atomic>true</atomic>
        <stored>false</stored>
        <tokenised>false</tokenised>
</index>

This will be applied when you create new content. If you want it applied to existing content, you'll have to reindex Alfresco.
That's what I've done, and it worked fine.

Regards

morgand
Champ in-the-making
Yes, if I'm not mistaken, that is basically the same thing rliu suggested. The issue with this solution is that filenames need to be tokenised. I added tokenisation behavior to the standardfilter so that it splits on underscores, and then discovered that searches like "test_1" were not returning results properly.

dbachem
Champ in-the-making
So, which Alfresco version are you working with?

In my Labs 3.0 installation the name field is declared with <tokenised>both</tokenised>, which should mean that both the individual tokens and the complete name are indexed:


<property name="cm:name">
   <title>Name</title>
   <type>d:text</type>
   <mandatory enforced="true">true</mandatory>
   <index enabled="true">
      <atomic>true</atomic>
      <stored>false</stored>
      <tokenised>both</tokenised>
   </index>
   <constraints>
      <constraint ref="cm:filename" />
   </constraints>
</property>

Might <tokenised>both</tokenised> solve your problem?
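A minimal Java sketch of what indexing "both" the tokens and the complete name could mean is shown below. It is an assumption-laden illustration, not Alfresco's implementation: the class and method names are hypothetical, the separator rule (anything other than a-z and 0-9) is simplified, lowercasing the complete value is an assumption, and the real index may keep the two forms in separate fields.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: collects the complete name plus its
// individual tokens, mimicking the idea behind <tokenised>both</tokenised>.
class TokenisedBothSketch {

    static Set<String> indexTerms(String name) {
        Set<String> terms = new LinkedHashSet<>();
        String lower = name.toLowerCase(Locale.ROOT);
        // The complete value, kept as a single term (lowercasing is an assumption).
        terms.add(lower);
        // The individual tokens, split on anything that is not a-z or 0-9.
        for (String token : lower.split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexTerms("test_1")); // [test_1, test, 1]
    }
}
```

Under these assumptions, a lookup for the full name test_1 and a lookup for the token test would both find a matching term.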

Besides this, I noticed a serious problem with the indexing of 'd:text' and 'd:content' fields. In some cases (possibly during index merge processes) the indexed content is truncated, so that the creator "admin" is cropped to "admi" in the index and can no longer be found by searching for "admin". I'm currently looking into this more deeply.

maqsood
Confirmed Champ
Hi,

Can anyone help me out?
I am using the web service API to search for a file in the Alfresco repository. Here's the code:

RepositoryServiceSoapBindingStub repositoryService = WebServiceFactory.getRepositoryService();

// Build a Lucene PATH query for the child of Company Home named by searchText
Query query = new Query(Constants.QUERY_LANG_LUCENE, "PATH:\"/app:company_home/cm:" + searchText + "\"");

// Execute the query against the workspace SpacesStore
final Store STORE = new Store(Constants.WORKSPACE_STORE, "SpacesStore");
QueryResult queryResult = repositoryService.query(STORE, query, false);

// Fetch the result rows
ResultSet resultSet = queryResult.getResultSet();
ResultSetRow[] rows = resultSet.getRows();

I am passing the file name without its extension as searchText.

For example, suppose I have two files, File1.txt and file1.pdf, and I want to find both of them just by passing file1 as my searchText.

When I tried this, the query returned nothing. When I searched for File1.txt instead, the query returned exactly that file.
How should I modify the query above to get the result I expect?
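One direction that may be worth trying is to query the name property with a trailing wildcard instead of using PATH, which presumably matches only the exact child name. The Java sketch below only builds such a query string. The `@cm\:name:"...*"` form is an assumption about the Alfresco Lucene query syntax that has not been verified against the poster's version, and the class and method names are hypothetical.

```java
// Hypothetical helper: builds a Lucene-style query string that matches the
// name property by prefix. The query syntax is an assumption, not verified
// against any particular Alfresco release.
class NameQuerySketch {

    static String buildNameQuery(String searchText) {
        // Escape backslashes and double quotes so the quoted phrase stays well-formed.
        String escaped = searchText.replace("\\", "\\\\").replace("\"", "\\\"");
        return "@cm\\:name:\"" + escaped + "*\"";
    }

    public static void main(String[] args) {
        System.out.println(buildNameQuery("file1")); // @cm\:name:"file1*"
    }
}
```

The resulting string would be passed as the second argument of the Query constructor in place of the PATH expression. Whether File1.txt and file1.pdf both match also depends on how the name field is analysed (for example, whether it is lowercased), which this sketch does not address.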

Any suggestion appreciated

Thanks in advance

morgand
Champ in-the-making
Post a new thread instead of hijacking this one.