Search and Lucene questions

wweber
Champ in-the-making
Greetings,

Looking at the "contentModel.xml" file, I can see the following for the cm:content type:
<type name="cm:content">
<title>Content</title>
<parent>cm:cmobject</parent>
<properties>
    <property name="cm:content">
       <type>d:content</type>
       <mandatory>false</mandatory>
       <!-- Index content in the background -->
       <index enabled="true">
         <atomic>true</atomic>
         <stored>false</stored>
         <tokenised>true</tokenised>
       </index>
    </property>
</properties>
</type>
I understand that the "index" element and its children are referring to the Lucene index, not a database index. I have the following questions:

1. If index enabled is set to "false", will this property not be searchable and/or retrievable via the SearchService?

2. If index enabled is set to "false", then can we assume that the "atomic", "stored", and "tokenised" settings are irrelevant?

3. Is the comment "Index content in the background" correct? I thought you would only index content in the background if "atomic" is false.

4. In a previous discussion, I learned that Lucene, accessed via the SearchService, is used strictly for searching, and that whenever data is retrieved from a ResultSetRow (ResultSetRow.getValue), Alfresco always gets the data from the database, not from Lucene. Why would you ever set <stored>true</stored>?

5. The Wiki at http://www.alfresco.org/mediawiki/index.php/Full-Text_Search_Configuration says the following: "The default is that properties are indexed, indexed atomically, that the property value is not stored in the index, and that the property is tokenised when it is indexed." Does that mean that if the entire "index" element and its children are missing under the "property" element these defaults will be used?

6. If I set "tokenised" to true and I have a property that has the string "Hello World", then the two words will be stored as two tokens with the "StandardAnalyzer", is that right? If I set "tokenised" to false, then the entire phrase "Hello World" will be stored and not broken up into individual word tokens, right? With "tokenised" set to true, I could search for either "Hello" or "World" and get a hit on this property. If I set "tokenised" to false, then I would have to search for the entire phrase "Hello World" to get a hit on this property, right? If I set "tokenised" to false, it seems like I'd be storing the entire text as is, which might take up as much space in the index as if I had set "stored" to true, is that right (my goal is to reduce index size)?

7. If a single property, either on the type itself or via an aspect applied to that type, has atomic set to true, will all properties be atomic, even if they have atomic set to false? In other words, can I store into the index in the background only if ALL properties (including aspects) have atomic set to false?

8. I have a stand-alone test client that is entering data in the repository and then exiting. I noticed that under the "…\alf_data\lucene-indexes\workspace\SpacesStore\delta" directory I am building a number of directories with a lot of numbers on them (like 8b41158a-5766-11da-af5f-cf53901d41b0). Some of these directories (most) only have a single file, the "segments" file, and the rest of them are totally empty. I noticed that if I sleep for 60 seconds before I exit my client, I don't build up a directory after entering a batch of data into the repository.  Are these directories here because their deletion is done in the background and I am exiting before they get a chance to get deleted? Will they ever get deleted? Would it hurt for me to delete them manually if they only have the "segments" file in them?

9. For my testing I have stored 1 million nodes with 12 properties for each node in the repository. I noticed that the "alf_data\lucene-indexes\workspace\SpacesStore\index" directory now has 19 files that end with ".cfs" that are about 100 MB in size. I have to assume that the more of these files I get, the slower my search response will be, because all of them will have to be examined in order to get back my search results. Is there a way to control the size of these files? Could I have better search performance if these files were 300 MB each and there were fewer of them? I noticed that the content of these files is largely composed of repeating QNames. Perhaps these are references to specific node IDs. It seems that one of the side effects of the long QNames might be significantly increasing the size of both the Lucene index and database files. If we are anticipating storing a lot of data and performance is important, would you recommend using short QNames? That is, don't use something like "http://www.mycompanyname.com/myproject/model/content/1.0" for our namespace but something like "content1.0" (if we are just using the repository within our company)? Of course, if we use any of your aspects or extend any of your types, we will still get those long QNames. One possibility might be having our own version of the aspects and types with shorter QNames. Either that or we could edit your contentModel.xml and change the namespace definition (would that make sense?).

Ok, thanks for your patience. Perhaps the answer to these questions will be helpful to others.

--------------------
Thanks again :)
Wayne Weber
Knight Ridder Digital

andy
Champ on-the-rise
Hi Wayne

1. If index enabled is set to "false", will this property not be searchable and/or retrievable via the SearchService?

2. If index enabled is set to "false", then can we assume that the "atomic", "stored", and "tokenised" settings are irrelevant?

If index enabled is set to "false", the property will not be indexed.
The other settings then have no effect, other than recording how you would like indexing to behave should you enable it later. If the property is used in a search, it will find no matches. The property will still be accessible from the result set, because the result set goes to Hibernate to recover the properties for a node.
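
As a rough illustration of the above, a property that is kept in the repository but left out of the Lucene index might be declared like this. The my:internalNote name is hypothetical and only there for the example; the element layout follows the cm:content definition quoted in the question.

```xml
<!-- Hypothetical property: kept in the repository, not indexed by Lucene -->
<property name="my:internalNote">
   <type>d:text</type>
   <mandatory>false</mandatory>
   <!-- With enabled="false" the child settings have no effect for now; they
        only record what should happen if indexing is switched on later -->
   <index enabled="false">
      <atomic>true</atomic>
      <stored>false</stored>
      <tokenised>true</tokenised>
   </index>
</property>
```

A search on my:internalNote would then find no matches, while the value would still come back when read from a result set row.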


3. Is the comment "Index content in the background" correct? I thought you would only index content in the background if "atomic" is false.

This comment is misleading. The index settings are there to make it easy to configure background indexing for people who do not want to index content atomically. If content transformations are expensive, you may want transformations and indexing done in the background.


4. In a previous discussion, I learned that Lucene, accessed via the SearchService, is used strictly for searching, and that whenever data is retrieved from a ResultSetRow (ResultSetRow.getValue), Alfresco always gets the data from the database, not from Lucene. Why would you ever set <stored>true</stored>?

It is possible to store properties if you want to query the same index directly with Lucene and recover those properties from it.

5. The Wiki at http://www.alfresco.org/mediawiki/index.php/Full-Text_Search_Configuration says the following: "The default is that properties are indexed, indexed atomically, that the property value is not stored in the index, and that the property is tokenised when it is indexed." Does that mean that if the entire "index" element and its children are missing under the "property" element these defaults will be used?

Yes
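
To make the defaults concrete, here is a small sketch with two hypothetical properties (my:shortForm and my:longForm are invented names). Going by the wiki text quoted above, both declarations should be indexed the same way: the first relies on the defaults, the second spells them out.

```xml
<properties>
   <!-- No <index> element: the documented defaults apply -->
   <property name="my:shortForm">
      <type>d:text</type>
   </property>

   <!-- The same defaults written out explicitly -->
   <property name="my:longForm">
      <type>d:text</type>
      <index enabled="true">
         <atomic>true</atomic>
         <stored>false</stored>
         <tokenised>true</tokenised>
      </index>
   </property>
</properties>
```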

6. If I set "tokenised" to true and I have a property that has the string "Hello World", then the two words will be stored as two tokens with the "StandardAnalyzer", is that right? If I set "tokenised" to false, then the entire phrase "Hello World" will be stored and not broken up into individual word tokens, right? With "tokenised" set to true, I could search for either "Hello" or "World" and get a hit on this property. If I set "tokenised" to false, then I would have to search for the entire phrase "Hello World" to get a hit on this property, right? If I set "tokenised" to false, it seems like I'd be storing the entire text as is, which might take up as much space in the index as if I had set "stored" to true, is that right (my goal is to reduce index size)?

Your summary of the search behaviour is correct.

The setting is there for cases where you do not want tokenisation, for example a primary key or external key that may contain characters that would otherwise be split into tokens. In time you will be able to set the tokeniser per property. If untokenised, each unique entry would be in the index. If stored, each entry would be in the index.
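
A minimal sketch of the key case described above, assuming a hypothetical my:externalKey property holding values such as "ORD-2005/0042" (both the name and the sample value are made up for illustration):

```xml
<!-- Hypothetical external key: indexed as a single term, not split into tokens -->
<property name="my:externalKey">
   <type>d:text</type>
   <mandatory>true</mandatory>
   <index enabled="true">
      <atomic>true</atomic>
      <stored>false</stored>
      <!-- false: a search has to supply the whole value to get a hit -->
      <tokenised>false</tokenised>
   </index>
</property>
```

With tokenisation left on, a value like that would likely be broken up at the punctuation, so a search for the exact key would not behave as a single-term match.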



7. If a single property, either on the type itself or via an aspect applied to that type, has atomic set to true, will all properties be atomic, even if they have atomic set to false? In other words, can I store into the index in the background only if ALL properties (including aspects) have atomic set to false?


No. All properties marked as atomic will be indexed in the transaction. The other properties will be indexed later. When the non-atomic properties are indexed, they will all be indexed at once and augment the atomically indexed properties. The index tracks the state of each item (not yet indexed, out of date, or up to date), so it knows which items were not fully indexed. You have no control over when things are indexed in the background: they are queued and processed in the order they were added to the index. Starting the application will find any outstanding things to index (for example, if the app was terminated before all indexing was brought up to date).
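
As a sketch of how atomic and non-atomic properties can sit side by side, consider a hypothetical my:report type (the type and property names are invented for this example). The identifier would be indexed inside the transaction, and the body text would be queued and indexed in the background.

```xml
<type name="my:report">
   <title>Report</title>
   <parent>cm:content</parent>
   <properties>
      <!-- Indexed in the transaction, so searchable as soon as it commits -->
      <property name="my:reportId">
         <type>d:text</type>
         <index enabled="true">
            <atomic>true</atomic>
            <stored>false</stored>
            <tokenised>false</tokenised>
         </index>
      </property>
      <!-- Indexed later in the background, augmenting the atomic entry -->
      <property name="my:reportBody">
         <type>d:text</type>
         <index enabled="true">
            <atomic>false</atomic>
            <stored>false</stored>
            <tokenised>true</tokenised>
         </index>
      </property>
   </properties>
</type>
```

Until the background pass catches up, a search on my:reportBody may miss a freshly added node even though my:reportId already matches.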

8. I have a stand-alone test client that is entering data in the repository and then exiting. I noticed that under the "…\alf_data\lucene-indexes\workspace\SpacesStore\delta" directory I am building a number of directories with a lot of numbers on them (like 8b41158a-5766-11da-af5f-cf53901d41b0). Some of these directories (most) only have a single file, the "segments" file, and the rest of them are totally empty. I noticed that if I sleep for 60 seconds before I exit my client, I don't build up a directory after entering a batch of data into the repository. Are these directories here because their deletion is done in the background and I am exiting before they get a chance to get deleted? Will they ever get deleted? Would it hurt for me to delete them manually if they only have the "segments" file in them?

If the repository is not running, there is no harm in deleting these. They are created to hold transactional updates to the index and are deleted after the transaction is committed or rolled back. The background indexing of non-atomic properties will also create index updates in its own transactions.

9. For my testing I have stored 1 million nodes with 12 properties for each node in the repository. I noticed that the "alf_data\lucene-indexes\workspace\SpacesStore\index" directory now has 19 files that end with ".cfs" that are about 100 MB in size. I have to assume that the more of these files I get, the slower my search response will be, because all of them will have to be examined in order to get back my search results. Is there a way to control the size of these files? Could I have better search performance if these files were 300 MB each and there were fewer of them? I noticed that the content of these files is largely composed of repeating QNames. Perhaps these are references to specific node IDs. It seems that one of the side effects of the long QNames might be significantly increasing the size of both the Lucene index and database files. If we are anticipating storing a lot of data and performance is important, would you recommend using short QNames? That is, don't use something like "http://www.mycompanyname.com/myproject/model/content/1.0" for our namespace but something like "content1.0" (if we are just using the repository within our company)? Of course, if we use any of your aspects or extend any of your types, we will still get those long QNames. One possibility might be having our own version of the aspects and types with shorter QNames. Either that or we could edit your contentModel.xml and change the namespace definition (would that make sense?).

Cool

The indexing behaviour can be controlled using the standard Lucene index parameters. These are set and explained in repository.properties. They can be set so that you always have an optimised index. You are correct: querying would be faster, but indexing would be slower. The number of Lucene segments is important, but so are deletions.

Some things are currently stored in the index that are not strictly needed but are useful to see what is going on. These will be tidied up. Things like the primary key to the database do need to be stored. The size of the index is an important issue and does affect indexing performance.

It is also possible that not all of those files are currently part of the index. Lucene can leave some old files around, which it tries to delete later.
These are listed in a "deletable" file in the directory. The files that are actually in use are listed in the "segments" file.

I hope this helps

Regards

Andy