This page is obsolete. The official documentation is at: http://docs.alfresco.com
3.0 Search
Reviewed for 3.3
Design discussion for the FTS in 3.0. Refer to Full Text Search Query Syntax for the implemented FTS.
Background
Currently we expose the default Lucene query parser syntax for full text search support.
This excludes some advanced Lucene features, such as span queries, which we would like to expose. It also ties us to the Lucene query syntax, which may not embed well if we were to expose a SQL query language. It is also additional work to upgrade, as we have our own customisations to the query parser.
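For example, a constraint like "yellow within three words of banana, in order" can only be expressed through the programmatic span API, not through the default query parser syntax. A minimal sketch, assuming the Lucene 2.x-era API this generation of Alfresco builds on (the field name content is illustrative):
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.spans.SpanNearQuery;
 import org.apache.lucene.search.spans.SpanQuery;
 import org.apache.lucene.search.spans.SpanTermQuery;
 
 // "yellow" within 3 positions of "banana", in order - there is no
 // equivalent in the default Lucene query parser syntax.
 SpanQuery yellow = new SpanTermQuery(new Term("content", "yellow"));
 SpanQuery banana = new SpanTermQuery(new Term("content", "banana"));
 SpanQuery near = new SpanNearQuery(new SpanQuery[] { yellow, banana }, 3, true);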
The Data Dictionary (DD) defines indexing behaviour by binding a Lucene tokeniser to index types. So there is only basic control of indexing per property (indexing on/off, tokenisation on/off).
Possible DD extensions for indexing a property
This covers features found across many implementations and which of those features we may support and how we might do it. In particular, what we would have to store in the index to support this.
- Indexed
- as it is now
- Index priority
- FTS importance - not really of much use, as Lucene cannot update a document in place and requires a delete and add. Indexing twice will be enough! We could prioritise documents to be FTS indexed in preference to others.
- Pluggable indexers in addition to the core
- Add an API to allow configurable index extensions. Add your own fields to the Lucene index.
- Will require support to do a cascade update for the most common use case (tagging that a file is in some path)
- Performance improvements and caching for path searches may work just as well
- The other use case is XML metadata extraction without populating alfresco properties
- Orderable
- Support for sort, which may overlap with use as an identifier and support for FTS
- Included with tokenisation
- Tokenised
- We should be able to support FTS and SQL-like pattern matching. The tokenisation requirements are different, so an attribute may need indexing twice (see the sketch after this list). The identifier-like indexing may also be more appropriate for ordering.
- This will be a comma-separated, case-insensitive list of what is required of tokenisation.
- ID, FTS, SORT
- BOTH -> FTS, SORT
- TRUE -> FTS (backward compatibility and default)
- FALSE -> ID, SORT
- ID and SORT are distinguished to separate IDs that do not support sort
- Currently we support BOTH, TRUE and FALSE
- Boost
- We do not set document or field boosts at index time. Changing a field boost would require a reindex so all documents with the field have the same boost setting.
- case sensitivity
- Depends on the analyzer
- diacritics
- Depends on the analyzer
- stemming
- Depends on the analyzer
- thesaurus/synonyms
- Depends on the analyzer
- stop words
- Depends on the analyzer
- language/localisation
- localisation is driven by:
- the locale of each value for a multi-lingual text property
- the locale set on d:content types
- if unset, in order:
- the locale set on the node
- the locale of the user
- the server locale
- wildcards
- Defined by the tokeniser and tokenisation properties
- position/ordering
- This information is held in Lucene as an offset from the previous token
- window/range/distance
- This information is available in the index
- scope
- sentence/paragraph/page/chapter
- This information is not available in the index (it does not store the token type)
- We would have to add this information somehow
- anchoring
- start/end/...
- start is easy - end is not unless we add special support
- cardinality / number of occurrences
- This information is held in the index (and is used for scoring)
- exclusion
- Documents are excluded via permissions
- documents like this
- We do not store term vectors. This could be an option for 'more like this'
- Doing this analysis on the fly would be expensive.
- cross language support
- We would have to tokenise again without stop words to make this sensible
- This is a lot of extra work (we just use the tokens each tokeniser generates at the moment)
- We could use the exact text rather than the token and put these through the standard analyser with no stop words. Each language would then add the words it considers meaningful in some common form without stemming etc.
- Index time versus search time
- We should expose this to our analyser wrappers (so synonym generation, if we had it, could be index or search side only, and not both). FTS token generation already does this in a weak way ...
- Tokenisation bundle.
- On a DD property specify the name of a tokenisation bundle to use
- Will pick up the tokeniser it defines by locale and property type
- Allows mixed tokenisation, property specific tokenisation etc
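A sketch of what indexing a property twice might look like, assuming the Lucene 2.x Field API (the field names are illustrative, not the exact names Alfresco uses in its index):
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 
 // One property added twice: tokenised for FTS, untokenised for
 // ID matching and sorting. Field names are illustrative only.
 Document doc = new Document();
 String value = "Big Yellow Banana";
 doc.add(new Field("@cm:name", value, Field.Store.NO, Field.Index.TOKENIZED));
 doc.add(new Field("@cm:name.sort", value, Field.Store.NO, Field.Index.UN_TOKENIZED));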
Query time options
This covers features found across many implementations and which of those features we may support and how we might do it. In particular, what we can do at search time to support this.
- boost
- can be set at query time for each individual query (see the sketch after this list)
- thesaurus/synonyms
- dependent upon the analyzer
- languages
- as selected - more languages generate more language-specific tokens
- position/ordering
- Supported within Lucene (phrases, proximity, span)
- window/range/distance
- Supported within Lucene (proximity, span)
- scope sentence/paragraph/page/chapter
- not supported
- anchoring start/end/...
- start is possible, end would need special index support
- cardinality / number of occurrences
- Included in the scoring
- We could expose as a specific part of the query language
- exclusion
- via ACL
- documents like this
- See indexer support
- cross language support
- See indexer support
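A minimal sketch of a per-query boost at search time, assuming the Lucene 2.x API (field names illustrative): hits in the title are weighted above hits in the content, with no reindexing involved.
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.BooleanClause;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.TermQuery;
 
 // Boost is set on the query instance at search time - no reindex needed.
 TermQuery title = new TermQuery(new Term("title", "banana"));
 title.setBoost(4.0f); // title hits count four times as much
 TermQuery body = new TermQuery(new Term("content", "banana"));
 BooleanQuery q = new BooleanQuery();
 q.add(title, BooleanClause.Occur.SHOULD);
 q.add(body, BooleanClause.Occur.SHOULD);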
Pluggable indexing
Support to add customer-defined indexing and search behaviour.
Allow additional, user-defined fields in the index,
e.g. indexing of XML content via extraction based on DTD definitions (which could be done when content or metadata is indexed).
Requires node context (path etc.) to be available.
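A hypothetical shape for such an extension point (the interface and method names are invented for illustration; only NodeRef and Document are real types):
 import org.alfresco.service.cmr.repository.NodeRef;
 import org.apache.lucene.document.Document;
 
 // Hypothetical extension point: implementations contribute extra fields
 // to the Lucene document built for a node. The interface is invented
 // for this sketch; it is not an existing Alfresco API.
 public interface IndexExtension
 {
     void addFields(NodeRef nodeRef, Document doc);
 }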
FTS Syntax
Based on Google with Lucene extensions.
Google like (Part 1)
- Search for a single term
- banana
- Search for conjunctions (the default)
- big yellow banana
- Search for disjunctions
- big OR yellow OR banana
- Search for phrases
- 'Boris the monkey eating a banana'
- Not
- yellow banana -big
- +
- the term is used as is
- no plurals
- no synonyms
- no stemming or tokenisation
- the word is not treated as a stop word
- Synonym expansion for a term
- ~big yellow banana
- Specify the field to search
- Google advanced operators
- field:term
- field:phrase
- TYPE:'cm:content'
- direct or some other exposure of Lucene fields via property QNames etc
- path, aspect support
- Proximity
- Google-style: terms separated by one or more words (see the sketch after the notes below)
- big * banana
- Range
- [#]..[#]
- Control
- order
- limit/paging
- Notes:
- To support Google + we would have to index stop words but mostly ignore them at search time.
- Google + conflicts with the Lucene use of the same token for required (AND should be sufficient)
- - is not allowed on its own (or reports no matches)
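The Google-style proximity above maps naturally onto a Lucene sloppy phrase query; a minimal sketch, assuming the Lucene 2.x API and an illustrative content field:
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.PhraseQuery;
 
 // big * banana : allow intervening words between the two terms.
 PhraseQuery pq = new PhraseQuery();
 pq.add(new Term("content", "big"));
 pq.add(new Term("content", "banana"));
 pq.setSlop(2); // permit up to two position moves between the terms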
Lucene Extensions (Part 2)
- Support AND for explicit conjunctions
- big AND yellow AND banana
- Wild cards for terms and within phrases
- Fuzzy matches
- term~
- Phrase proximity
- phrase~proximity
- Range queries (inclusive and exclusive)
- {# TO #}
- [# TO #]
- Query time boosts
- term^boost
- Not
- Also include ! and NOT
- grouping of query elements
- general
- (big OR large) AND banana
- field
- title:(big OR large) AND banana
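All of the above parses directly with the stock Lucene query parser; a small self-contained sketch, assuming the Lucene 2.x API and illustrative field names:
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.queryParser.ParseException;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.Query;
 
 public class GroupingExample
 {
     public static void main(String[] args) throws ParseException
     {
         // Grouping at the general level and within a single field.
         QueryParser parser = new QueryParser("content", new StandardAnalyzer());
         Query general = parser.parse("(big OR large) AND banana");
         Query fielded = parser.parse("title:(big OR large) AND banana");
         System.out.println(general + "\n" + fielded);
     }
 }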
Further extensions (Part 3)
- Explicit spans/positions
- start
- woof[^]
- end
- woof[$]
- separation
- yellow banana[0..2]
- Occurrences
- banana{2}
- banana{,2}
- banana{2,}
- banana{2,4}
- Positions
- Phrase (??)
- Sentence (s)
- yellow[^S] banana[S]
- Yellow at the start of a sentence that also contains the word banana
- Paragraph (p)
- Support to specify languages, tokeniser and thesaurus to use for given terms
- field:banana
- field:<en_uk>banana
- Notes:
- End will require special support. The most common requirement is to find files based on the name ending pattern. This can in fact be done (and is perhaps better) against the content mimetype which is already in the index.
- Positions look like a pain
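Start anchoring is easy because Lucene already has a span query bound to the beginning of the field; a sketch assuming the Lucene 2.x span API (field name illustrative):
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.spans.SpanFirstQuery;
 import org.apache.lucene.search.spans.SpanTermQuery;
 
 // woof[^] : "woof" must occur within the first position of the field.
 SpanFirstQuery start = new SpanFirstQuery(
         new SpanTermQuery(new Term("content", "woof")), 1);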
Alfresco FTS
See Full_Text_Search_Query_Syntax
Alfresco FTS Query Builder
Register query languages with a query builder that generates Alfresco FTS.
The search service will allow queries with languages like 'ui', 'rm', 'opensearch', 'google', 'share'.
The query will be processed by the query builder using the appropriate definition.
This definition includes:
- Components to expose
- Macros for term generation
- macro expansion to complex queries
- simple field mapping
- constraints
To be resolved ...
- support for well known namespaces (usability)
- name does not need to be prefixed
- name:text = cm_name:text
- support for property mappings to simple aliases (see the sketch after this list)
- name -> cm_name
- status -> my_aspect.my_property
- system wide property mappings
- persistable in queries
- user mappings (which cannot be persisted in saved queries)
- Persisted queries
- Remove and add user preferences for field mappings (out of scope here)
- TODO:
- mappings and where they are defined
- Date format handling + date functions (not included in CMIS)
- Locale handling
- Query constraints and functions e.g. TODAY + 2w
- FTS vs ID
- If both are available, when to use which:
- Exact match
- FTS match
- FTS pattern match
- SQL pattern match
- Expose direct (not embedded)
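A hypothetical sketch of the alias expansion a query builder might perform before generating Alfresco FTS (the class, method, and mapping table are all invented for illustration):
 import java.util.HashMap;
 import java.util.Map;
 
 // Hypothetical alias table for a query builder: unprefixed field names
 // are expanded to fully qualified ones before Alfresco FTS is generated.
 public class FieldAliasMap
 {
     private final Map<String, String> aliases = new HashMap<String, String>();
 
     public FieldAliasMap()
     {
         aliases.put("name", "cm_name");
         aliases.put("status", "my_aspect.my_property");
     }
 
     public String expand(String field)
     {
         String mapped = aliases.get(field);
         return mapped == null ? field : mapped; // unmapped fields pass through
     }
 }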
FTS vs Embedded vs RM
- Selector
- Embedded -> selector, implied single selector or error
- UI -> No selector
- RM -> No selector
- Fields
- Embedded - CMIS style (cm_content:'woof')
- UI - can use mappings to avoid namespacing
- RM - RM mappings?
- Field collision (see context above)
- Embedded - fully specified - no issue
- UI - fully specified - no issue
- UI - 'well known' or mapped - no issue
- UI - no prefix, matches local name in more than one namespace
- Error
- OR together
- Could distinguish by case
- Namespace search order
- RM - as UI - specific mappings?
- Default Field (part of context)
- Required for RM
- Already have this idea - contextual in some way?
- Simple search (part of context - could have a consistent SIMPLE field)
Resources
http://www.google.com/support/websearch/bin/answer.py?answer=136861
http://www.blackbeltcoder.com/Articles/data/easy-full-text-search-queries