Obsolete Pages{{Obsolete}}
The official documentation is at: http://docs.alfresco.com
Search Prototype
Revised Index structure
- UUID (The unique identifier for the node)
- FTS (The full text search entry for the node)
- PATH (The full path to the node. If there are multiple paths this can be repeated)
- QNAME (The fully qualified name of the node)
- ANCESTORID (A repeated field containing the IDs of all the node's ancestors, including itself)
- LEVEL (The depth of this node relative to the root - may be repeated)
- WORKSPACEID (The id of the workspace)
- Attributes as (Name=@ns:name Value=value and also Name=@ns: and Name=name)
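As a rough illustration of the field layout listed above, the following Python sketch builds the field/value pairs for a single node. It is not Lucene code. The function name `index_fields`, the path representation (a list of `(node_id, qname)` steps from the root down to the node) and the sample prefixes are assumptions made for the example, as is the reading that each attribute is indexed under three field names (`@ns:name`, `@ns:` and `name`).

```python
def index_fields(uuid, workspace_id, qname, paths, attributes, text=""):
    """Build a field -> list-of-values mapping for one node.

    paths:      list of paths; each path is a list of (node_id, qname)
                steps from the root down to this node (assumed layout).
    attributes: dict mapping (namespace_prefix, local_name) -> value.
    """
    fields = {
        "UUID": [uuid],
        "FTS": [text],
        "QNAME": [qname],
        "WORKSPACEID": [workspace_id],
        "PATH": [],        # repeated once per path
        "LEVEL": [],       # repeated once per path
        "ANCESTORID": [],  # every node on any path, including this node
    }
    for path in paths:
        fields["PATH"].append("/" + "/".join(name for _, name in path))
        fields["LEVEL"].append(len(path) - 1)  # the root is at depth 0
        for node_id, _ in path:
            if node_id not in fields["ANCESTORID"]:
                fields["ANCESTORID"].append(node_id)
    # Assumed reading of the attribute rule: index each value under the
    # full name, the namespace alone, and the local name alone.
    for (prefix, local), value in attributes.items():
        for field_name in (f"@{prefix}:{local}", f"@{prefix}:", local):
            fields.setdefault(field_name, []).append(value)
    return fields
```

A node reachable through two parents would then carry two PATH values and two LEVEL values, with ANCESTORID holding the union of the node IDs on both paths.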
To Add
- Categories
- Role based read access control
Proposed Special Property Types
These property types will require special indexing and tokenisation
QName
Path
Category
Security
Lucene Query Extensions
To execute complex structure expressions a new type of query is required.
- StructuredFieldQuery
- StructuredQueryElement
- Next Query
- Root
- Needs to iterate over entries
- How to mark the root entry
- Fixed Position
- Name + position
- Next fixed or relative clause
- Relative Position
- Name
- Offset or any
- Next relative clause
- Simple tokeniser
- Path is
- Depth
- NameSpace + Name (repeating)
- Optional end marker followed by other paths
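The outline above can be sketched in plain Python, independent of Lucene. The tokeniser emits the depth, then repeating namespace and name tokens, then an end marker before the next path. The matcher treats a query as a chain of elements that are either fixed (name at an absolute position) or relative (name at an offset from the previous match, or at any later position). The names `END`, `tokenise` and `matches`, the tuple layout of query elements, and the exact matching semantics are all assumptions for illustration, not the prototype's actual design.

```python
END = "<END>"  # hypothetical end-of-path marker token


def tokenise(paths):
    """Flatten paths into: depth, (namespace, name)*, END, depth, ..."""
    tokens = []
    for path in paths:
        tokens.append(str(len(path)))
        for namespace, name in path:
            tokens.extend([namespace, name])
        tokens.append(END)
    return tokens


def matches(names, elements, prev=-1):
    """Check a chain of query elements against one path.

    names:    the element names of one path, root first.
    elements: list of (kind, name, where) tuples, where kind is
              "fixed"    -> where is an absolute position, or
              "relative" -> where is an offset from the previous match,
                            or None for "any later position".
    """
    if not elements:
        return True
    kind, name, where = elements[0]
    if kind == "fixed":
        candidates = [where]
    elif where is None:
        candidates = range(prev + 1, len(names))
    else:
        candidates = [prev + where]
    return any(
        0 <= i < len(names)
        and names[i] == name
        and matches(names, elements[1:], i)
        for i in candidates
    )


path = ["sys:root", "cm:company", "cm:docs", "cm:report"]
descendant = [("fixed", "cm:company", 1), ("relative", "cm:report", None)]
child = [("fixed", "cm:company", 1), ("relative", "cm:report", 1)]
print(matches(path, descendant), matches(path, child))
```

In this sketch a relative element with offset 1 behaves like a child step and one with no offset behaves like a descendant step. How the root entry would be marked in a real index is left open, as in the outline.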
Comments
Issues
- Impact of renaming
- Impact of restructuring
- A bridge table does not make much sense
- Still have a big update problem - better to split the path in a different way ...
- Could use indirection for top level hierarchies
- Cannot search across two indexes that index the same docs without pulling out all docs and joining on the primary key.
- Do not see a sensible way of partitioning below the store level
Performance
- 5 Million Paths on my laptop
- Returning result sets of 1 or 2 million on simple paths in 1-3 seconds
- Indexing performance (99 attribute stored + one PATH as above * 5)
- 3 ms to add a document to an in memory index
- More efficient to then merge into the on disc index
- best is around 1 ms per doc
- This decreases as the size of the index increases
- 200,000 times = 1 million paths OK (10 iterations of appending 20000)
- 2M is more of a problem (10 million paths) decreasing (100 iterations appending 20000)
- slows to 16 ms/doc at the 20th iteration
The indexing performance could be due to the heavy common terms in attributes and similar paths.
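The document and path counts quoted above follow from simple arithmetic, sketched here in Python. The constants assume 5 paths per document (the "* 5" in the indexing setup) and batches of 20,000 documents per iteration. The function names are made up for the example, and the timing helper only multiplies the quoted per-document costs; it is not a new measurement.

```python
PATHS_PER_DOC = 5    # assumed from "one PATH as above * 5"
BATCH_SIZE = 20_000  # documents appended per iteration


def run_totals(iterations):
    """Return (documents, paths) indexed after the given number of iterations."""
    docs = iterations * BATCH_SIZE
    return docs, docs * PATHS_PER_DOC


def indexing_seconds(docs, ms_per_doc):
    """Total indexing time in seconds at a constant per-document cost."""
    return docs * ms_per_doc / 1000.0


print(run_totals(10))    # the 200,000 document run
print(run_totals(100))   # the 2M document run
print(indexing_seconds(BATCH_SIZE, 16))  # one batch at the slowed rate
```

At 1 ms per document a 20,000 document batch takes about 20 seconds, while at 16 ms per document the same batch takes over five minutes, which is why the larger run is described as a problem.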
Different machines, all with the same Java and command-line options
- My laptop
- Write to in memory index up to 20000 times (CPU limited)
- 2.06 ms per doc
- Merge 10 times to make 200000 (IO limited)
- 1.11 ms per doc
- Same laptop spec + mandrake 10.1
- Modo
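The batch-then-merge pattern measured above (fill an in-memory index, then merge it into the on-disk index) can be sketched as follows. This is a plain Python stand-in, not Lucene code: the class `BatchedIndexer` is hypothetical, and the two lists only play the roles of the in-memory and on-disk indexes.

```python
class BatchedIndexer:
    """Buffer documents in memory and merge them to 'disk' in batches."""

    def __init__(self, batch_size=20_000):
        self.batch_size = batch_size
        self.memory = []  # stands in for the in-memory index (CPU limited)
        self.disk = []    # stands in for the on-disk index (IO limited)
        self.merges = 0

    def add(self, doc):
        """Add one document, merging when the buffer is full."""
        self.memory.append(doc)
        if len(self.memory) >= self.batch_size:
            self.flush()

    def flush(self):
        """Merge whatever is buffered into the on-disk stand-in."""
        if self.memory:
            self.disk.extend(self.memory)
            self.memory.clear()
            self.merges += 1


indexer = BatchedIndexer(batch_size=3)
for i in range(7):
    indexer.add({"UUID": f"n{i}"})
indexer.flush()  # merge the final partial batch
print(indexer.merges, len(indexer.disk))
```

The point of the design is that the per-document cost is paid in memory, and the IO cost of touching the on-disk index is paid once per batch rather than once per document.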
Getting a document out of the above index
- 20,000 docs - all docs - 0.07 ms/doc
- 200,000 docs - all docs - 0.03 ms/doc
Todo:
- How fast to delete?
- How fast to optimise an index?
Search