The dream of achieving one record per content node

08-27-2005 05:28 AM
Friends,
I have a dream. The dream is to achieve one database record per content node (let's say "file" for simplicity; it is basically content).
Here is what I have thought of in my dream: I can change the content model to store all the metadata as one object, say in XML. Thus every content item in my dream has only one property - my custom object, or my custom XML.
I will write a plugin for Lucene so that Lucene understands my object (XML) and indexes the sub-properties that I want.
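Something like this rough sketch is what I mean (the XML layout and field names are made up for illustration, and it uses the plain Lucene API rather than an actual Alfresco plugin):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlMetadataIndexer {

    // Turns one metadata blob such as
    //   <meta><name>a.jpg</name><mimetype>image/jpeg</mimetype></meta>
    // into one Lucene Document, with one field per child element.
    public Document toLuceneDoc(String nodeId, String metadataXml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(metadataXml)))
                .getDocumentElement();

        Document doc = new Document();
        doc.add(new StringField("nodeId", nodeId, Field.Store.YES)); // exact-match key
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                // Tokenized, so each sub-property is individually searchable.
                doc.add(new TextField(child.getNodeName(),
                        child.getTextContent(), Field.Store.YES));
            }
        }
        return doc;
    }
}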
Do you think this is possible in Alfresco? Our system needs to be really scalable.
Do you think I am thinking in the wrong direction? Can this dream come true? Or do you think this is not required at all, and the system can handle 50 million files without using Oracle?
Vaibhav
Labels: Archive
08-27-2005 06:06 PM
Hi Vaibhav -
I think the dream of 50 million objects is not an illusion. Our target is ultimately to match Documentum's billion-object mark. But I don't think that can be reached without a database, even if it isn't Oracle.
Databases give us two things of fundamental benefit - transaction control and the separation of logical and physical schemas. They also give us fast access to various properties through indexing and caching techniques that we don't need to build into the system ourselves. Partitioning metadata across many machines and table partitions is another example of something that is much easier on top of a relational database.
When I discussed with Doug Cutting, the author of Lucene, the type of information we wanted to manage and how we wanted to use it, even he said that some things are best left to a database. Using a database gives us the ability to use tools designed for databases, such as replication and query tools. Relational languages have evolved to perform many types of aggregations, joins, and other operations that would be very hard to do without a database. We don't need to build these features into the system; we can just integrate them.
The fact that we use Hibernate actually gives us a lot of choice in how we store our metadata. At the moment we have struck a good balance between flexibility in what metadata is stored and the ability to access that information. Time and experience with applications will tell us whether we have the balance right. We have already swung further along the pendulum toward your vision than where we were with Documentum: we already store a lot of information in a serialized, though not necessarily XML, form.
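To make that concrete, here is a rough sketch of the kind of mapping Hibernate makes easy (the class, table, and column names are invented for illustration, not our actual schema): a few hot properties as real columns the database can index, with the long tail serialized into a single column - close to your one-record-per-node idea, but still inside a transactional store.

import java.io.Serializable;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Lob;
import javax.persistence.Table;

// Illustrative only - not Alfresco's real schema.
@Entity
@Table(name = "node")
public class NodeRecord implements Serializable {

    @Id
    @GeneratedValue
    private Long id;

    // Hot properties stay relational: indexable, joinable, partitionable.
    @Column(name = "name")
    private String name;

    @Column(name = "mimetype")
    private String mimetype;

    // The long tail of metadata rides along as one serialized blob,
    // handed to Lucene for sub-property indexing.
    @Lob
    @Column(name = "metadata_xml")
    private String metadataXml;
}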
I gather from your vision that you would like to store many things very quickly. How quickly, and by how many people, do you want to get them back out? The balance really depends on the type of applications you are building.
What applications have you developed, or are you considering developing? What types of information should be stored in XML versus retrievable through a relational interface? I believe it is possible to scale to the levels you describe using the current tools at hand, but it is hard to tell without specifics. It would be good to start a conversation on this.
-john

08-30-2005 02:41 AM
Thanks a lot for writing such a detailed reply. Indeed, striking the right balance is important and can be application-dependent.
In our domain, thousands of projects are created every year. Each project can have multiple folders (approximately 60 to 80), and the folders contain files. Each project can contain thousands of files.
Every project has a short life cycle - 4 to 5 months. Even when a project is over, it cannot be archived immediately; it can be archived only after a year.
Content is likely to be 90% images, 5% PDFs and docs, 1% text, and the rest JARs.
Everything is version-managed by default. This increases the volume even more.
Users should be able to annotate any file.
Metadata templates can vary. There will be a complicated permissions structure, and permissions will be inherited (unless they are specifically overridden).
About 500 users will use the system every day, and the daily transaction volume will be thousands of files checked out and thousands checked in.
Metadata search is absolutely necessary, and some sort of image search will be required.
Nothing gets deleted physically; delete history should be captured. A lot of metrics need to be generated - how many times a particular user has accessed a file, when he downloaded it, when he uploaded it, and so on.
During the lifecycle of the system, the metadata will keep evolving, and so will the search criteria.
Name, MIME type, user annotations, and versions are the most important attributes of a unit in our system. Later on, some domain-specific metadata fields will be added - about 5 to 10 per file, mostly with string values. Most of the time, a bunch of files are selected and metadata is applied to all of them at once.
Now here are some problems:
1) 10 properties per file means 500 million DB records in a table (50 million files x 10 properties each).
2) Versioning and user annotations per file make life harder and increase the volume rapidly.
3) Searching is required across all versions (the version store is separate in Alfresco currently, so searching would have to be done separately - or the two indexes combined, as in the sketch below).
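For problem 3, perhaps something like Lucene's MultiReader could make the two stores look like one index at query time - a minimal sketch, with made-up index paths:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class CrossStoreSearch {

    // Opens the live store and the version store as one logical index,
    // so a single query spans current and historical versions.
    public static IndexSearcher openCombined() throws Exception {
        IndexReader live =
                DirectoryReader.open(FSDirectory.open(Paths.get("index/live")));
        IndexReader versions =
                DirectoryReader.open(FSDirectory.open(Paths.get("index/versions")));
        return new IndexSearcher(new MultiReader(live, versions));
    }
}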
How would you see Alfresco fitting these requirements - as a high-volume versioning system for binary files?
