Store data files in BLOBs?

pascalsartorett
Champ in-the-making
As I understand it, Alfresco stores its information in two different places:

- the meta-data in a relational database.
- the data files themselves on the file system.

This is a fairly common solution used by various ECM/WCM packages, but it has a few operational drawbacks:

- a complete backup must include both the database and the file system.
- no transactional integrity; there is always a risk that the database and the file system are not in sync.
- administrators must both manage the disk space for the data files and the DB files.

The common solution to this is to store the data files as BLOBs directly in the database. There used to be a big performance penalty for doing this, but the overhead now seems acceptable given the drawbacks it removes. Our customers just love it.

So, are there any plans to also support a pure BLOB mode in a future version?

Thanks for any information, including good reasons not to use BLOBs 🙂

Pascal
8 REPLIES 8

derek
Star Contributor
Hi,

- a complete backup must include both the database and the file system.
- administrators must both manage the disk space for the data files and the DB files.
Both of the above generalized statements are true, however
- no transactional integrity; there is always a risk that the database and the file system are not in sync.
is not true for Alfresco.  We are not "fairly common" in the sense that Alfresco's content is transactionally safe and will always have finished writing the content before the transaction is allowed to commit.  In fact, we can guarantee replication of content to multiple stores within the same transaction.  I suggest you take a long, hard look at the code before making generalizations.  Specifically, take a look at (a) how the metadata is transactionally updated when the content IO channel is closed, (b) how the content URLs are assigned when writing to existing files, or (c) whether or not you can actually break it.

BLOBs are a nasty business to handle at the best of times.  But the worst of the issues is, as you mentioned, the inability to support random access without creating temporary, intermediate files.  Well, in fact, Alfresco will automatically create temporary files for content stores that don't provide random access, but the simplest and fastest solution is just to use the filesystem and get the random access without any overhead.
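To make the random-access point concrete, here is a minimal sketch, not Alfresco code. It assumes a hypothetical `blobs` table in SQLite as a stand-in for a store that offers no random access, and it models only that case (some databases can read part of a BLOB directly). The function names are illustrative.

```python
import sqlite3
import tempfile


def read_range_from_file(path, offset, length):
    """Random access on a plain file: seek straight to the offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_range_from_blob(conn, doc_id, offset, length):
    """Stand-in for a store without random access (assumed 'blobs' table):
    fetch the whole BLOB, spool it to a temporary file, then seek in the copy."""
    row = conn.execute(
        "SELECT data FROM blobs WHERE id = ?", (doc_id,)
    ).fetchone()
    with tempfile.TemporaryFile() as tmp:
        tmp.write(row[0])      # the full content is copied before any seek
        tmp.seek(offset)
        return tmp.read(length)
```

Both functions return the same bytes; the difference is that the second one has to move the entire document before it can serve a small range.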

Have you considered the cost of attempting to send over a MySQL or Oracle database dump of a 3.5 million document repository for diagnostics or other support?  If you have all your content in the database, it means that this is not a viable option.  With our current set up, we have already had customers send us their database without their content for support purposes.  We were able to diagnose problems and play around but didn't have access to their sensitive data.

Thanks for your interest.

pascalsartorett
Champ in-the-making
Thank you for your detailed answer; Alfresco has clearly been implemented by professionals 🙂

Alfresco's content is transactionally safe and will always have finished writing the content before the transaction is allowed to commit.
Sorry for being picky: if you have a crash right after having written the content, but before having committed the transaction, then you may have some orphan files on the disk (which is indeed no big deal).
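The crash window described here can be sketched as follows. This is an illustration under stated assumptions, not Alfresco's implementation: SQLite stands in for the metadata database, the `docs` table and the `fail_before_commit` flag are hypothetical, and the "crash" is simulated with an exception.

```python
import os
import sqlite3
import tempfile
import uuid


def store_document(conn, store_dir, name, data, fail_before_commit=False):
    """Write content to disk first, then commit the metadata row."""
    path = os.path.join(store_dir, uuid.uuid4().hex + ".bin")
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # content is fully on disk before any commit
    try:
        conn.execute("INSERT INTO docs (name, path) VALUES (?, ?)", (name, path))
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")
        conn.commit()
    except Exception:
        conn.rollback()  # metadata is undone, but the file stays behind
        raise
    return path


def demo():
    """Store one document normally and one with a simulated failure."""
    store_dir = tempfile.mkdtemp()
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (name TEXT, path TEXT)")
    conn.commit()
    store_document(conn, store_dir, "ok.txt", b"hello")
    try:
        store_document(conn, store_dir, "lost.txt", b"world",
                       fail_before_commit=True)
    except RuntimeError:
        pass
    referenced = {row[0] for row in conn.execute("SELECT path FROM docs")}
    on_disk = {os.path.join(store_dir, n) for n in os.listdir(store_dir)}
    return referenced, on_disk - referenced  # second value: orphan files
```

In this sketch the database never points at missing content; the only possible inconsistency is a file on disk that no row references.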

My main point was that our customers like BLOBs because it eases their lives.
And we (developers) like it because it makes it harder for them to "break" a system, e.g. by manually deleting files that they suppose unneeded (true story).

Hence, BLOBs support would be nice to have…

derek
Star Contributor
Hi,

Of course, the orphans are inevitable, but they are dealt with by a separate job that provides the cleanup that is required as a result of rollbacks and deletions.

If by "customers" you mean system administrators ("them"), then yes, it makes their lives simpler.  If, however, by "customers" you mean system end users, then I can't see how they care at all as long as it is as fast as possible.  Naturally, we have considered the trade-off and, considering that we require the backup of Lucene indexes anyway, feel that filesystem storage is the best solution.
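A cleanup job of this kind could look roughly like the sketch below. It is not the actual Alfresco job: the `docs` table with a `path` column, the one-hour grace period, and the function names are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
import time


def find_orphans(conn, store_dir, min_age_seconds=3600.0, now=None):
    """Return files in store_dir that no metadata row references.

    Files younger than min_age_seconds are skipped, since they may belong
    to a transaction that is still in flight.
    """
    now = time.time() if now is None else now
    referenced = {row[0] for row in conn.execute("SELECT path FROM docs")}
    orphans = []
    for name in sorted(os.listdir(store_dir)):
        path = os.path.join(store_dir, name)
        if path in referenced:
            continue
        if now - os.path.getmtime(path) < min_age_seconds:
            continue
        orphans.append(path)
    return orphans


def cleanup(conn, store_dir, **kwargs):
    """Delete orphaned files and return the paths that were removed."""
    removed = find_orphans(conn, store_dir, **kwargs)
    for path in removed:
        os.remove(path)
    return removed
```

The grace period is the important design choice: without it, the job could delete content that has been written but whose transaction has not yet committed.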

Let's assume that we provide an alternative BLOB-based ContentStore implementation:
Sir Alan: I notice that the content streaming is slower than before.  Why is that?
Admin: The content is stored in the database now - it takes a little longer to access and stream out.
Sir Alan: But what was wrong with the way it did it before?
Admin: It's a pain to backup the storage directory.
Sir Alan: YOU'RE FIRED!
Jokes aside, it makes the point:  Speed is a definite win over the effort of backing up a directory - a mundane task.

A BLOB-based, optional store has been on the "would be nice" list for quite a while (http://www.alfresco.org/jira/browse/AR-195), but frankly there is not much clamour for it and not a sound from support-paying customers.

Did I mention lazy replication between content stores in a clustered environment?  Try to get a database to do that.

Regards

ronnytimmermans
Champ in-the-making
Hallo Derek,

about performance:

retrieving an individual document is not a performance issue, even without streaming etc.

Alfresco has a base performance problem:
- browsing a directory with 500 folders = SLOW
- searching = SLOW (due to access control checks)

We loaded 3 million documents into an Alfresco archive. Do you know what was slow? Let me tell you. It takes more than a second to either check the existence of a directory or create one (or a space, if you wish). Storing the file was like 50 times faster!!!

In the total user experience, waiting a few milliseconds more for the content is not an issue, but waiting 60 seconds for a query to return results because Alfresco has an extremely slow (SLOW) access control check is a problem. And that again is caused by the dichotomy of Lucene and the database.

You can easily export a database with the BLOBs, by the way, if you want to interact with your customers.

Don't even look at the code to check this one - just use Alfresco on a large repository and let me know your user experience.

mikeh
Star Contributor
Are you sure there isn't a bottleneck somewhere in your setup?

http://www.alfresco.com/media/releases/2008/01/unisys-benchmark/

Mike

ronnytimmermans
Champ in-the-making
Not at all.

The benchmark with Unisys is not a realistic case; it just dumps documents into a repository.

We
- create appropriate spaces
- check for existence of spaces before creating
- set meta-data
- write content
- commit the transaction
(via web services interface)

This is with real documents, a moderately complex document model, and a ratio of about 10 to 1 for files versus folders.

We spent 2 months optimizing this via internal caches and multithreading. I don't think there is a problem with the setup. Writing the content is a fast operation, to any type of store, but the other operations are slow in Alfresco. To give you one example: during high-speed loading, Oracle was spending a considerable amount of time rolling back transactions because Lucene could not keep up.

These are the performance bottlenecks.
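As an illustration of the kind of internal cache mentioned above, here is a minimal sketch and not the loader actually used. The `remote_ensure` callable is hypothetical: it stands for whatever check-or-create call the client makes against the repository for a single space.

```python
import threading


class SpaceCache:
    """Remembers which folder (space) paths are already known to exist."""

    def __init__(self, remote_ensure):
        # Hypothetical callable: checks for one space and creates it if missing.
        self._remote_ensure = remote_ensure
        self._known = set()
        self._lock = threading.Lock()

    def ensure(self, path):
        """Ensure path and its ancestors exist, calling out only for new ones."""
        parts = [p for p in path.split("/") if p]
        for i in range(1, len(parts) + 1):
            prefix = "/" + "/".join(parts[:i])
            # The lock is held during the remote call, so two loader threads
            # never try to create the same space at the same time.
            with self._lock:
                if prefix in self._known:
                    continue
                self._remote_ensure(prefix)
                self._known.add(prefix)
```

With roughly ten files per folder, a cache like this turns most existence checks into an in-memory set lookup, which matters when each remote check costs on the order of a second.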

We'll deliver you the programs so you can reproduce.

loftux
Star Contributor
Ronny,
Can you please share what (if any) changes/optimizations you made to the lucene config as in
http://wiki.alfresco.com/wiki/Full-Text_Search_Configuration
I think this is important knowledge; please post any findings that you and Alfresco support later come up with to optimize Lucene here, or add them to the wiki.
Possibly as a separate topic, as this thread is going slightly off topic.

Just a thought: do you have many custom properties that are indexed atomically (the default behavior)? Maybe if that is set to false, indexing gets queued up and the transaction can finish without waiting for Lucene. Or skip indexing them altogether if it is not needed.

Peter Löfgren

pmonks
Star Contributor
Ronny, have you considered using a Web Script that encapsulates all of those steps in a single remotely invokable operation?  It's possible that some of the explanation for the performance you're seeing is due to the overheads of SOAP, both in terms of message parsing costs and the greater chattiness over the wire of solutions implemented using the Web Services API.

Cheers,
Peter