does alfresco store documents as files or in db? any limits?

bhavin_t
Champ in-the-making
currently we have a home-grown knowledge management system layered on top of SVN. i am investigating alfresco

one of the questions i had was how does alfresco store its files? this question is important for the following reasons -

* one of the resources within our company is VIDEOS. These videos are huge files of 900 MB or 1 GB.

* i can upload them to alfresco through CIFS

* however i do not want alfresco to version them, because if i work on a newer version of the same video and upload it, and alfresco ends up storing two copies for versioning, it will result in excessive disk space utilization

* additionally if alfresco stores the entire file inside a db or something then there will be a conversion process to actually insert the file into the DB. therefore ideally i would like to store these video files in the file system within alfresco

my question therefore is -

* can i specify the store that alfresco uses for a set of files?

* can i specify certain files to NOT get versioned?

thanks
- Bhavin
6 REPLIES

kevinr
Star Contributor
Alfresco stores file content on the file system by default - only meta-data is stored in the database, so this should suit your needs.

Alfresco does not version content by default - you need to actually enable versioning for a folder/type or the entire system.

Thanks,

Kevin

akohlsmith
Champ in-the-making
Alfresco stores file content on the file system by default - only meta-data is stored in the database, so this should suit your needs.

That's oversimplifying a little.  Yes, the content is stored in the filesystem, but the names are mangled into something totally unusable outside of Alfresco.  I've been unable to, at this point, find a mapping between the filesystem name and the actual content properties (name, author, description, etc.) inside the database.  Could you or anyone else point me in the right direction?

kevinr
Star Contributor
Alfresco stores file content on the file system by default - only meta-data is stored in the database, so this should suit your needs.

That's oversimplifying a little.  Yes, the content is stored in the filesystem, but the names are mangled into something totally unusable outside of Alfresco.  I've been unable to, at this point, find a mapping between the filesystem name and the actual content properties (name, author, description, etc.) inside the database.  Could you or anyone else point me in the right direction?

True, and there are good reasons for that. Changing the names means that we don't have to worry about the underlying filesystem storing utf-8 or similar filenames that may cause problems, it's also good for security of the files. The mapping is via the {http://www.alfresco.org/model/content/1.0}content property in the node_properties table, the contentUrl is part of the value stored in the string_value column. Obviously you should only ever use the Alfresco APIs to change these values. All of the services in the Alfresco repository are pluggable, so the way that files are stored could be changed without affecting the rest of the system.

Thanks,

Kevin

akohlsmith
Champ in-the-making
True, and there are good reasons for that. Changing the names means that we don't have to worry about the underlying filesystem storing utf-8 or similar filenames that may cause problems, it's also good for security of the files.

Security through obscurity is no security at all.  If your server is blown open they also have the mapping tables, which means that there is no enhanced security through renaming the files like this.  Also, chances are that you're already making everything very, very open through your CIFS, FTP or webDAV access methods.

The "we don't worry about screwy filenames" is a very good argument though.  Portability is *definitely* enhanced through this.

The mapping is via the {http://www.alfresco.org/model/content/1.0}content property in the node_properties table, the contentUrl is part of the value stored in the string_value column.

I can't seem to find the mapping.  I can use
SELECT guid FROM node_properties WHERE qname = '{http://www.alfresco.org/model/content/1.0}name' AND string_value = 'MX Control.doc';

and get back a guid of '74ecdfb5-f7d0-11da-baf4-8975711acf00', but I can't find that anywhere in the filesystem:
# find /opt/alfresco/alf_data/contentstore/2006 -iname '74ecdfb5-f7d0-11da-baf4-8975711acf00*'
#

Now if I muck about with the filename a little I do find something:
# find /opt/alfresco/alf_data/contentstore/2006 -iname '*-f7d0-11da-baf4-8975711acf00*'
/opt/alfresco/alf_data/contentstore/2006/6/9/11/74f34857-f7d0-11da-baf4-8975711acf00.bin
# file /opt/alfresco/alf_data/contentstore/2006/6/9/11/74f34857-f7d0-11da-baf4-8975711acf00.bin
/opt/alfresco/alf_data/contentstore/2006/6/9/11/74f34857-f7d0-11da-baf4-8975711acf00.bin: Microsoft Office Document
#

But that isn't very scientific.  How does one use the database to precisely locate the path and physical filename of content stored within Alfresco?

Obviously you should only ever use the Alfresco APIs to change these values. All of the services in the Alfresco repository are pluggable, so the way that files are stored could be changed without affecting the rest of the system.

Absolutely, but for read-only access there should be a way to do this.  I haven't benchmarked Alfresco's CIFS/webDAV/FTP access yet but my initial impression is that its performance is a little lacking, especially compared to "raw" FTP/Samba performance on the same physical hardware.  It's understandable given that it's a content management system rather than a simple file server but before I decide to throw thousands of documents totaling gigabytes of data into it, I want to have a very clear and low-level understanding of where my data is and how to get at it in the event of a problem.

kevinr
Star Contributor
True, and there are good reasons for that. Changing the names means that we don't have to worry about the underlying filesystem storing utf-8 or similar filenames that may cause problems, it's also good for security of the files.

Security through obscurity is no security at all.  If your server is blown open they also have the mapping tables, which means that there is no enhanced security through renaming the files like this.  Also, chances are that you're already making everything very, very open through your CIFS, FTP or webDAV access methods.

Woooaah there :) I wasn't trying to say it's _for_ security - just that it can't hurt to do so - the filename character issue is the main reason.

Alfresco is secure via all APIs and interfaces - there is nothing "open" - they all require authentication to allow you to browse anything beyond the "Guest" folder - and even that can be disabled via config.

Obviously you should only ever use the Alfresco APIs to change these values. All of the services in the Alfresco repository are pluggable, so the way that files are stored could be changed without affecting the rest of the system.

Absolutely, but for read-only access there should be a way to do this.  I haven't benchmarked Alfresco's CIFS/webDAV/FTP access yet but my initial impression is that its performance is a little lacking, especially compared to "raw" FTP/Samba performance on the same physical hardware.  It's understandable given that it's a content management system rather than a simple file server but before I decide to throw thousands of documents totaling gigabytes of data into it, I want to have a very clear and low-level understanding of where my data is and how to get at it in the event of a problem.

That's fine and you are probably right, going to the database directly is always going to be faster than any framework that adds value on top of it.

Sorry I did not give you all the info you need to write a query. If you have the GUID of a content node (from the NodeRef in Alfresco) then you can extract the string_value column from the content property of that node by joining the node table to the node_properties table:

select string_value from node_properties p join node n on n.id = p.node_id where p.qname='{http://www.alfresco.org/model/content/1.0}name' and n.uuid = 'XXXX';

That returns the content URL field - which contains all the info you need on the content file and looks something like this:

contentUrl=store://2006/5/22/17/16c6cf0a-e9ae-11da-ac1c-8900bf14b17f.bin|mimetype=text/plain|size=609|encoding=UTF-8

It includes the folder path (relative to the usual "./alf_data/contentstore" directory) and the mimetype and size of the content compacted into a single string value. That should give you enough to find the content on the disk.
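As a rough illustration of the format described above, the following Python sketch splits such a string_value into its fields and joins the store:// path onto a content store root. The function names are made up for this example, and the default root is only an assumption borrowed from the /opt/alfresco/alf_data/contentstore path seen earlier in the thread, so adjust it to your installation. It only reads the string and never touches the repository.

```python
from pathlib import PurePosixPath

# Assumed install location (taken from the paths quoted in this thread);
# change it to match your own alf_data directory.
DEFAULT_STORE_ROOT = "/opt/alfresco/alf_data/contentstore"


def parse_content_value(value: str) -> dict:
    """Split a 'key=value|key=value' content property string into a dict."""
    fields = {}
    for part in value.split("|"):
        key, _, val = part.partition("=")
        fields[key] = val
    return fields


def content_path(value: str, store_root: str = DEFAULT_STORE_ROOT) -> PurePosixPath:
    """Return the on-disk path that the contentUrl field points at."""
    url = parse_content_value(value)["contentUrl"]
    prefix = "store://"
    if not url.startswith(prefix):
        raise ValueError(f"unexpected content URL: {url!r}")
    # The part after store:// is relative to the content store root.
    return PurePosixPath(store_root) / url[len(prefix):]


example = (
    "contentUrl=store://2006/5/22/17/16c6cf0a-e9ae-11da-ac1c-8900bf14b17f.bin"
    "|mimetype=text/plain|size=609|encoding=UTF-8"
)
fields = parse_content_value(example)
print(fields["mimetype"], fields["size"])
print(content_path(example))
```

For read-only inspection this is usually enough: the printed path can be handed to tools such as file or less, while any changes should still go through the Alfresco APIs.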

Hope this helps,

Kevin

hm
Champ in-the-making
Just a minor correction:

Use

{http://www.alfresco.org/model/content/1.0}content

instead of

{http://www.alfresco.org/model/content/1.0}name

or better yet use this query:

select * from node_properties where node_id in (select id from node where uuid = 'XXXXX');
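
To try the corrected lookup without touching a live repository, here is a minimal Python sketch that runs the join from this thread against a throwaway in-memory SQLite database. The two tables contain only the columns mentioned in the thread (id, uuid, node_id, qname, string_value); the real Alfresco schema has more columns and may differ between versions, and the sample rows (including the example.txt name) are invented for illustration.

```python
import sqlite3

CONTENT_QNAME = "{http://www.alfresco.org/model/content/1.0}content"
NAME_QNAME = "{http://www.alfresco.org/model/content/1.0}name"

# Toy stand-in for the repository database: only the columns named in the thread.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (id INTEGER PRIMARY KEY, uuid TEXT);
    CREATE TABLE node_properties (node_id INTEGER, qname TEXT, string_value TEXT);
    """
)
conn.execute("INSERT INTO node (id, uuid) VALUES (?, ?)", (1, "XXXX"))
conn.executemany(
    "INSERT INTO node_properties (node_id, qname, string_value) VALUES (?, ?, ?)",
    [
        (1, NAME_QNAME, "example.txt"),  # hypothetical file name
        (
            1,
            CONTENT_QNAME,
            "contentUrl=store://2006/5/22/17/16c6cf0a-e9ae-11da-ac1c-8900bf14b17f.bin"
            "|mimetype=text/plain|size=609|encoding=UTF-8",
        ),
    ],
)


def content_value(db: sqlite3.Connection, uuid: str):
    """Return the content property string for a node UUID, or None if absent."""
    row = db.execute(
        "SELECT p.string_value FROM node_properties p "
        "JOIN node n ON n.id = p.node_id "
        "WHERE p.qname = ? AND n.uuid = ?",
        (CONTENT_QNAME, uuid),
    ).fetchone()
    return row[0] if row else None


print(content_value(conn, "XXXX"))
```

Filtering on the content qname rather than the name qname is what makes the query return the contentUrl string instead of the document's display name.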