Document compression

fschnell
Champ in-the-making
Hi,

We are currently implementing Alfresco as a replacement for our document management system and also for an existing filesystem. The latter holds a large number of files (~1 million). As a storage space optimisation, I was wondering why Alfresco does not come with one or both of the following features:

1) Compress each file after indexing. That would save quite some space. On a read request, the document would be decompressed and kept in a cache for some time before being removed from the cache.

2) Our filesystem happens to contain many duplicate files. These cannot be deleted, as they came as contractual deliveries. I was thinking Alfresco could store a hash value per file in its metadata. That way a file would need to be stored only once, even though it may appear many times in different spaces. Only when a file is edited would it be stored as a second file, because its hash value changes.
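
The two ideas above can be sketched together. The following is a minimal illustration only, not anything Alfresco provides: the class `DedupCompressedStore`, its methods, and the example paths are all hypothetical, the store lives in memory, and a small LRU cache stands in for the time-limited cache described in point 1.

```python
import hashlib
import zlib
from collections import OrderedDict


class DedupCompressedStore:
    """Hypothetical in-memory store: one compressed blob per unique content hash."""

    def __init__(self, cache_size=2):
        self.blobs = {}             # SHA-256 hex digest -> zlib-compressed bytes
        self.paths = {}             # logical path -> SHA-256 hex digest
        self.cache = OrderedDict()  # digest -> decompressed bytes (LRU order)
        self.cache_size = cache_size

    def write(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            # First time this content is seen: compress and store it once.
            self.blobs[digest] = zlib.compress(data)
        # An edited file simply points at a different blob.
        self.paths[path] = digest
        return digest

    def read(self, path):
        digest = self.paths[path]
        if digest in self.cache:
            self.cache.move_to_end(digest)  # mark as most recently used
            return self.cache[digest]
        data = zlib.decompress(self.blobs[digest])
        self.cache[digest] = data
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return data


store = DedupCompressedStore()
contract = b"contractual delivery " * 100
store.write("/spaces/a/contract.doc", contract)
store.write("/spaces/b/contract.doc", contract)           # duplicate, no new blob
store.write("/spaces/b/contract.doc", contract + b"v2")   # edited copy, new blob
```

After these three writes the store holds two blobs: the original content, still referenced from the first space, and the edited version. A real implementation would also need to remove blobs that no path references any more, which this sketch leaves out.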

Does this make sense? Is it conceivable that something like this gets implemented? How do others cope with the problem of data duplication?

Thanks for your valuable feedback

Frank
6 REPLIES

mrogers
Star Contributor
I think the first point, compressing files, is an interesting one, but it should probably be solved at the operating system or file system level rather than by Alfresco. Perhaps readers of this thread can recommend a solution? The choice of Windows or Unix probably also affects which options are available.

The second point strikes me as problematic. A hash code alone would be insufficient to guarantee that files are identical. Are these physical copies of the same file, or are they some sort of symbolic link?
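
One common way to address the concern that a matching hash does not strictly prove identity is to use the hash only as a cheap first filter and confirm with a byte-for-byte comparison. The sketch below assumes both contents fit in memory, and the function name `same_content` is hypothetical.

```python
import hashlib


def same_content(a: bytes, b: bytes) -> bool:
    """Treat two blobs as duplicates only if the hashes match AND the bytes are equal."""
    if len(a) != len(b):
        return False  # different sizes can never be identical
    if hashlib.sha256(a).digest() != hashlib.sha256(b).digest():
        return False  # different hashes mean different content
    return a == b     # final byte comparison rules out a hash collision
```

With a cryptographic hash such as SHA-256 an accidental collision is extremely unlikely, so the final comparison rarely changes the answer, but it removes the remaining doubt.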

mrogers
Star Contributor
Thinking some more about this second problem: it is almost a classic version control issue, the type of problem we face when developing software on different branches of svn.

Alfresco has the AVM store and check-in and check-out functionality, so part of the solution is already there. Can anyone suggest an easy way to display the different "branches" in different spaces?

fschnell
Champ in-the-making
Thinking some more about this second problem. The problem is almost a classic version control issue, the type of problem we face when developing software on different branches of svn.

Alfresco has the AVM store and check in and check out functionality so part of the solution is already there. Can anyone suggest an easy way to display the different "branches" in different spaces?

You are right, this is a typical problem on filesystems without any satisfactory solution. While it is possible to teach software engineers how to use SVN, it is impossible to teach the average user. Moreover, in SVN you deal with just text-based information. On filesystems you need to cope with binary data.

The mechanism Alfresco provides is insufficient, as it requires active contribution from users. I am looking for a system that is invisible to users and just works in the background. I do not see a technical obstacle to implementing that.

theorbix
Confirmed Champ
About #1, transparent file compression can be implemented at the operating system level on many operating systems.

About #2, this is a native feature of hardware storage systems commonly called Content Addressed Storage, such as the EMC Centera device. I wonder if anyone has ever tried to integrate Alfresco with a Centera device…

landgar
Champ in-the-making
I'm also interested in integrating EMC Centera with Alfresco. Has anyone tried integrating it via the Centera API or XAM?

mrogers
Star Contributor
To answer the earlier question: there is now an Enterprise extension for an XAM store. It comes with 3.4. http://wiki.alfresco.com/wiki/Alfresco_Enterprise_3.4.0#XAM_Connector