Adding Very Large Files

csbrown
Champ in-the-making
I'm wondering how to use the WebService to put very large files into Alfresco, meaning files larger than I want to hold in a byte[] that eats up all my memory.  The only method I see on the WebService interface for writing to a content node is the ContentServiceSoapBindingStub.write(..) method.

Is there any other way to give it a stream to read from as it writes the content?

Perhaps multiple calls to the write() method would work.  Is that safe?

Thanks,

Colby
15 REPLIES

rwetherall
Confirmed Champ
Hi,

At the moment the only way to write content to the server via the web service interface is by using the write method and setting the byte[] parameter.

I appreciate your concerns about uploading large files with this method.  It is certainly an area we need to expand so that the web service interface copes better with very large files.

Calling write multiple times currently won't help as it will overwrite the existing content on every call.  Perhaps we need an append method?

Another possibility would be to attach the content file to the request directly as an attachment, then the content could be streamed as you suggest?

I'll create a Jira task to ensure this gets followed through.

Cheers,
Roy

csbrown
Champ in-the-making
Thanks Roy.

I was originally looking for a method that took a stream, but I guess that doesn't make sense when dealing with a Web Service interface.

I like the idea of attaching the file as an attachment.  That seems like it would be a sound solution and one that would fit well with Web Services.  I'll keep an eye on the Jira issue.

sirdodger
Champ in-the-making
I'm trying to upload reasonably large files too.  (For reference, I'm running from .NET under Windows, and trying to upload a ~100MB file.)

The contentWebService.write() call seems to be trying to allocate a buffer the size of the file on the network interface and failing with an error like "An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full".

The contentWebService.writeAsync() call seems to be trying to convert my entire file to a base64 string, which is failing with an OutOfMemoryException.
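
For what it's worth, base64 itself does not force the whole file into memory: it turns every 3 raw bytes into 4 characters, so it can be produced incrementally with a small buffer. Below is a minimal sketch in Java (not .NET, and not part of any Alfresco client), assuming you control the code that produces the encoded form; the class and method names are made up for illustration. It does not change what the generated web service stub does with its byte[] parameter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Base64;

/** Illustrative helper (hypothetical name): base64-encodes a stream piece by piece. */
final class StreamingBase64 {

    /** Number of characters in the padded base64 form of rawBytes bytes. */
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    /**
     * Copies in to out as base64 using a fixed 8 KB buffer, so memory use
     * does not grow with the file size. Returns the number of raw bytes read.
     */
    static long encode(InputStream in, OutputStream out) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        // Closing the wrapper writes the final padded group and closes out.
        try (OutputStream b64 = Base64.getEncoder().wrap(out)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                b64.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

By the same arithmetic, encodedLength(100L * 1024 * 1024) comes to 139810136 characters, so a 100MB file grows by about a third once encoded, before counting however the client stores those characters in memory.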

Do you have suggestions on other mechanisms by which I can upload the file from my .NET client to the repository? 

I didn't really understand the last poster's idea about making it an "attachment", or why accepting a stream to the web service interface is a bad idea.

csbrown
Champ in-the-making
I haven't seen much activity on this issue and it doesn't look like the Jira issue has been touched since it was first created. 

We still have a need for this and don't currently have any workarounds.  Luckily, we haven't run into a file large enough to break anything, but it sounds like you have.   The only thing I might suggest is writing it to a file system folder that Alfresco could monitor and pick up new files from directly.

Not sure how good a solution that is, but it might work until the Alfresco team gets things going for this issue.

sirdodger
Champ in-the-making
Is there a built-in mechanism in Alfresco that has it monitor local directories for modifications?  Or would I have to modify Alfresco's source?

rdanner
Champ in-the-making
Roy,

Doesn't the append mechanism make more sense in terms of a web service API?  At first glance, anyway, because the drawbacks soon come flooding in: the client has to manage the fragmentation of the file, which is awkward and error-prone.
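
To make the fragmentation burden concrete, here is a minimal sketch of what the client side might look like. It assumes an append-style operation that the web service does not currently offer, so ChunkSink and its append method are purely hypothetical stand-ins, as are the other names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Sketch of client-side fragmentation for a hypothetical append-style upload. */
final class ChunkedUpload {

    /** Hypothetical receiver; stands in for an append operation that does not exist today. */
    interface ChunkSink {
        void append(long offset, byte[] chunk) throws IOException;
    }

    /**
     * Reads the stream in pieces of at most chunkSize bytes and hands each
     * piece to the sink, so only one chunk is held in memory at a time.
     * Returns the total number of bytes sent.
     */
    static long upload(InputStream in, int chunkSize, ChunkSink sink) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize];
        long offset = 0;
        while (true) {
            int filled = 0;
            // Fill the buffer completely unless the stream ends first.
            while (filled < chunkSize) {
                int n = in.read(buffer, filled, chunkSize - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            if (filled == 0) {
                break; // nothing left to send
            }
            sink.append(offset, Arrays.copyOf(buffer, filled));
            offset += filled;
            if (filled < chunkSize) {
                break; // short chunk means the stream is exhausted
            }
        }
        return offset;
    }
}
```

Even in this small form the client has to track offsets, and a real version would also need retries and some way to recover from a chunk that fails halfway through, which is where the awkwardness comes in.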

I know that GLOBUS, which is a grid computing platform, has some capabilities for moving large sets of data around.  It might be worthwhile looking at what they are doing.  I have lost touch with them for a while, but they were moving very heavily in the direction of WS and I am curious what they have done with their data handling.  Maybe they have a model that is suitable.

rdanner
Champ in-the-making
Well, I searched through my old GLOBUS stuff and didn't find much.  I knew that GLOBUS used a grid version of FTP and that it has been superseded by RFTS (Reliable File Transfer Service), but I wanted to find out if they had analogous web services.  I didn't see any, but my material is old.

Let me play the role of stupid guy here (it comes naturally to me):

How large is "very large"?

What is the limiting factor on the file transfer?

How intelligent is the client?

Is the Alfresco-aware component (the one consuming an Alfresco service) an adapter, or is the whole application aware of Alfresco?

How important is security on the source end?

How important is security on the target end?

Are you behind any firewalls?  What's the topology we are talking about?

The current web service API is inherently a push model.  Have you considered other models?


There are a lot of things that could be done to move huge files in a very secure fashion.  This kind of thing usually requires some coordination.  FTP, for example, opens a control channel and a transfer channel.  Most clients are not going to want to "think" about all of this if it is possible to avoid it.

If you have a need to transfer really gigantic files then it is likely that security is an issue for you.  Certainly when transferring files of immense size, accuracy is critical.  I don't want to send a ton of data only to find out the middle of the file is corrupt or missing.
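
One common way to catch a corrupt or truncated transfer is to compute a digest on both ends and compare them. Here is a minimal sketch, assuming SHA-256 and a hypothetical helper name; how the two digests get exchanged is left open, since nothing in the current web service API does this for you.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative helper (hypothetical name): digest of a stream, computed incrementally. */
final class TransferChecksum {

    /** Returns the SHA-256 digest of the stream as lowercase hex, reading 8 KB at a time. */
    static String sha256Hex(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

The sender would compute this over the source file and the receiver over what it stored; a mismatch means the transfer has to be repeated. The file is never held in memory as a whole.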

If such is the case I would consider looking at my topology and my method.  One interesting mechanism that will be available sooner or later is Alfresco federation.  The current detail on the roadmap for federation is focused around federated search services.  The question is whether or not there will be a mechanism for two repositories to share content as peers, or in dominant relationships.

There is nothing like that functionality on the roadmap, but let's speculate on the possibilities a little and see if it has any merit where this problem is concerned.  I really don't know anything about your requirements either, so I'll just throw this out there for fun.


One of the possible topologies might be to have a local Alfresco repository.  It is then very easy to get the content into the local system, as simple as writing it down to the file system.

The value here is that your application need not care about the responsibility of pushing lots of data to some remote location, or about the correctness and security of the transfer.

You would only need to deliver the data locally.  The federated repository is then responsible for moving the information to where it needs to go.  It has to make sure it can make multiple calls if that is needed, make sure the transfer is accurate, make sure it happened securely, and make sure it all completes successfully.

You, on the other hand, need not care about any of that from the perspective of the application.  That's not your problem; you just wanted to get data into the repository.

I am going to go dig around the GLOBUS area and see if any new goodies are available.  If you haven't checked out GLOBUS then you should.  It is cool.

rdanner
Champ in-the-making
sirdodger wrote: "Is there a built-in mechanism in Alfresco that has it monitor local directories for modifications?  Or would I have to modify Alfresco's source?"

I almost missed this post.  Glad I caught it.  It seems I have a similar need.  I have to integrate a publishing system (we are soon to pick a vendor).  Each candidate only supports a "hot folder" mechanism for integration.  A hot folder is a folder where they can pick up or drop off files (and an XML ticket) on some periodic basis.  I don't really care for the solution, of course, as it has more than a handful of problems, but it is simple.

This question also relates to my comments I made about VLF (very large files). 

I have already added this to the community roadmap wish list (http://wiki.alfresco.com/wiki/Community_Roadmap_Suggestions) at the following wiki page http://wiki.alfresco.com/wiki/Hot_Folders

What would be really helpful is if you and I and any other people who need hot folders could start to put definition around this on the wiki.

What are our requirements? 

For example, do we need the file to be picked up immediately?  Because that will have an impact on our approach.  Our approach would take into account that different people have different needs and assumptions.  For example, we could have a pluggable strategy for detecting files in the hot folder.

We could plug in a simple Java-based polling mechanism.

We could plug in a scheduled pickup mechanism.

We might need to have pluggable libraries that allow us to register for OS filesystem events, so that files can be picked up the minute they show up or the minute they are closed by the sender.

Win32 offers the FindFirstChangeNotification mechanism (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wcedata5/html/wce50lrfFindFirstChan...)

Linux offers dnotify, which has been replaced by inotify (http://stefan.buettcher.org/cs/fschange/index.html)

We might have all kinds of requirements such as sending metadata along with the file or passing processing instructions to the inbound repository.

We might need to drop off file requests:  E.g. I am here for such and such a file, drop the file in this folder so I can take it away.

Some of us will have security concerns, authentication, authorization, auditing, and administration needs.
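
As a starting point for discussion, the pluggable detection idea above could be sketched roughly like this. It is only an illustration of the shape, not a proposal for Alfresco's actual design; every name in it (HotFolder, DetectionStrategy, PollingStrategy) is made up.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of a hot folder with a pluggable way of detecting new files. */
final class HotFolder {

    /** Pluggable detection strategy: each call returns files not reported before. */
    interface DetectionStrategy {
        List<Path> detectNewFiles() throws IOException;
    }

    /** Simplest strategy: list the directory and diff against what was already seen. */
    static final class PollingStrategy implements DetectionStrategy {
        private final Path folder;
        private final Set<Path> seen = new HashSet<>();

        PollingStrategy(Path folder) {
            this.folder = folder;
        }

        @Override
        public List<Path> detectNewFiles() throws IOException {
            List<Path> fresh = new ArrayList<>();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
                for (Path entry : entries) {
                    // Set.add returns false for entries reported by an earlier poll.
                    if (Files.isRegularFile(entry) && seen.add(entry)) {
                        fresh.add(entry);
                    }
                }
            }
            return fresh;
        }
    }
}
```

A scheduled pickup would just call detectNewFiles() from a timer, and an event-based strategy (for example one built on java.nio.file.WatchService) could implement the same interface. Note that this polling sketch cannot tell whether the sender has finished writing a file, which is one of the requirements we would need to pin down.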



The next step is to start collecting these needs.  Then we can work with Alfresco and the community to get the work on the roadmap and out the door.