
Fastest file access layer (ETL files from Alfresco)

dhartford
Champ on-the-rise
Hi all,
I'm curious if anyone can provide recommendations or numbers (or discouragement) around the best way to access files from within Alfresco to be consumed by an ETL tool (SSIS, Pentaho ETL, Talend, etc).

For example, most of the ETL tools can use CIFS, WebDAV, NFS, or FTP. What are people's experiences with these services for an ETL use case (say, 1GB or 10GB text files to be read by the ETL tool)?


(Or is there a more advanced approach if the Alfresco alf_data is on a SAN that can be mounted directly by the ETL tool for read-only access?)

Thanks for any input!
-D
3 REPLIES

scouil
Star Contributor
I think Talend has a native Alfresco connector, but if I remember correctly it only writes to Alfresco and does not read from it.
Anyway, Alfresco's web service ecosystem makes it possible to do most of the operations you want with just web service calls, and I believe pretty much any ETL tool can make REST calls.

I only tried Talend quite some time ago and interacted with Alfresco at both the FTP and webscript levels. Performance wasn't a core requirement, though, so I haven't benchmarked it, sorry.

mrogers
Star Contributor
Streaming content out of the repo should be pretty good whatever interface you use; I'd expect the different interfaces to be fairly similar for streaming big files. I'd consider it something worth investigating and fixing if anything is wildly out of line. That said, I'd expect the raw content webscript to be fastest.
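For readers who want to try the raw content webscript outside an ETL tool, here is a minimal Python sketch of streaming a node's content to disk in chunks. It is only an illustration under assumptions: the helper names `content_url` and `stream_to_file` are made up for this example, the URL pattern is the `api/node/content/workspace/SpacesStore/` path that appears in the test results later in this thread, and basic authentication is assumed to be accepted by the server.

```python
import shutil
import urllib.request


def content_url(base_url: str, node_id: str) -> str:
    """Build a node-content webscript URL (hypothetical helper).

    base_url is assumed to look like
    http://servername:8080/share/proxy/alfresco, and node_id is the node's
    UUID in workspace://SpacesStore.
    """
    base = base_url.rstrip("/")
    return f"{base}/api/node/content/workspace/SpacesStore/{node_id}"


def stream_to_file(url: str, dest_path: str, user: str = "",
                   password: str = "", chunk_size: int = 1024 * 1024) -> int:
    """Copy the response body to dest_path chunk by chunk.

    The whole file is never held in memory, which matters for multi-GB
    text files. Returns the number of bytes written.
    """
    handlers = []
    if user:
        # Assumption: the endpoint accepts HTTP basic authentication.
        manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        manager.add_password(None, url, user, password)
        handlers.append(urllib.request.HTTPBasicAuthHandler(manager))
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()
```

A call would look like `stream_to_file(content_url(base, node_id), "export.txt", "user", "password")`, where `base` and `node_id` come from your own installation.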

The best performance would come from reversing the technology behind the "bulk import", which does a batch upload directly into the content store. However, I suspect that it may not be worth it. The rough outline would be to find batches of the files and metadata that you want to export and then copy them directly out of the content store. If your use case could avoid doing a copy, for example by just creating a link to the content file, that would be even better.

dhartford
Champ on-the-rise
Some research/numbers collected, looks relatively good!


simple test scenario:
277MB text file uncompressed, mixed mode access (not necessarily DOS or unix)
214704 records
older (2007) desktop as alfresco server, Pentium-D, 7200rpm drive - numbers WILL CHANGE with faster disks.
100Mbps lan to remote laptop

========Webscript==================
Pentaho Text File Input, fixed width (single field/no parsing to only measure transfer).
remote lan alfresco: webscript access - note, this appears to keep working even if the file is moved to a different repo.
http://user:password@servername:8080/share/proxy/alfresco/api/node/content/workspace/SpacesStore/bd2fc...
–average 20.1 sec
–average 10.6k rows/second
~9% alfresco server cpu
=======Webdav===============
Pentaho Text File Input, fixed width (single field/no parsing to only measure transfer).
remote lan alfresco: webdav access - note, the file needs to be in the 'Site' repository and must not be moved.
http://user:password@servername:8080/alfresco/webdav/testfile.txt
–average 20 sec
–average 10.7k rows/second
~3% alfresco server cpu
======FTP=========
Pentaho Text File Input, fixed width (single field/no parsing to only measure transfer).
remote lan alfresco: ftp access - note, the file needs to be in the 'Site' repository and must not be moved.
ftp://user:password@servername:2121/Alfresco/testfile.txt
–average 20.5 sec
–average 10.5k rows/second
~3% initial spike, but remainder at ~1.5% alfresco server cpu
(FTP requires privileged port access unless you use port 2121; if you have a local firewall, modify it to allow that port)


========Gzip Webscript==================
Pentaho Text File Input, fixed width (single field/no parsing to only measure transfer).
remote lan alfresco: webscript access - note, this appears to keep working even if the file is moved to a different repo.
http://user:password@servername:8080/share/proxy/alfresco/api/node/content/workspace/SpacesStore/01eb1...
–average 11.9 sec
–average 18.0k rows/second
~9% alfresco server cpu
=======Gzip Webdav===============
Pentaho Text File Input, fixed width (single field/no parsing to only measure transfer), compression.
remote lan alfresco: webdav access - note, the file needs to be in the 'Site' repository and must not be moved.
http://user:password@servername:8080/alfresco/webdav/testfile.txt.gz
–average 11.7 sec
–average 18.3k rows/second
~3% alfresco server cpu
======Gzip FTP=========
Pentaho Text File Input, fixed width (single field/no parsing to only measure transfer).
remote lan alfresco: ftp access - note, the file needs to be in the 'Site' repository and must not be moved.
ftp://user:password@servername:2121/Alfresco/testfile.txt.gz
–average 11.7 sec
–average 18.3k rows/second
~3% initial spike, but remainder at ~1.5% alfresco server cpu
(FTP requires privileged port access unless you use port 2121; if you have a local firewall, modify it to allow that port)

=======Local disk baseline===========
–average 12.5 sec
–average 17.2k rows/second
===================
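For anyone re-running this with their own timings, the rows/second figures are simply the record count divided by the average elapsed time. The short Python sketch below redoes that arithmetic from the numbers reported above and also expresses each run relative to the local disk baseline. The names `rows_per_second` and `relative_to_baseline` are made up for this example, and the results can differ from the posted figures in the last digit because of rounding.

```python
RECORDS = 214_704  # record count of the test file, from the post

# Average elapsed seconds per run, copied from the results above.
AVERAGE_SECONDS = {
    "webscript": 20.1,
    "webdav": 20.0,
    "ftp": 20.5,
    "gzip webscript": 11.9,
    "gzip webdav": 11.7,
    "gzip ftp": 11.7,
    "local disk": 12.5,
}


def rows_per_second(records: int, seconds: float) -> float:
    """Throughput in records per second for one run."""
    return records / seconds


def relative_to_baseline(seconds: float, baseline_seconds: float) -> float:
    """Elapsed time as a multiple of the baseline (1.0 means equal)."""
    return seconds / baseline_seconds


if __name__ == "__main__":
    baseline = AVERAGE_SECONDS["local disk"]
    for name, seconds in AVERAGE_SECONDS.items():
        rate = rows_per_second(RECORDS, seconds)
        ratio = relative_to_baseline(seconds, baseline)
        print(f"{name:15s} {rate / 1000:5.1f}k rows/s  {ratio:4.2f}x baseline time")
```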


Note that CIFS was not tested (it needs to be integrated and set up with AD; I've done it before, but it's a bit of a pain). Most resources mention that CIFS requires quite a bit more CPU, and it is likely to perform similarly to, or somewhat worse than, the other methods.

Note that CMIS was not tested, as it was not functioning with the Pentaho ETL tool used for testing. I suspect the ?id=workspace://… syntax is not compatible with the VFS layer used by the Pentaho ETL tool.