Load balancing a network protocol is quite common nowadays. There are plenty of ways to do it for HTTP, for instance, and generally speaking all "single flow" protocols can be load-balanced quite easily. However, some protocols are not as simple as HTTP and require several connections. This is exactly the case with FTP.
Let's take a deeper look at the FTP protocol in order to better understand how we can load-balance it. For an FTP client to work properly, two connections must be opened between the client and the server: a control connection and a data connection.
The control connection is initiated by the FTP client to TCP port 21 on the server. The data connection, on the other hand, can be created in different ways.

The first way is an "active" FTP session. In this mode the client sends a "PORT" command which opens a random port on the client side and instructs the server to connect to it using port 20 as the source port. This mode is usually discouraged, or even prevented by the server configuration, for security reasons (the server initiates the data connection to the client).

The second mode is the "passive" mode. In this mode the client sends a "PASV" command to the server. As a response the server opens a TCP port and sends its number and IP address as part of the PASV reply, so the client knows which socket to use. Modern FTP clients usually try this mode first if the server supports it.

There is a third mode, the "extended passive" mode. It is very similar to the "passive" mode, but the client sends an "EPSV" command (instead of "PASV") and the server responds with only the number of the TCP port chosen for the data connection (without sending any IP address).
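To make this more concrete, here is roughly what the control connection dialogue looks like in passive and extended passive modes (the IP address and ports are purely illustrative):

PASV
227 Entering Passive Mode (10,1,2,101,78,32)
EPSV
229 Entering Extended Passive Mode (|||20005|)

In the 227 reply the data port is encoded as two bytes (here 78*256+32 = 20000) together with the server IP address, while the 229 reply only carries the port number, so the client simply reuses the IP address of the control connection. Keep these two formats in mind, they matter later when NAT comes into play.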
So now that we know how FTP works, we also know that load-balancing FTP requires balancing both the control connections and the data connections. The load balancer must also make sure that data connections are sent to the right backend server, the one which replied to the client command.
On the Alfresco (ECM) side, there is not much to do, but there are some prerequisites.
The Alfresco configuration presented below is valid for both load balancing methods presented later. Strictly speaking, not every bit of this configuration is required for each method, but applying it as shown will work in both cases.
First of all, you should set the FTP options in the alfresco-global.properties file, because the Alfresco cluster nodes need different settings, and per-node values cannot be set using either the admin-console or the JMX interface.
If you have already set FTP parameters using JMX (or the admin-console), those parameters are persisted in the database and need to be removed from there (using the "revert" action in JMX, for example).
Add the following to your alfresco-global.properties and restart Alfresco:
### FTP Server Configuration ###
ftp.enabled=true
ftp.port=2121
ftp.dataPortFrom=20000
ftp.dataPortTo=20009
The ftp.dataPortFrom and ftp.dataPortTo properties need to be different on each server. So if there are two Alfresco nodes, alf1 and alf2, the properties for alf2 could be:
ftp.dataPortFrom=20010
ftp.dataPortTo=20019
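Once both nodes have been restarted you can quickly check that each one advertises a data port within its own range, for instance with curl (the host name and credentials below are placeholders for your own environment):

curl -v --user admin:secret ftp://alf1:2121/
# the verbose output shows the "227 Entering Passive Mode (...)" reply:
# the advertised data port must fall within 20000-20009 on alf1
# (and within 20010-20019 when repeating the test against alf2)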
Keepalived is a Linux-based load-balancing system. It wraps the IPVS (also called LVS) software stack from the Linux Virtual Server project and offers additional features like backend monitoring and VRRP redundancy. The diagram below shows how Keepalived handles FTP load-balancing: it tracks control connections on port 21 and dynamically handles the data connections using a Linux kernel module called "ip_vs_ftp", which inspects the control connection in order to know which port will be used to open the data connection.
Configuration steps are quite simple.
First install the software:
sudo apt-get install keepalived
Then create a configuration file using the sample:
sudo cp /usr/share/doc/keepalived/samples/keepalived.conf.sample /etc/keepalived/keepalived.conf
Edit the newly created file in order to add a new virtual server and the associated backend servers:

virtual_server 192.168.0.39 21 {
    delay_loop 6
    lb_algo rr              # round-robin between the backend servers
    lb_kind NAT             # NAT (masquerading) forwarding mode
    protocol TCP

    real_server 10.1.2.101 2121 {
        weight 1
        TCP_CHECK {
            connect_port 2121
            connect_timeout 3
        }
    }

    real_server 10.1.2.102 2121 {
        weight 1
        TCP_CHECK {
            connect_port 2121
            connect_timeout 3
        }
    }
}
In a production environment you will most certainly want to add a VRRP instance to make the load balancer itself highly available. Please refer to the Keepalived documentation to set that up, or just use the example given in the distribution files.
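As a starting point, a minimal VRRP instance holding the virtual IP used in this example could look like the following (the interface name, router id and priority are assumptions you will need to adapt):

vrrp_instance VI_1 {
    state MASTER            # use BACKUP with a lower priority on the second load balancer
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.0.39        # the virtual IP the FTP clients connect to
    }
}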
The virtual_server block above defines a virtual server listening on socket 192.168.0.39:21. Connections sent to this socket are redirected to the backend servers using a round-robin algorithm (others are available) and forwarded using NAT (masquerading). Additionally we need to load the FTP helper in order to track FTP data connections:
echo 'ip_vs_ftp' >> /etc/modules
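The line above only takes care of loading the module at boot time. To load it immediately and verify that the virtual server table is in place, you can use ipvsadm:

sudo modprobe ip_vs_ftp
sudo apt-get install ipvsadm    # provides the ipvsadm command if not already installed
sudo ipvsadm -Ln                # should list the 192.168.0.39:21 virtual server and both real servers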
It is important to note that this setup relies on the FTP kernel helper, which reads the payload of the FTP control connection. This means that it does not work when FTP is secured using SSL/TLS.
Before you go any further:
This method has a huge advantage: it can handle FTPS (SSL/TLS). However, it also has a big disadvantage: it doesn't work when the load balancer behaves as a NAT gateway (which is basically what HAProxy does).
This is mainly because, at the moment, Alfresco doesn't meet the necessary prerequisites for secure FTP to work in such a setup.
A JIRA has been raised in order to fix this:
Some FTP clients may work even with this limitation. It may happen to work if the server is using IPv6, or for clients using the "Extended Passive Mode" on IPv4 (a mode normally used for IPv6 only). To better understand how, please see "FTP client and passive session behind a NAT".
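To illustrate the difference, here is what a backend sitting behind the NATing load balancer would typically answer in each mode (addresses and ports are illustrative):

227 Entering Passive Mode (10,1,2,101,78,34)
229 Entering Extended Passive Mode (|||20002|)

The 227 reply leaks the backend's private address (10.1.2.101), which an external client cannot reach, whereas the 229 reply contains no address at all, so the client connects back to the load balancer IP it already used for the control connection, and the load balancer forwards it to the right backend.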
This means that the setup below will mostly only work with the Mac OS X ftp command line client and probably no other FTP client!
Don't spend time on it, and use the previous method instead, if you need other FTP clients or have no control over which FTP client your users use.
This method can also be adapted to Keepalived using iptables mangling and "fwmark" (see Keepalived secure FTP), but you should only need that if you are bound to FTPS, as plain FTP is much better handled by the previous method.
HAProxy is a modern and widely used load balancer. It provides features similar to Keepalived's, and much more. Nevertheless, HAProxy is not able to track data connections as being related to the overall FTP session. For this reason we have to play a trick on the FTP protocol in order to provide connection consistency within the session. Basically, we will split the load balancing into several parts: one pool for the control connections, and one dedicated pool per backend server for the data connections.
So if we have 2 backend servers, as shown in the diagram below, we will create 3 load-balancing connection pools (let's call them that for now).
First install the software:
sudo apt-get install haproxy
HAProxy has the notion of "frontends" and "backends". Frontends define listening sockets (or sets of sockets), each of which can be linked to a different backend. So we can use the configuration below:
frontend alfControlChannel
    bind *:21
    mode tcp                 # FTP is balanced at TCP level, not HTTP
    default_backend alfPool

frontend alf1DataChannel
    bind *:20000-20009       # data port range configured on alf1
    mode tcp
    default_backend alf1

frontend alf2DataChannel
    bind *:20010-20019       # data port range configured on alf2
    mode tcp
    default_backend alf2

backend alfPool
    mode tcp
    server alf1 10.1.2.101:2121 check port 2121 inter 20s
    server alf2 10.1.2.102:2121 check port 2121 inter 20s

backend alf1
    mode tcp
    # no destination port: HAProxy reuses the port the client connected to,
    # so data connections reach the same data port on the backend;
    # the health check still targets the FTP control port
    server alf1 10.1.2.101 check port 2121 inter 20s

backend alf2
    mode tcp
    server alf2 10.1.2.102 check port 2121 inter 20s
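After saving the configuration, a quick syntax check before reloading HAProxy never hurts:

sudo haproxy -c -f /etc/haproxy/haproxy.cfg    # configuration syntax check only
sudo service haproxy restart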
So in this case the frontend that handles the control connection load-balancing (alfControlChannel) alternately sends requests to all backend servers (alfPool). Each server (alf1 & alf2) will negotiate a data transfer socket on a different frontend (alf1DataChannel & alf2DataChannel). Each of these frontends only forwards data connections to its single corresponding backend (alf1 or alf2), thus making the load balancing sticky. And... job done!
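If you want a quick way to exercise the whole chain without a dedicated FTP client, curl can emulate the client behaviour described above: its --ftp-skip-pasv-ip option makes it ignore the address returned in the 227 reply and reuse the load balancer address instead (credentials are placeholders):

curl -v --ftp-skip-pasv-ip --user admin:secret ftp://192.168.0.39/
# the data connection goes to 192.168.0.39 on the advertised port,
# which the matching data channel frontend forwards to the right backend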