Since this post was published there has been a HAProxy 1.5(.x) release, so this post is now out of date. An updated post with the changes relevant to HAProxy 1.5 can be found here: https://www.alfresco.com/blogs/devops/?p=8

---

For the cloud service we (Alfresco DevOps) used to use Apache for all our load balancing and reverse proxy needs, but more recently we switched to HAProxy for this task. In this article I'll list some of the settings we use, and give a final example that could be used (with some environment-specific modifications) for a general Alfresco deployment.

The main website for HAProxy is: http://haproxy.1wt.eu/
The docs can be found here: http://cbonte.github.io/haproxy-dconv/configuration-1.5.html

I suggest that for any of the settings covered in the rest of this article, you consult the HAProxy docs to gain a deeper understanding of what they do.

The 'global' section:

global
pidfile /var/run/haproxy.pid
log 127.0.0.1 local2 info
stats socket /var/run/haproxy.stat user nagios group nagios mode 600 level admin
A quick breakdown of these:
- global - defines global settings.
- pidfile - writes the PIDs of all daemons into the named file.
- log - adds a global syslog server (optional).
- stats socket - sets up a statistics output socket (optional).
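With the admin-level stats socket configured, you can query a running HAProxy at any time with a tool such as socat (assuming socat is installed and the socket path above is used):

```shell
# Dump current frontend/backend/server statistics in CSV form
echo "show stat" | socat stdio /var/run/haproxy.stat

# Show general process information (version, uptime, connection counts)
echo "show info" | socat stdio /var/run/haproxy.stat
```

This is very handy for scripting checks against HAProxy (note the `user nagios` ownership above, which lets a monitoring system read it).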
The 'defaults' section:

defaults
mode http
log global
A quick breakdown of these:
- defaults - defines the default settings
- mode - sets the working mode to http (rather than tcp)
- log - sets the log context
Now we configure some options that specify how HAProxy works. These options are very important to get your service working properly:

option httplog
option dontlognull
option forwardfor
option http-server-close
option redispatch
option tcp-smart-accept
option tcp-smart-connect
These options do the following:
- option httplog - this enables logging of HTTP requests, session state and timers.
- option dontlognull - disable logging of null connections as these can pollute the logs.
- option forwardfor - enables the insertion of the X-Forwarded-For header to requests sent to servers.
- option http-server-close - enable HTTP connection closing on the server side. See the HAProxy docs for more info on this setting.
- option redispatch - enable session redistribution in case of connection failure, which is important in a HA environment.
- option tcp-smart-accept - a performance tweak that saves one ACK packet during the accept sequence.
- option tcp-smart-connect - a performance tweak that saves one ACK packet during the connect sequence.
Next we define the timeouts - these are fairly self-explanatory:

timeout http-request 10s
timeout queue 1m
timeout connect 5s
timeout client 2m
timeout server 2m
timeout http-keep-alive 10s
timeout check 5s
retries 3
We then configure gzip compression to reduce the amount of data sent across the wire - an easy performance optimisation that no configuration should miss:

compression algo gzip
compression type text/html text/html;charset=utf-8 text/plain text/css text/javascript application/x-javascript application/javascript application/ecmascript application/rss+xml application/atomsvc+xml application/atom+xml application/atom+xml;type=entry application/atom+xml;type=feed application/cmisquery+xml application/cmisallowableactions+xml application/cmisatom+xml application/cmistree+xml application/cmisacl+xml application/msword application/vnd.ms-excel application/vnd.ms-powerpoint
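To get a feel for what this buys you, here is a quick illustration of how well repetitive text/html content (like most rendered pages) compresses; the sample markup is made up:

```python
import gzip

# Repetitive HTML, roughly like a long document listing page.
# The markup here is purely illustrative.
html = ("<div class='doc-row'><span>Some document name</span></div>\n" * 200).encode("utf-8")
compressed = gzip.compress(html)

print(f"raw: {len(html)} bytes, gzipped: {len(compressed)} bytes")
```

For content like this the compressed payload is a small fraction of the original, which is why the `compression type` list above covers all the text-based MIME types Alfresco serves.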
The next section is some error message housekeeping. Change these paths to wherever you want to put your error messages:

errorfile 400 /var/www/html/errors/400.http
errorfile 403 /var/www/html/errors/403.http
errorfile 408 /var/www/html/errors/408.http
errorfile 500 /var/www/html/errors/500.http
errorfile 502 /var/www/html/errors/502.http
errorfile 503 /var/www/html/errors/503.http
errorfile 504 /var/www/html/errors/504.http
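Note that these `.http` files are raw HTTP responses, not plain HTML pages: each must begin with a status line and headers, followed by a blank line and the body. A minimal 503.http might look like this (the body text is just an example):

```http
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

<html><body><h1>503 Service Unavailable</h1>
<p>The service is temporarily unavailable. Please try again shortly.</p>
</body></html>
```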
Now that we have finished setting up our defaults, we can start to define our frontends (listening ports). We first define our frontend on port 80. This just redirects to the https frontend:

# Front end for http to https redirect
frontend http
bind *:80
redirect location https://my.yourcompany.com/share/
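A `redirect location` like this sends every plain-http request to the Share landing page, discarding the original path. If your HAProxy build supports it (it was added in the 1.5 development series), `redirect scheme` preserves the requested host and path instead:

```
frontend http
    bind *:80
    redirect scheme https code 301
```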
Next we define our https frontend, which is where all traffic to Alfresco is handled:

# Main front end for all services
frontend https
bind *:443 ssl crt /path/to/yourcert/yourcert.pem
capture request header X-Forwarded-For len 64
capture request header User-agent len 256
capture request header Cookie len 64
capture request header Accept-Language len 64
We now get into the more 'fun' part of configuring HAProxy - setting up the acls. These acls are the mechanism used to match requests to the appropriate backend to fulfil them, or to block unwanted traffic from the service. If you are unfamiliar with HAProxy, I suggest you have a good read of the docs for acls and what they can achieve (section 7 in the docs).

We separate out the different endpoints for Alfresco into their own sub-domain names, e.g. my.alfresco.com for Share access, webdav.alfresco.com for WebDAV, sp.alfresco.com for SharePoint access. I'll use these three endpoints in the examples below, using the following mapping:
- Share - my.yourcompany.com
- Webdav - webdav.yourcompany.com
- Sharepoint - sp.yourcompany.com
We first set up some acls that check the host name being accessed and match on those. Anything coming in that doesn't match these won't get an acl associated (and therefore won't be forwarded to any service).

# ACL for backend mapping based on host header
acl is_my hdr_beg(host) -i my.yourcompany.com
acl is_webdav hdr_beg(host) -i webdav.yourcompany.com
acl is_sp hdr_beg(host) -i sp.yourcompany.com
These are in the syntax:

acl acl_name match_expression case_insensitive(-i) what_to_match

So, acl is_my hdr_beg(host) -i my.yourcompany.com states:
- acl - define this as an acl.
- is_my - give the acl the name 'is_my'.
- hdr_beg(host) - set the match expression to use the host HTTP header, checking the beginning of the value.
- -i - set the check to be case insensitive
- my.yourcompany.com - the value to check for.
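In other words, hdr_beg(host) -i is a case-insensitive prefix match against the Host header. A rough Python equivalent (the sample hostnames are illustrative):

```python
# Roughly what "acl is_my hdr_beg(host) -i my.yourcompany.com" evaluates:
# does the Host header begin with the value, ignoring case?
def is_my(host):
    return host.lower().startswith("my.yourcompany.com")

print(is_my("MY.YourCompany.com"))       # True - case is ignored
print(is_my("my.yourcompany.com:443"))   # True - prefix match, port ignored
print(is_my("webdav.yourcompany.com"))   # False
```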
We then do some further mapping based on url paths in the request using some standard regex patterns:

# ACL for backend mapping based on url paths
acl robots path_reg ^/robots.txt$
acl alfresco_path path_reg ^/alfresco/.*
acl share_path path_reg ^/share/.*/proxy/alfresco/api/solr/.*
acl share_redirect path_reg ^$|^/$
These do the following:
- acl robots - checks for a web bot harvesting the robots.txt file
- acl alfresco_path - checks whether the request is trying to access the alfresco webapp. We block direct access to the Alfresco Explorer webapp so you can remove this check if you want that webapp available for use.
- acl share_path - We use this to block direct access to the Solr API.
- acl share_redirect - this checks whether the request is missing a context path (i.e. just '/' or nothing, rather than e.g. /share)
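These patterns can be sanity-checked outside HAProxy with any regex engine. A quick Python sketch (the sample request paths are hypothetical):

```python
import re

# The path_reg patterns as written in the HAProxy config above.
acls = {
    "robots":         r"^/robots.txt$",
    "alfresco_path":  r"^/alfresco/.*",
    "share_path":     r"^/share/.*/proxy/alfresco/api/solr/.*",
    "share_redirect": r"^$|^/$",
}

def matches(acl, path):
    """Return True if the given request path matches the named acl."""
    return re.search(acls[acl], path) is not None

print(matches("robots", "/robots.txt"))                                # True
print(matches("alfresco_path", "/alfresco/faces/jsp/login.jsp"))       # True
print(matches("share_path", "/share/page/proxy/alfresco/api/solr/q"))  # True
print(matches("share_redirect", "/"))                                  # True
print(matches("share_redirect", "/share/"))                            # False
```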
We next add in some 'good practice' - a HSTS header. You can find out more about HSTS here: https://www.owasp.org/index.php/HTTP_Strict_Transport_Security

Note: my.alfresco.com is in the built-in HSTS list in both Chrome and Firefox, so neither of these browsers will ever try to access the service over plain http (see http://www.chromium.org/sts).

# Changes to header responses
rspadd Strict-Transport-Security:\ max-age=15768000
We next set up some blocks; you can ignore these if you don't want to limit access to any service. The example below blocks public access to the Alfresco Explorer app via the 'my.yourcompany.com' route. These use the acls matched earlier, and can include multiple acls that must all be true.

# Blocked paths
block if alfresco_path is_my
Now we redirect to /share/ if this wasn't in the url path used to access the service.

# Redirects
redirect location /share/ if share_redirect is_my
Next we set up the list of backends to use, matched against the already defined acls.

# List of backends
use_backend share if is_my
use_backend webdav if is_webdav
use_backend sharepoint if is_sp
Then we set up the default backend to use as a catch-all:

default_backend share
Now we define the backends, the first being for Share:

backend share
On this backend, enable the stats page:

# Enable the stats page on share backend
stats enable
stats hide-version
stats auth <user>:<password>
stats uri /monitor
stats refresh 2s
The stats page gives you a visual view of the health of your backends and is a very powerful monitoring tool.

option httpchk GET /share
balance leastconn
cookie JSESSIONID prefix
server tomcat1 server1:8080 cookie share1 check inter 5000
server tomcat2 server2:8080 cookie share2 check inter 5000
These define the following:
- backend share - this defines a backend called share, which is used by the use_backend config from above.
- option httpchk GET /share - this enables http health checks, using an http GET, on the /share path. Server health checks are one of the most powerful features of HAProxy and work hand in hand with tomcat session replication to move an active session to another server if the server your active session is on fails its health checks.
- balance leastconn - this sets up the balancing algorithm. leastconn selects the server with the lowest number of connections to receive the connection.
- cookie JSESSIONID prefix - this enables cookie-based persistence in a backend. Share requires a sticky session and this also is used in session replication.
- server tomcat1 server1:8080 cookie share1 check inter 5000 - this breaks down into:
- server - this declares a server and its parameters
- tomcat1 - this is the server name and appears in the logs
- server1:8080 - this is the server address (and port)
- cookie share1 - this checks the cookie defined above and, if matched, routes the user to the relevant server. The 'share1' value has to match the jvmRoute set on the appserver for Share/Alfresco (for Tomcat see http://tomcat.apache.org/tomcat-7.0-doc/cluster-howto.html)
- check inter 5000 - this sets the health check, with an inter(val) of 5000 ms
Define the webdav backend. Here we hide the need to enter /alfresco/webdav on the url path, which gives a neater, shorter url for accessing webdav, and again we enable server health checking:

backend webdav
option httpchk GET /alfresco
reqrep ^([^\ ]*)\ /(.*) \1\ /alfresco/webdav/\2
server tomcat1 server1:8080 check inter 5000
server tomcat2 server2:8080 check inter 5000
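The reqrep line is just a regex substitution on the raw HTTP request line: group 1 captures the method, group 2 everything after the leading "/", and the replacement splices /alfresco/webdav/ in between. A Python sketch of the same rewrite (the sample request line is hypothetical; Python doesn't need HAProxy's escaped spaces):

```python
import re

# The reqrep pattern from the config, translated to Python syntax.
# Group 1 = HTTP method, group 2 = path and protocol after the leading "/".
pattern = r"^([^ ]*) /(.*)"
replacement = r"\1 /alfresco/webdav/\2"

def rewrite(request_line):
    """Rewrite a request line the way the webdav backend's reqrep does."""
    return re.sub(pattern, replacement, request_line)

print(rewrite("GET /docs/report.doc HTTP/1.1"))
# -> GET /alfresco/webdav/docs/report.doc HTTP/1.1
```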
Define the SPP backend. Here we define the backend for the SharePoint protocol, again with health checks:

backend sharepoint
balance url_param VTISESSIONID check_post
cookie VTISESSIONID prefix
server tomcat1 server1:7070 cookie share1 check inter 5000
server tomcat2 server2:7070 cookie share2 check inter 5000
Once this is all in place you should be able to start HAProxy. If you get any errors, you will be told which lines of the config they are on. Alternatively, if you have HAProxy set up as a service, you should be able to run 'service haproxy check' to check the config without starting HAProxy.

There are many more cool things you can do with HAProxy, so give it a go and don't forget to have a good read of the docs!
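You can also run the config check directly against the binary (assuming haproxy is on your PATH and your config lives in the usual place):

```shell
# -c validates the configuration and exits without starting the proxy
haproxy -c -f /etc/haproxy/haproxy.cfg
```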