Tomcat Session Replication with BackupManager
My current project uses BackupManager for session replication in Tomcat, and I wanted to understand it better. I set up a quick example to try it out, and thought my experience might be useful to others who use this approach or are considering it.
We can run a cluster of Tomcat servers in a number of ways:
- No session replication — if the Tomcat server a customer is using dies, they lose that session (e.g. the contents of their shopping basket)
- DeltaManager — this replicates sessions to all Tomcat servers in the cluster. Meant to be good for small clusters (2 or 3 servers) but too chatty for larger clusters
- BackupManager — this replicates sessions to one other server and lets all other servers in the cluster know where that session is stored. Meant to be better for larger clusters (maybe up to 7 or 8 servers)
- Shared sessions — the Tomcat servers do not replicate the sessions, but the storage for the session is shared across all servers. This is more complex to set up, but can handle larger Tomcat clusters well
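To make the difference between the replicating options concrete, here is a minimal sketch of how many servers end up holding a copy of a single session under each approach. The function name and the manager labels are my own inventions for illustration; they are not Tomcat identifiers.

```python
def session_copies(cluster_size: int, manager: str) -> int:
    """Return how many servers hold a copy of one session.

    The manager labels ("none", "delta", "backup") are hypothetical
    names used only in this sketch.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if manager == "none":
        return 1                      # only the server that created it
    if manager == "delta":
        return cluster_size           # replicated to every server
    if manager == "backup":
        return min(2, cluster_size)   # primary plus one backup
    raise ValueError(f"unknown manager: {manager}")


if __name__ == "__main__":
    for size in (2, 4, 8):
        print(size, session_copies(size, "delta"), session_copies(size, "backup"))
```

The number of copies grows with the cluster under DeltaManager but stays at two under BackupManager, which is the usual argument for preferring BackupManager as the cluster grows.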
The remainder of this post covers only BackupManager.
My Test Environment
To try out the Tomcat BackupManager, I am running four Tomcat servers (on a single host) and using HA Proxy to load balance across them. For this configuration to work, I also needed to enable sticky sessions on HA Proxy. Because this is all running on one host, the Tomcat servers and HA Proxy all need to listen on different ports.
I also put together some housekeeping scripts which allow me to easily set up the environment and run simple tests on it.
To try this all out, the first thing I did was fire it up without clustering.
Note that in the following examples, you can ignore the references to Varnish. I have included Varnish in this environment for other reasons and it is not used in this BackupManager example.
We can now check that everything is working by looking at the HA Proxy admin page, which should show that HA Proxy itself is working and that all the Tomcat servers are running and load-balanced:
Everything looks good there (loads of green) so now we can move on to look at the simple test application I put together for this:
As you can see it’s just a simple page which displays the information I am interested in – which Tomcat server has served the page, the session ID and a timestamp (so that we know we are getting a fresh page and not something cached somewhere).
Back at the command line, I have also put together a script which checks which sessions are stored in each Tomcat server:
As you can see, this ties in with the web page we’ve just seen – the only session is in Tomcat server #2 and the session ID matches the one displayed in the web page.
The last thing to do is check that session stickiness is working, for which I have a test script:
The above demonstrates that session stickiness is working (consecutive requests via HA Proxy are sent to the same Tomcat server) and also that when no session is established, HA Proxy uses a round robin approach (when the cookies are cleared it switches to another server and all four servers are used over four sets of requests).
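The routing behaviour described above can be sketched as a small simulation. This is not HA Proxy's implementation. It assumes cookie-based stickiness where the cookie value simply names the server, and the class and method names are hypothetical.

```python
from itertools import cycle
from typing import Iterable, Optional, Tuple


class StickyBalancer:
    """Toy load balancer: sticky when a valid cookie is sent, round robin otherwise."""

    def __init__(self, servers: Iterable[str]):
        self._servers = list(servers)
        self._alive = set(self._servers)
        self._round_robin = cycle(self._servers)

    def mark_down(self, server: str) -> None:
        """Record that a health check has found this server unavailable."""
        self._alive.discard(server)

    def _next_alive(self) -> str:
        # Advance the round-robin pointer until a live server is found.
        for _ in range(len(self._servers)):
            candidate = next(self._round_robin)
            if candidate in self._alive:
                return candidate
        raise RuntimeError("no servers available")

    def route(self, cookie: Optional[str]) -> Tuple[str, str]:
        """Return (server chosen, cookie value the client should keep)."""
        if cookie in self._alive:
            return cookie, cookie        # stickiness: same server again
        server = self._next_alive()      # no cookie, or its server is down
        return server, server


if __name__ == "__main__":
    balancer = StickyBalancer(["tomcat1", "tomcat2", "tomcat3", "tomcat4"])
    for _ in range(4):
        server, cookie = balancer.route(None)      # cookies cleared
        repeat, _ = balancer.route(cookie)         # same client again
        print(server, repeat)
```

Each pass through the loop clears the cookie first, so the four passes land on four different servers, while the follow-up request in each pass stays on the server named in the cookie.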
Finally, the session table shows that all the sessions end up in the correct Tomcat servers (and it includes the session we ran via the browser at the start).
For those interested in the details, the test script uses the WWW::Mechanize module (from Perl CPAN) to give a programmatic version of a browser, which handles cookies, etc. The HTML returned is parsed to pull out the server name and session ID (as shown in the browser session earlier) and the whole thing is wrapped in a test harness to check that the results coming back from the servers are in line with expectations.
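My script is written in Perl, but the parsing and checking step looks roughly like the following Python sketch. The page layout in SAMPLE_PAGE, the field labels and the function names are all made up for illustration, so the pattern would need adjusting to match the real test page.

```python
import re
from typing import Dict, Iterable

# Hypothetical page layout; the real test page may be marked up differently.
SAMPLE_PAGE = """
<html><body>
<p>Server: tomcat2</p>
<p>Session ID: 0123456789ABCDEF0123456789ABCDEF</p>
<p>Timestamp: 12:00:00</p>
</body></html>
"""

FIELD = re.compile(r"<p>\s*(Server|Session ID|Timestamp):\s*(.*?)\s*</p>")


def parse_page(html: str) -> Dict[str, str]:
    """Pull the labelled fields out of the page into a dictionary."""
    return {name: value for name, value in FIELD.findall(html)}


def is_sticky(pages: Iterable[str]) -> bool:
    """True when every page came from the same server with the same session ID."""
    seen = {(p["Server"], p["Session ID"]) for p in map(parse_page, pages)}
    return len(seen) == 1


if __name__ == "__main__":
    print(parse_page(SAMPLE_PAGE))
    print(is_sticky([SAMPLE_PAGE, SAMPLE_PAGE]))
```

A real run would fetch each page over HTTP with a cookie-aware client rather than reusing a fixed string.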
Use of BackupManager
Now that we can see our environment is behaving as expected without session clustering, it’s time to rebuild the environment with clustering enabled:
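The exact configuration used for this rebuild is not reproduced here. As a rough sketch, clustering is enabled by adding a Cluster element to each server's server.xml, with a Manager child that names BackupManager. The Python below only generates the bare minimum of such a fragment; the function name is hypothetical, and a real setup will usually need further settings that I have left out (for example, distinct receiver ports when several Tomcat servers share a host).

```python
import xml.etree.ElementTree as ET


def backup_manager_cluster() -> ET.Element:
    """Build a minimal <Cluster> element that selects BackupManager.

    Only the two class names are set; every other attribute is left
    at whatever Tomcat defaults to.
    """
    cluster = ET.Element(
        "Cluster",
        className="org.apache.catalina.ha.tcp.SimpleTcpCluster",
    )
    ET.SubElement(
        cluster,
        "Manager",
        className="org.apache.catalina.ha.session.BackupManager",
    )
    return cluster


if __name__ == "__main__":
    # Print the fragment so it can be pasted inside <Engine> or <Host>.
    print(ET.tostring(backup_manager_cluster(), encoding="unicode"))
```

The web application also has to be marked as distributable in its web.xml, otherwise its sessions are not replicated.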
We can now run the same test script we ran earlier and check what the results look like:
The tests themselves run the same way and produce the same results, but the table of sessions in the Tomcat servers is different. Instead of containing just the 8 session IDs you would get without clustering, there are now 16, since each session has a replicated copy on another server.
The session from the last test is highlighted in the table and illustrates the point. As expected, the session ID is present in Tomcat server #4, since that is the server we got the page from. However, it is also present in Tomcat server #3 – this is a replication of the session, courtesy of BackupManager. If you like, you can go through and check all the other sessions created during the tests and see the same effect.
Obviously, this means that if Tomcat server #4 crashes unexpectedly, the customer can continue their browsing experience without any knowledge this has happened – since their session can be picked up elsewhere.
Let’s make that happen, by killing off Tomcat server #4 and seeing what happens to the sessions:
As you can see, Tomcat server #4 is now returning an error when we try to access it. But more importantly, we still have our session on Tomcat server #3 – ready for the user’s next page request. We can also notice that the session replication has kicked in again and our session is now also on Tomcat server #2.
More generally, we still have all the sessions we had before plus their replicated copies – it’s just that now they are spread across three servers rather than four.
So now let’s see what happens if the user tries to continue using their session:
At this point, because HA Proxy has detected that Tomcat server #4 is no longer available, it stops applying stickiness for this session and round robin kicks in again. In this case, it ends up directing the user to Tomcat server #1. This is an interesting situation, since the session was not stored on Tomcat server #1 (see previous screen grab). However, as stated at the start, BackupManager not only replicates the session to one other server but also lets all servers know where that session is available. Tomcat server #1 has therefore retrieved the session from either Tomcat server #2 or Tomcat server #3. Under BackupManager, a session lives on the server handling it plus exactly one backup server. The session is therefore now on Tomcat server #1, its backup copy is still on Tomcat server #2, and it has been removed from Tomcat server #3.
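The sequence of events above can be captured in a toy model. This models only the behaviour observed in this test, not Tomcat's actual implementation. In particular, the rule for choosing the backup server (the nearest live server walking backwards through the list) is an arbitrary assumption, and all class and method names are hypothetical.

```python
from typing import Dict, List, Set


class ToyBackupCluster:
    """Each session has one primary copy and one backup copy."""

    def __init__(self, servers: List[str]):
        self.servers = list(servers)
        self.alive: Set[str] = set(servers)
        self.primary: Dict[str, str] = {}
        self.backup: Dict[str, str] = {}

    def _pick_backup(self, primary: str) -> str:
        # Assumption: walk backwards from the primary to the nearest live server.
        start = self.servers.index(primary)
        for step in range(1, len(self.servers)):
            candidate = self.servers[(start - step) % len(self.servers)]
            if candidate in self.alive:
                return candidate
        raise RuntimeError("no live server left to hold the backup")

    def create(self, session: str, server: str) -> None:
        self.primary[session] = server
        self.backup[session] = self._pick_backup(server)

    def kill(self, server: str) -> None:
        self.alive.discard(server)
        for session in list(self.primary):
            if self.primary[session] == server:
                # The backup copy takes over and is backed up again elsewhere.
                self.primary[session] = self.backup[session]
                self.backup[session] = self._pick_backup(self.primary[session])
            elif self.backup[session] == server:
                self.backup[session] = self._pick_backup(self.primary[session])

    def request(self, session: str, server: str) -> None:
        old_primary = self.primary[session]
        if server == old_primary:
            return
        if server == self.backup[session]:
            self.backup[session] = old_primary   # the two copies swap roles
        # Otherwise the session moves here and the old primary copy is dropped.
        self.primary[session] = server

    def holders(self, session: str) -> Set[str]:
        return {self.primary[session], self.backup[session]}


if __name__ == "__main__":
    cluster = ToyBackupCluster(["tomcat1", "tomcat2", "tomcat3", "tomcat4"])
    cluster.create("my-session", "tomcat4")
    print(sorted(cluster.holders("my-session")))
    cluster.kill("tomcat4")
    print(sorted(cluster.holders("my-session")))
    cluster.request("my-session", "tomcat1")
    print(sorted(cluster.holders("my-session")))
```

With this particular selection rule the model happens to follow the same path as the test: the session starts on servers #4 and #3, moves to #3 and #2 when #4 is killed, and ends up on #1 and #2 after the next request.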
Comments about the Load Balancer
It’s probably worth noting a couple of points around the interaction of the load balancer and the Tomcat servers. Most importantly, the load balancer does not have any visibility of the session replication between the Tomcat servers. As far as the load balancer is aware, a Tomcat server has disappeared from the cluster – but it has no visibility of where the sessions from that Tomcat server are replicated. That’s why it kicks in with round robin again in the above example.
The second point concerns how the load balancer detects whether servers are available. For example, the version of HA Proxy I am using only allows a simple heartbeat. This means there is a potential window in which the load balancer thinks the server is still available even after it has crashed (and will still try to route requests to it).
Although the session does not get lost, the user will have a poor experience whilst accessing the site during that window, since they will get an error screen instead of the expected page. This will resolve itself once the load balancer has detected the server has crashed and then switches to another server where the session replication kicks in again.
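The size of that window can be estimated from the health-check settings. The sketch below assumes a check that runs every inter seconds and marks a server down after fall consecutive failures (HA Proxy has server options with these names). It ignores check timeouts, and the values 2.0 and 3 are illustrative rather than taken from my configuration.

```python
from typing import Tuple


def detection_window(inter_s: float, fall: int) -> Tuple[float, float]:
    """Return (best case, worst case) seconds until a crash is detected.

    Best case: the server dies just before a check, so the first failed
    check is almost immediate and (fall - 1) more intervals follow.
    Worst case: it dies just after a successful check, so a full
    interval passes before each of the fall failed checks.
    """
    if inter_s <= 0 or fall < 1:
        raise ValueError("inter_s must be positive and fall at least 1")
    return (fall - 1) * inter_s, fall * inter_s


if __name__ == "__main__":
    best, worst = detection_window(inter_s=2.0, fall=3)
    print(f"requests may still hit the dead server for {best} to {worst} seconds")
```

Shortening the interval or lowering the failure threshold narrows the window, at the cost of more check traffic and a greater chance of marking a healthy server down after a brief hiccup.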
To illustrate this, I will run the same set of tests as before, but this time I will fire the next request in immediately (so the load balancer does not have time to detect a Tomcat server has been stopped).
On this run, we have ended up with Tomcat server #3 as the one we used last, as can be seen from the comments in the output and the highlighted session ID in the sessions table:
So, now we stop Tomcat server #3:
As before, the sessions which were on Tomcat server #3 are replicated onto the surviving servers.
For this example, we are immediately firing our next request, so the load balancer has not had time to detect that Tomcat server #3 has stopped, and still routes requests to it. This results in a failure in our tests when trying to retrieve the page (a real user would get an error page displayed in their browser):
If we try again, the load balancer has now had time to check the heartbeat and detect that Tomcat server #3 has stopped. It reverts to round robin and directs the request to Tomcat server #1. That Tomcat server then retrieves the session and we can carry on as before:
So, it’s a robust solution in the sense that the session is not lost, but from a business perspective there is a small window in which an error page could be displayed and a sale lost – although most people would probably just refresh a couple of times and then get their session back as in the above example.
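The "refresh a couple of times" behaviour can also be built into a client or test script. The following is a minimal sketch with hypothetical names: fetch stands for whatever function requests the page, and it is assumed to raise OSError when the request fails.

```python
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[], str],
                     attempts: int = 3,
                     delay_s: float = 1.0) -> str:
    """Call fetch() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_s)   # give the load balancer time to notice
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch() -> str:
        # Simulated page request: fails once, then succeeds.
        calls["count"] += 1
        if calls["count"] == 1:
            raise ConnectionError("server stopped")
        return "page served with the original session"

    print(fetch_with_retry(flaky_fetch, delay_s=0.1))
```

Retrying only hides the window from the caller. It does not remove it, so the delay needs to be long enough for the load balancer's health check to catch up.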
I hope this has provided a useful illustration of using BackupManager for session replication in Tomcat. Although it is a small example with a “toy” application, I think it illustrates the principles well.