GlusterFS spreads load using a distributed hash translation (DHT) that maps filenames to its subvolumes. Those subvolumes are usually replicated to provide fault tolerance as well as some load handling. The automatic file replication translator (AFR) departs from the traditional understanding of RAID and often causes confusion (especially when marketing people try to call it RAID to make it easier to sell). Hopefully, this should help clear that up.
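
To make the idea concrete, here is a toy Python sketch (not the actual GlusterFS algorithm, which maps a 32-bit hash of the filename onto layout ranges assigned to each directory's subvolumes) showing why hashing on the filename spreads files, and therefore load, across subvolumes. The subvolume names are hypothetical:

    import hashlib

    # Hypothetical subvolume names; in a replicated volume each of these
    # would itself be a replica set of bricks.
    subvolumes = ["replica-set-0", "replica-set-1", "replica-set-2"]

    def pick_subvolume(filename):
        # Hash the filename only (not the path or the contents) and map
        # the result onto one subvolume.  A simple modulo stands in for
        # the real DHT layout ranges.
        h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
        return subvolumes[h % len(subvolumes)]

    for name in ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]:
        print(name, "->", pick_subvolume(name))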

When should I use replication?

Fault Tolerance

Traditional filesystems handled fault tolerance with RAID. Often, that storage was shared between servers using NFS to allow multiple hosts to access the same files. This design leaves several single points of failure: the server's CPU, power supply, RAID controller, NIC, motherboard, memory, and software, as well as the network cable and the switch/router.

Traditionally, many web hosting implementations overcame this limitation by using some eventually consistent method (i.e., rsync on a cron job) to keep a copy of the entire file tree on the local storage of every server. Though this is a valid option, it comes with limitations: disk space is wasted, which becomes an increasing problem as more and more data is collected and analyzed; synchronization failures sometimes go unchecked; and user-provided content is not immediately available on every server.

Replacing the fault tolerance of RAID with replication allows you to spread that vulnerability across two or more complete servers. With multipath routing using two or more network cards, you can eliminate most single points of failure between your client and your data.

This works because the client connects directly to every server in the volume[1]. If a server or network connection goes down, the client will continue to operate with the remaining server or servers (depending on the level of replication). When the missing server returns, the self-heal daemon (or your own client, if you access a stale file) will update the stale server with the current data.

The downside to this method is that, in order to ensure consistency and that the client is not getting stale data, it needs to request metadata from each replica. This is done during the lookup() portion of establishing a file descriptor (FD). This can be a lot of overhead if you’re opening thousands of small files, or even if you’re trying to open thousands of files that don’t exist. This is what makes most PHP applications slow on GlusterFS.
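
A quick back-of-the-envelope calculation shows how that overhead adds up. The numbers below are hypothetical; plug in your own file count, replica count, and round-trip time:

    # Hypothetical figures: 1,000 small files opened to render one page,
    # replica 2, and a 0.5 ms LAN round trip per lookup.
    files = 1000
    replicas = 2
    rtt = 0.0005                      # seconds per round trip

    lookup_overhead = files * replicas * rtt
    print(f"{lookup_overhead:.1f} s spent in lookup() before any data is read")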

Load Balancing

By having your files replicated between multiple servers in a large-file, read-heavy environment, such as a streaming music or video provider, you can spread those large reads among multiple replicas. The replica translator works on a first-to-respond basis, so if a specific file becomes popular enough that it starts to cause load on one server, a less loaded server will respond first, ensuring optimal performance[2] for the end user.
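
As an illustration of that first-to-respond behavior, here is a toy Python sketch (not GlusterFS code; the server names are made up): the same read is offered to every replica, and whichever answer arrives first wins.

    import concurrent.futures
    import random
    import time

    REPLICAS = ["server-a", "server-b"]        # hypothetical replica pair

    def read_from(server, path):
        # Simulate per-server load with a random response delay.
        delay = random.uniform(0.01, 0.10)
        time.sleep(delay)
        return server, delay

    with concurrent.futures.ThreadPoolExecutor(len(REPLICAS)) as pool:
        futures = [pool.submit(read_from, s, "/media/track.flac") for s in REPLICAS]
        # Whichever replica answers first serves the read.
        server, delay = next(concurrent.futures.as_completed(futures)).result()
        print(f"read served by {server} in {delay * 1000:.0f} ms")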

What’s a poor way to use replication?

A copy on every server

Some admins, stuck in the mindset that if a file isn’t on the local server it won’t be available when another server goes down, increase the number of replicas with each new server. This increases the number of queries and responses necessary to complete the lookup and open an FD, which makes it take even longer. Additionally, writes are sent to every replica simultaneously, so your bandwidth for writes is divided by the number of replicas.
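
The write penalty is easy to see with a rough calculation (hypothetical numbers, assuming the client sends each write to every replica over a single uplink):

    client_uplink_mbps = 1000              # a 1 GbE client NIC

    for replica_count in (2, 3, 10):
        # Each write goes to every replica, so the uplink is shared.
        effective = client_uplink_mbps / replica_count
        print(f"replica {replica_count}: ~{effective:.0f} Mbps of write throughput")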

You probably don’t really want that behavior. What you want is for the file to be available to every server despite hardware failure. This does require some prediction: decide how likely simultaneous failures are within a replica set. If you think that out of N machines you’re only likely to have one machine down at any one time, then your replica count should be 2. If you consider two simultaneous failures likely, you only need replica 3.

When determining failure probability, look at your system as a whole. If you have 100 servers and you predict that you might have as many as 3 failures at one time, what’s the likelihood that all three servers will be part of the same replica set?
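
Here is one way to work that out (the counts are illustrative and assume the servers are grouped evenly into replica sets):

    from math import comb

    # Replica 3: 99 servers in 33 sets of three, 3 simultaneous failures.
    # A set is lost only if all three failures land in the same set.
    p_lost_r3 = 33 / comb(99, 3)
    print(f"replica 3: P(a whole set fails) = {p_lost_r3:.5f}")   # ~0.0002

    # Replica 2: 100 servers in 50 pairs, 3 simultaneous failures.
    # A set is lost if both members of any pair are among the failures.
    safe = comb(50, 3) * 2**3          # all three failures in distinct pairs
    p_lost_r2 = 1 - safe / comb(100, 3)
    print(f"replica 2: P(a whole set fails) = {p_lost_r2:.3f}")   # ~0.03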

Most admins that I’ve spoken with use a simple replica count of 2. Ed Wyse & Co, Inc. has fewer servers, so a replica count of 3 is warranted, as the likelihood of simultaneous failures within a replica set is higher.

Across high-latency connections

GlusterFS is latency dependent. Since self-heal checks are done when establishing the FD and the client connects to all the servers in the volume simultaneously, high-latency (multi-zone) replication is not normally advisable. Each lookup will query both sides of the replica. A simple directory listing of 100 files across a 200 ms connection would require 400 round trips, totaling 80 seconds. A single Drupal page could take around 20 minutes.
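
The arithmetic behind that directory-listing figure, using the numbers from the text (the two round trips per file per replica are implied by the 400-trip total):

    files = 100
    replicas = 2
    trips_per_lookup = 2               # implied by the 400-round-trip total
    rtt = 0.200                        # 200 ms per round trip

    round_trips = files * replicas * trips_per_lookup      # 400
    print(f"{round_trips * rtt:.0f} s just to list the directory")    # 80 s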

To replicate read-only data[3] across zones, use geo-replication.

Summary

Do

  • Use replica 2 or 3 for most cases

  • Replicate across servers to minimize SPOF

Don’t

  • Require that your clients are also servers (they may be, but that should be a decision made independently)

  • Replicate to every server just to ensure data availability

  • Replicate across zones

These are, of course, general practices. There are reasons to break these rules, but in doing so you’ll find other complications. As with every bit of advice I offer on this blog, feel free to break the rules, but know why you’re breaking them.

[1] Only when using the FUSE client.
[2] There are further improvements slated for this routine to more actively spread the load.
[3] As of 3.3, geo-replication is one-way.