Why Replicated Filesystems Are Hard

What is a replicated filesystem?

A replicated filesystem is one that stores data across multiple storage devices. Some replicated filesystem designs work at the block level, some at the file level, and some use a hybrid of both concepts.

DRBD is an example of a replicated system that works at the block level. One of the replicas is the active one: the client builds a filesystem on that block device and (in theory) keeps access to the same data if the master host dies, by failing over to the slave host and promoting it to master. This allows only one client at any one time. To serve multiple clients, the filesystem is usually re-shared using NFS.

GlusterFS manages its replication at the file level. The client’s POSIX calls are replicated to multiple traditional filesystems (XFS, ext3, etc.) transparently. This natively allows multiple clients, provided they use locking correctly.

Ceph and MooseFS use what I’m referring to as a hybrid. They have chunks/blocks that they distribute across multiple traditional filesystems. Typically the client connects to one server at a time and replication is handled by that server.

Even running rsync on a cron job to keep two sites synchronized is a form of replication.

How does replication work?

Simple Rsync

Let’s look at the last example above as the simplest method. You have two servers, one in Melbourne, Australia, the other in London, England. You have a simple text file that you use to track which micro-brews you’ve tried. While in London, you add:

The Kernel, Pale Ale

At the top of the hour, your cron job rsync’s that to your Australia office. All is good. You hop on a plane and head down under. Eventually you arrive and at the end of a very long day you and some friends go out for a pint. You open your text file and add to it:

The Kernel, Pale Ale  
Mountain Goat, Hightail Ale

and that rsync’s back. So far so good. This is a very simple form of replication.

Collaborative Tools

Git is an example of a collaborative tool. Using git, you’ve set up a repository on your London computer.

When in Melbourne you clone the repo, make your edits, and push them back up.

Back in London, you do the same thing. Your repo manages the consistency and you have replicas in the form of working trees at both locations.

Block Based Tools

A block based tool, like DRBD, works by duplicating the blocks underlying the filesystem. They often have a journal that caches the writes and feeds them to the stand-by store to mitigate data bursts. Since they’re master-slave, you can work on the local master when you’re in London. When you reach Melbourne, you promote the Melbourne side to master and edit your file. Blocks are replicated on each write from whichever side is currently the master.

Clustered Filesystems

Unlike the other tools, clustered filesystems are designed for not just one person switching between locations, but thousands of clients all operating on the same filesystem simultaneously. If that doesn’t sound more complicated, then I don’t know what does.

In London you write to your file. Your text editing application locks the file. That lock is announced to the cluster servers. If nobody else has a lock, the lock is granted (yes, more granular locking is also possible, but we’re going with the simpler whole-file lock for the moment). The write is sent. The intention to write a block is flagged in metadata. The block is sent to the servers. The block is written. The metadata flag is released. If a server goes down at any time during that write process, those flags record that the file has a failed pending write, and the file will be repaired. Once that write is handed off to the local filesystem, the client is informed and the write operation is concluded (unless you’ve set fsync, in which case one or two clustered filesystems actually honor it and don’t return until the write is on disk).
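
The flag, write, release sequence above can be sketched in a few lines of Python. This is a simplified in-memory model, not how any particular clustered filesystem implements it: the Server class, the pending set, and replicated_write are all names invented for illustration, and locking is left out entirely.

```python
class Server:
    """A storage server holding one replica (hypothetical in-memory model)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = ""
        self.pending = set()  # replicas this server believes missed a write


def replicated_write(servers, data):
    """Flag the intent, write, then clear flags only for replicas that acknowledged."""
    live = [s for s in servers if s.up]
    all_names = {s.name for s in servers}

    # 1. Record the intention to write on every reachable server.
    for s in live:
        s.pending |= all_names
    # 2. Send and write the data.
    for s in live:
        s.data = data
    # 3. Release the flag for each replica that completed the write.
    acked = {s.name for s in live}
    for s in live:
        s.pending -= acked
    return acked


london, melbourne = Server("london"), Server("melbourne")
melbourne.up = False  # Melbourne drops out before the write
replicated_write([london, melbourne], "The Kernel, Pale Ale\n")
print(london.pending)  # {'melbourne'}: Melbourne's copy needs repair
```

The flag that is left behind on the surviving server is what lets a later repair pass know which copy is stale.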

So what makes replication hard?

Rsync

Now your friends all think that this sounds like fun and they want you all to share a single list to see if, as a group, you can try every micro-brew on both continents.

At 11:30am, Curtis goes to Herne Hill for an early lunch at The Florence. He adds it to the list:

The Kernel, Pale Ale  
Mountain Goat, Hightail Ale  
The Florence, Beaver

At 10:40pm, Saacha is enjoying the release party her company is throwing at 2 Brothers and adds to the list as well. Since the cron job hasn’t run yet, her edit looks like:

The Kernel, Pale Ale  
Mountain Goat, Hightail Ale  
2 Brothers, TAXI!

Now for the fun part. The cron job runs. Saacha’s edit is more recent so Curtis’ edit is lost. This happens because each replica is not aware of the other. There’s no coordination of locks or edits. During the interim between edits and synchronization, the files were in a state known as split-brain. This was easily resolved because our process was stupid enough to resolve it at the cost of data loss.
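
The lost update can be reproduced with a small sketch. It models each site as a copy of the file plus a modification time, and it assumes the cron job lets the newer copy overwrite the older one (roughly what running rsync --update in each direction would do). Replica and sync_newest_wins are made-up names for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One site's copy of the beer list (hypothetical model, not rsync itself)."""
    lines: list
    mtime: float  # time of the last local edit


def sync_newest_wins(a: Replica, b: Replica) -> None:
    """Whole-file sync: the copy with the newer mtime overwrites the other."""
    newer, older = (a, b) if a.mtime >= b.mtime else (b, a)
    older.lines = list(newer.lines)
    older.mtime = newer.mtime


base = ["The Kernel, Pale Ale", "Mountain Goat, Hightail Ale"]
london = Replica(list(base), mtime=0)
melbourne = Replica(list(base), mtime=0)

# Curtis edits in London, then Saacha edits in Melbourne before the next sync.
london.lines.append("The Florence, Beaver")
london.mtime = 1
melbourne.lines.append("2 Brothers, TAXI!")
melbourne.mtime = 2

sync_newest_wins(london, melbourne)
print(london.lines)  # Curtis' entry is gone from both copies
```

Nothing in the sync looks at the contents of the file, so it has no way to notice that both copies changed.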

We don’t normally want data loss, so we resort to other tools.

Collaborative Tools

For the problem above, one solution is a revision control system. Each party checks out a copy of the file, makes edits, then checks those edits in. As long as those edits are in different parts of the text file, the RCS handles the merging of changes just fine. In the event of contention, the user is responsible for resolving the conflict and checking their change back in. This does prevent data loss, and works fine for text files, but what about random changes in a binary file?

Take the old CISAM files for instance. Fixed length records could be edited anywhere in the file. Unless the RCS was specifically written to understand your record structure and is set up with rules to merge them, it’s not going to work. As far as most RCS are concerned, there’s (effectively) only one line being edited, and it’s always going to be in contention.
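
To make "written to understand your record structure" concrete, here is a rough sketch of a record-aware three-way merge. It assumes the file has already been split into equal-length records and that no records were added or removed. The record layout and the merge_records function are invented for illustration and do not describe any real RCS or ISAM tool.

```python
def merge_records(base, ours, theirs):
    """Three-way merge of equal-length lists of fixed-length records.

    Returns (merged, conflicts), where conflicts lists the record indexes
    that both sides changed in different ways.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles files with the same record count")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:      # both sides agree (or neither changed)
            merged.append(o)
        elif o == b:    # only theirs changed
            merged.append(t)
        elif t == b:    # only ours changed
            merged.append(o)
        else:           # both changed the same record differently
            merged.append(b)
            conflicts.append(i)
    return merged, conflicts


base   = [b"ACCT0001 0100", b"ACCT0002 0200", b"ACCT0003 0300"]
ours   = [b"ACCT0001 0150", b"ACCT0002 0200", b"ACCT0003 0300"]
theirs = [b"ACCT0001 0100", b"ACCT0002 0200", b"ACCT0003 0275"]
merged, conflicts = merge_records(base, ours, theirs)
print(merged, conflicts)
```

Without the knowledge of where one record ends and the next begins, the same two edits look like competing changes to a single blob.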

Block Based Tools

Since our DRBD device is master-slave, our buzzed friends above will have to choose which machine to edit their list on. One of them is going to have some pretty high latency. The storage will be quick, but the shell they’re using will suck. When a netsplit happens, some process is going to trigger a failover.

London loses ping to Melbourne. London was the master. It’s happy staying the master and allows Curtis to update the file.

Melbourne loses ping to London. Melbourne has two choices: stay a slave and hope London didn’t just sink into the Atlantic, or assume that London’s server has crashed and become the master so that the uptime guarantees are met. It makes the choice to provide availability (based on configuration choices we made as administrators) and fails over. Saacha now adds her next entry.

The netsplit is resolved and both servers can now see each other again. They’re both masters. They both have dirty data blocks.

This device is now split-brained and both ends will go into a read-only mode to allow you, the administrator, to figure out which one is sane and lose the data from the other one.

This becomes very fun when those files are CISAM files and you have to figure out which blocks hold data that the other side needs, then pick records out and manually merge them back in so you don’t lose thousands of financial transactions.

Clustered Filesystems

One nice thing about clustered filesystems is they’re much more resilient. Because they’re file based, you don’t lose the entire filesystem if a split-brain occurs. The different systems place varying emphasis on Consistency, Availability, and Partition Tolerance (see CAP theorem). None provide all three and a pony, but they do amazingly well. GlusterFS, the one I’m the most familiar with, emphasizes consistency and availability.

Our beer list is now on a GlusterFS volume with bricks (storage nodes) in London and Melbourne. Because GlusterFS emphasizes consistency most of all, this is going to make for some pretty high latency writes. The partition tolerance efforts are going to hurt too. The minimum latency you can get between London and Melbourne (if you had an end-to-end fiber connection with no data loss running in as straight a line as you could possibly get) would be 113ms RTT. Your standard lookup(), open(), flock(), write(), close() is looking at a bare minimum of 565ms (most apps throw a stat() check in there too to make sure the file is a file and not a directory or something).
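
The 565ms figure is just the round trip multiplied by the number of calls. A quick back-of-the-envelope check, assuming each call costs exactly one round trip and nothing is pipelined or cached:

```python
RTT_MS = 113  # best-case London-Melbourne round trip quoted above

# Assumption: one full round trip per call, no pipelining.
calls = ["lookup", "open", "flock", "write", "close"]
print(len(calls) * RTT_MS)        # 565
print((len(calls) + 1) * RTT_MS)  # 678 with the extra stat()
```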

We don’t care about latency for this project. At this point we’re having enough beers that the wait is actually rather pleasant. Suddenly, netsplit.

We have two options.

If we did not configure quorum, each client will lose connection to the remote server. Each user will be able to write to our beer list. The file is flagged with pending writes for the missing server (the far server from the perspective of either end). When the netsplit resolves, the filesystem attempts to heal the file. It recognizes pending writes from both clients and marks the file split-brain. Manual repair is necessary.
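
The heal decision can be sketched as a check of who blames whom. This is a simplified two-replica model with an invented data structure (a dict mapping each replica to the set of replicas it has pending writes for). It is meant to show the logic, not GlusterFS’s actual on-disk metadata.

```python
def heal_decision(pending):
    """Decide how to heal a two-replica file from its pending-write flags.

    `pending` maps each replica name to the set of replicas it blames for
    having missed writes (a hypothetical structure for this sketch).
    """
    a, b = sorted(pending)
    a_blames_b = b in pending[a]
    b_blames_a = a in pending[b]
    if a_blames_b and b_blames_a:
        return "split-brain"  # both took writes the other missed
    if a_blames_b:
        return f"copy {a} -> {b}"
    if b_blames_a:
        return f"copy {b} -> {a}"
    return "in sync"


# Both sides accepted a write during the netsplit.
print(heal_decision({"london": {"melbourne"}, "melbourne": {"london"}}))
# Only London accepted a write.
print(heal_decision({"london": {"melbourne"}, "melbourne": set()}))
```

When only one side has pending writes there is an obvious source for the repair. When both do, neither copy can be trusted to be a superset of the other.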

If you configured quorum, neither client would be able to write to the file as the minimum quorum could not be met. Partition tolerance is maintained at the cost of availability. This may be less than desirable when you have two drunks banging on the keyboard wondering why they can’t save their edits. If you had a 3 replica volume, adding one replica in Texas for instance, and enforced quorum, one of the two ends would likely still be able to write their files as long as the connection loss didn’t affect all three replicas.
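
A strict-majority rule is enough to see why two replicas deadlock and three do not. The has_quorum function below is an illustrative assumption. Real systems, GlusterFS included, offer configurable quorum rules that may differ from this.

```python
def has_quorum(reachable, total):
    """Strict majority: more than half of all replicas must be reachable."""
    return reachable > total / 2


# Two replicas, netsplit: each side sees only itself.
print(has_quorum(1, 2))  # False, so neither side may write
# Three replicas (London, Melbourne, Texas): the side that still sees two wins.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
```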

What can I do as a sysadmin?

Know what your needs are.

More often than not I see would-be clustered filesystem users installing them all and running dd, iotop, bonnie++ to see which one’s the fastest. None of these tests take into account the real-world problems of system design.

What good would the fastest be if you lost all your data when your carrier drops? Or if it was really fast for one client, but not for hundreds?

Design for your requirements. Look at where you want your network to be able to be in 5 years. How many simultaneous users? How much data? Will your 1,000,000 IOPS all be reading the same 10 files? Writing 100,000? Design for it. When you know what your needs truly are, then find the tools that best provide for that need.

Don’t think linearly, think multithreaded and to scale.

*For more ways of creating split-brain files, see “What is split brain in GlusterFS and how can I cause it?”