Should I use Stripe on GlusterFS?

New users frequently come into #gluster with a stripe volume as their first-ever GlusterFS volume. Why? Because they’re sure that’s the right way to get better performance.

That ain’t necessarily so. The stripe translator was designed to allow a file to exceed the size of a single brick, not to provide parallel reads and writes.

The Expectation

In a RAID0 stack, you use striping to allow each drive to operate in parallel. This will give both read and write performance increases, especially if you tune stripe sizes to your use. Your bottleneck is still (most likely) going to be the drive speed, because every other piece along the way is faster. Operations typically come from a small handful of applications, and seeks are likely kept to a minimum. So why would striping across multiple computers be any different?

Reality

Multiple Clients

Networked filesystems typically serve a myriad of clients, each with its own task. This can cause a wild array of file requests that the server is going to try to satisfy as quickly as possible, which requires reading and writing data all over the disk. If only a few files are typically accessed, this may not be a problem, but more generally it seems to be.

With a distributed volume, file requests will usually end up being spread evenly among your servers, allowing fewer disk seeks.

Load Balance

With a striped volume, every file's I/O lands on several of those servers, but not necessarily equally. If your typical file size lines up evenly with the stripe*bricks size, then the load should be pretty equal, but since most files won't, the first server in each stripe set will actually have a higher load. This is because offset 0 of any file is always on the first subvolume.
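To make the imbalance concrete, here's a rough sketch of a round-robin stripe layout. It is not GlusterFS code: the function names, the 128 KiB block size, and the workload are all assumptions picked for illustration. The only thing taken from the text above is that offset 0 of every file lands on the first subvolume.

```python
STRIPE_BLOCK = 128 * 1024  # assumed stripe block size in bytes, for illustration only


def brick_for_offset(offset, brick_count, block=STRIPE_BLOCK):
    """Round-robin layout: the block holding offset 0 is always on brick 0."""
    return (offset // block) % brick_count


def bytes_per_brick(file_sizes, brick_count, block=STRIPE_BLOCK):
    """Sum the bytes each brick ends up holding for a list of file sizes."""
    load = [0] * brick_count
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunk = min(block, size - offset)
            load[brick_for_offset(offset, brick_count, block)] += chunk
            offset += chunk
    return load


if __name__ == "__main__":
    # Hypothetical workload: many small files plus a few larger ones.
    sizes = [4 * 1024] * 1000 + [300 * 1024] * 10
    print(bytes_per_brick(sizes, brick_count=4))
```

With that made-up workload on four bricks, the first brick holds the bulk of the bytes and the last brick holds none, because no file is long enough to reach it.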

With a distributed volume, the files are already spread among the servers, so file I/O is spread fairly evenly among them as well. That probably provides the benefit you might expect from stripe.

Network

Obviously, if your network speed is slower than your disk speed, it doesn’t really matter how many disks you have working on a file; you’re still stuck with your network speed.
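As a back-of-the-envelope sketch, the ceiling is whichever is smaller: the combined disk speed or the network link. The function name and the numbers below are hypothetical, and the model ignores protocol overhead, caching, and everything else that makes real benchmarks interesting.

```python
def effective_throughput(disk_mb_s, disk_count, network_mb_s):
    """Upper bound on throughput in MB/s: the slower of disks and network wins."""
    return min(disk_mb_s * disk_count, network_mb_s)


if __name__ == "__main__":
    # Assumed numbers: 150 MB/s per disk, and a 1 Gbit/s link (125 MB/s at best).
    for disks in (1, 2, 4):
        print(disks, effective_throughput(150, disks, 125))
```

Under those assumptions, one disk already saturates the link, so striping across more of them changes nothing.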

Data Integrity

Striped files are going to be lost. When a hard drive fails, every file striped across it is gone. The more disks you add to a stripe, the higher the likelihood that one of them fails. If you decide that you are going to use stripe, have a backup plan.
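For a feel of how fast that likelihood grows, here's a small sketch. It assumes every brick fails independently with the same probability over some period, which real hardware doesn't strictly do, and the 5% figure is invented for the example.

```python
def stripe_loss_probability(p_brick_failure, brick_count):
    """Chance that at least one of brick_count independent bricks fails."""
    return 1 - (1 - p_brick_failure) ** brick_count


if __name__ == "__main__":
    # Hypothetical 5% chance that any single brick fails in the period.
    for bricks in (1, 2, 4, 8):
        print(bricks, round(stripe_loss_probability(0.05, bricks), 4))
```

Since one failed brick takes the striped files with it, this is also the chance of losing them.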

Distribute, when used alone, is still susceptible to data loss due to disk failure, but only the files that are actually on the failed disk are affected. Because files are stored whole, disaster recovery is also still possible.

Summary

Distribute

Distribute is the default volume configuration of choice. It stores whole files and distributes those files among your bricks. When many clients access many disparate files, this will provide the greatest load distribution. Overloaded clusters can be expanded by adding more servers. Each of the following translators will combine with Distribute when the brick count is a larger multiple of the number of bricks that translator requires (4, 6, … 2n bricks in a stripe 2 volume; 6, 9, 12… bricks in a replica 3 volume; 12, 18… bricks in a replica 2 stripe 3 volume, etc.).
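The brick-count rule is just divisibility, which a few lines can show. This is a sketch with a made-up helper name, not anything the gluster CLI exposes: the brick count has to be a multiple of stripe × replica, and the quotient is the number of sets that Distribute spreads files across.

```python
def distribute_subvolumes(brick_count, stripe=1, replica=1):
    """Return how many stripe/replica sets Distribute would spread files over."""
    group = stripe * replica  # bricks needed for one complete set
    if brick_count % group != 0:
        raise ValueError(
            f"{brick_count} bricks is not a multiple of {group} "
            f"(stripe {stripe} x replica {replica})"
        )
    return brick_count // group


if __name__ == "__main__":
    print(distribute_subvolumes(6, stripe=2))              # stripe 2 volume
    print(distribute_subvolumes(9, replica=3))             # replica 3 volume
    print(distribute_subvolumes(12, stripe=3, replica=2))  # replica 2 stripe 3 volume
```

A result of 1 means there is a single set and nothing for Distribute to spread across; anything higher means whole sets are being distributed.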

Stripe

When using files that exceed the size of your bricks, or when using a small number of large files that see i/o at random offsets (such as huge ISAM files), stripe may be a good fit. Files stored on a stripe volume should be throw-away, as there’s no redundancy and a single brick failure will lose all the data.

Stripe + Replicate

New with 3.3, stripe + replicate will offer improved read performance over stripe alone, as well as redundancy and data integrity. This should still only be used with over-brick-sized files, or large files with random i/o.

Replicate ( + Distribute )

The most common configuration, replicate offers read load sharing, data integrity, and redundancy. Replicate + Distribute is most likely the volume configuration you should actually be using.

If you have a configuration where a striped volume actually tests better for your real use case, please write a whitepaper about it and let me know. I’d be happy to reference it.