Keeping your VMs from going read-only when encountering a ping-timeout in GlusterFS

GlusterFS communicates over TCP. This allows for stateful handling of file descriptors and locks. If, however, a server fails completely (kernel panic, power loss, some idiot with a reset button…), the client will wait ping-timeout seconds (42 by default) before abandoning that TCP connection. This is important because re-establishing FDs and locks can be a very expensive operation. As glusterbot says in #gluster:

Allowing a longer time to reestablish connections is logical, unless you have servers that frequently die.

When you’re hosting VM images on GlusterFS, that 42-second wait is long enough for the ext4 filesystems inside your VMs to hit errors and become read-only. You have two options:

Shorten the ping-timeout
You can shorten the ping-timeout by setting the volume option network.ping-timeout.
Change ext4’s error behavior
You can change ext4’s error behavior with the mount option “errors=continue”, or by changing the default in the superblock using tune2fs.
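
As a rough sketch, the two options might look like the following. The volume name myvol, the 10-second value and the device /dev/vda1 are placeholders I made up for illustration, not recommendations, and the function names are hypothetical wrappers. The first command is meant to run on one of the Gluster servers, and the second inside the VM whose filesystem you want to change.

```shell
#!/bin/sh
# Option 1, on a Gluster server: shorten the ping-timeout for one volume.
# Arguments: volume name, timeout in seconds (both are yours to choose).
set_ping_timeout() {
    volname="$1"
    seconds="$2"
    gluster volume set "$volname" network.ping-timeout "$seconds"
}

# Option 2, inside the VM: change the default error behavior stored in
# the ext4 superblock to "continue".
# Argument: the block device holding the ext4 filesystem.
set_ext4_errors_continue() {
    device="$1"
    tune2fs -e continue "$device"
}

# Example calls with placeholder values, left commented out so that
# nothing runs by accident:
# set_ping_timeout myvol 10
# set_ext4_errors_continue /dev/vda1
```

If you would rather use the mount option than touch the superblock, add errors=continue to the options field of that filesystem's entry in /etc/fstab inside the VM.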