DHT misses are expensive

Posted by Joe Julian 4 years, 7 months ago

In a distributed hash table lookup, like that used by GlusterFS, misses are expensive. Let's look at how it works and why misses are "bad".

How DHT works

When you open() a file, the distribute translator is given one piece of information to find your file: the filename. To determine where that file is, the translator runs the filename through a hashing algorithm to turn it into a number.

gf_dm_hash.py

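#!/usr/bin/env python3
import ctypes
import sys

# Load the GlusterFS core library, which exports gf_dm_hashfn().
glusterfs = ctypes.cdll.LoadLibrary("libglusterfs.so.0")

def gf_dm_hashfn(filename):
    # Under Python 3, ctypes needs bytes to pass a C string.
    name = filename.encode()
    # Wrap the result in c_uint32 so it is read as an unsigned 32-bit value.
    return ctypes.c_uint32(glusterfs.gf_dm_hashfn(name, len(name)))

if __name__ == "__main__":
    print(hex(gf_dm_hashfn(sys.argv[1]).value))

You can then calculate the hash for a filename:

# python3 gf_dm_hash.py camelot.blend
0x99d1b6f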

From this the distribute translator looks to see if it has the mappings for that directory cached. If it doesn't, it queries all the distribute subvolumes for the dht mappings for that directory. Those mappings are stored in extended attributes and look like:

# getfattr -n trusted.glusterfs.dht -e hex */models/silly_places
# file: a/models/silly_places
trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff

# file: b/models/silly_places
trusted.glusterfs.dht=0x0000000100000000000000003ffffffe

# file: c/models/silly_places
trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd

# file: d/models/silly_places
trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc

The trusted.glusterfs.dht value ends in two uint32 values. These are the start and end of the range of dht hashes that belong on that brick for that directory. In this example, 0x00000000 <= 0x099d1b6f <= 0x3ffffffe, so the file belongs on brick b.
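To make the range check concrete, here is a minimal sketch that parses the four xattr values shown above and picks the brick whose range contains the hash. It relies only on what is stated here: the last two uint32 values are the start and end of the range (read as big-endian, an assumption that matches the hex dump). The helper names dht_range and brick_for_hash are made up for illustration and are not part of GlusterFS.

```python
import struct

# Layout xattrs copied from the getfattr output above, keyed by brick.
LAYOUTS = {
    "a": "0x0000000100000000bffffffdffffffff",
    "b": "0x0000000100000000000000003ffffffe",
    "c": "0x00000001000000003fffffff7ffffffd",
    "d": "0x00000001000000007ffffffebffffffc",
}

def dht_range(xattr_hex):
    """Return (start, end) taken from the last two uint32 values."""
    raw = bytes.fromhex(xattr_hex[2:])      # drop the leading "0x"
    return struct.unpack(">II", raw[-8:])   # two big-endian uint32

def brick_for_hash(layouts, file_hash):
    """Return the brick whose range contains file_hash, or None."""
    for brick, xattr_hex in layouts.items():
        start, end = dht_range(xattr_hex)
        if start <= file_hash <= end:
            return brick
    return None

# Hash of camelot.blend from the example above.
print(brick_for_hash(LAYOUTS, 0x099d1b6f))  # prints: b
```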

Now the lookup is sent to brick b. If the file is there, great. That was pretty quick and efficient.

If the file's not there, hopefully there's a file there with the same filename, zero bytes, mode 1000, with the extended attribute "trusted.glusterfs.dht.linkto". This is what we call the sticky-pointer, or more correctly the dht link pointer. This tells the distribute translator "yes, the file should be here based on its hash but it's actually at...". This happens, for instance, when a file is renamed. Rather than use a bunch of network resources moving the file, a pointer is created where we expect the new filename to hash out to, and it points to where the file actually is. Two network calls, no big deal.

But what if the file doesn't exist?

If, however, the file doesn't exist there at all, the client calls dht_lookup_everywhere. As you might suspect from the name, this sends a lookup to each distribute subvolume. In my little 4x3 volume, that means 4 lookups out of distribute, and 3 lookups each out of replicate, for a total of 12 lookups. Now these are done essentially in parallel (the serial network connection prevents true parallelism) but that's still a lot of overhead.
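The three outcomes described so far (hit, link pointer, miss) can be sketched as a toy model. This is not GlusterFS code: the subvolumes are plain dictionaries, the function name toy_lookup is hypothetical, and the hashed subvolume is passed in rather than computed. It only counts how many lookups distribute sends in each case.

```python
REPLICAS = 3  # bricks behind each distribute subvolume in a 4x3 volume

def toy_lookup(subvols, hashed, name):
    """Return (location, targeted_lookups, broadcast_lookups).

    subvols maps a subvolume name to {filename: entry}, where entry is
    "data" for the real file or ("linkto", other_subvol) for a pointer.
    """
    entry = subvols[hashed].get(name)
    if entry == "data":
        return hashed, 1, 0                # found where the hash said
    if isinstance(entry, tuple) and entry[0] == "linkto":
        return entry[1], 2, 0              # one extra call to follow the pointer
    # Nothing at the hashed location: ask every distribute subvolume.
    broadcast = len(subvols)
    for subvol, files in subvols.items():
        if files.get(name) == "data":
            return subvol, 1, broadcast
    return None, 1, broadcast

subvols = {
    "a": {},
    "b": {"camelot.blend": "data", "renamed.blend": ("linkto", "d")},
    "c": {},
    "d": {"renamed.blend": "data"},
}

print(toy_lookup(subvols, "b", "camelot.blend"))  # ('b', 1, 0)
print(toy_lookup(subvols, "b", "renamed.blend"))  # ('d', 2, 0)
location, targeted, broadcast = toy_lookup(subvols, "b", "missing.php")
print(location, broadcast * REPLICAS)             # None 12
```

The last line reproduces the 12 lookups from the 4x3 example: 4 out of distribute, times 3 out of replicate.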

So What?

If your application frequently looks for files that don't exist, this adds a lot of wasted lookups, as the client queries every distribute subvolume every time the file doesn't exist. If this is, for instance, your average php app, there's commonly a long include path that gets searched for each of 1000 includes. It's not uncommon for 30000 non-existent files to be referenced for a single page load.
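To see how an include path multiplies misses, here is a rough sketch with made-up numbers: a hypothetical 31-directory include path, 1000 includes, and every file living in the last directory searched. The paths, file names and the count_misses helper are all assumptions for illustration; only the 12 lookups per miss comes from the 4x3 example above.

```python
import posixpath

def count_misses(include_path, existing_files, includes):
    """Count failed lookups made while resolving each include in order."""
    misses = 0
    for name in includes:
        for directory in include_path:
            if posixpath.join(directory, name) in existing_files:
                break          # found it, stop searching this include
            misses += 1        # the file is not in this directory
    return misses

# Hypothetical layout: 31 directories, all files in the last one.
include_path = ["/srv/app/lib%02d" % i for i in range(31)]
includes = ["inc%04d.php" % i for i in range(1000)]
existing_files = {posixpath.join(include_path[-1], name) for name in includes}

misses = count_misses(include_path, existing_files, includes)
print(misses)       # 30000 misses for one page load
print(misses * 12)  # 360000 lookups if each miss fans out to 12 bricks

# Searching only the directory that holds the files produces no misses.
print(count_misses(include_path[-1:], existing_files, includes))  # 0
```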

What can be done about it?

The gluster developers are working on mitigating that. Jeff Darcy created a sample python plugin translator that caches entries that just don't exist and saves all those lookups by just replying that the file wasn't there a second ago so it's still not there.

In the meantime, there are things you can do yourself:

  • Optimize your include path.
  • If you have control of the code you're running on a gluster volume, make your includes use absolute paths.
  • Perhaps make a table of files that do exist and search that instead of your entire cluster.
  • Put caches as close to the user as possible in hopes that they never get as far as the filesystem.
  • Use geo-replicate to have consistent code between servers on a local filesystem and only use a writable gluster volume for content.
  • Finally, you can set lookup-unhashed off. This will cause distribute to assume that if the file or linkfile cannot be found using the hash, the file doesn't exist.
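The negative-caching idea mentioned above (remember that a file wasn't there a moment ago and answer from memory) can be sketched in a few lines. This is not Jeff Darcy's translator and not a GlusterFS API; the class name NegativeLookupCache, the one-second ttl and the injectable clock are all assumptions made for the sake of a small, self-contained example.

```python
import time

class NegativeLookupCache:
    """Remember recent misses so repeated lookups skip the backend."""

    def __init__(self, backend_lookup, ttl=1.0, clock=time.monotonic):
        self._lookup = backend_lookup  # callable: path -> result or None
        self._ttl = ttl                # seconds a miss stays trusted
        self._clock = clock
        self._missing = {}             # path -> time the miss was recorded

    def lookup(self, path):
        now = self._clock()
        seen = self._missing.get(path)
        if seen is not None and now - seen < self._ttl:
            return None                # it wasn't there a moment ago
        result = self._lookup(path)
        if result is None:
            self._missing[path] = now  # record the miss
        else:
            self._missing.pop(path, None)
        return result

    def invalidate(self, path):
        """Forget a cached miss, e.g. after creating the file."""
        self._missing.pop(path, None)

backend_calls = []

def slow_lookup(path):
    backend_calls.append(path)
    return None  # pretend nothing exists

# A frozen clock keeps this demo deterministic.
cache = NegativeLookupCache(slow_lookup, ttl=1.0, clock=lambda: 0.0)
for _ in range(1000):
    cache.lookup("/srv/app/lib00/inc0001.php")
print(len(backend_calls))  # 1
```

The trade-off is staleness: a file created by another client can look missing until the ttl expires, so the ttl should stay short.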