What is this new .glusterfs directory in 3.3?

Posted by Joe Julian 4 years, 9 months ago

Version 3.3 introduced a new structure to the bricks, the .glusterfs directory. So what is it?

The GFID

As you're probably aware, GlusterFS stores metadata in extended attributes. One of these bits of metadata is "trusted.gfid". This is, for all intents and purposes, the inode number: a uuid that's unique to each file across the entire cluster. This worked pretty well for 3.1 and 3.2, but there were always a few weaknesses with regard to AFR (automatic file replication).

The GFID is used to build the structure of the .glusterfs directory. Each file is hardlinked to a path built from its gfid: the first two hex digits name a directory, the next two name a subdirectory inside it, and the hardlink itself is named with the complete uuid.

For instance

# getfattr -m . -d -e hex /data/glusterfs/d_home/stat.c
getfattr: Removing leading '/' from absolute path names
# file: data/glusterfs/d_home/stat.c
trusted.afr.home-client-10=0x000000000000000000000000
trusted.afr.home-client-11=0x000000000000000000000000
trusted.afr.home-client-9=0x000000000000000000000000
trusted.gfid=0xc62757554baf4a33bc7690c56dac23e0

makes a hardlink to

/data/glusterfs/d_home/.glusterfs/c6/27/c6275755-4baf-4a33-bc76-90c56dac23e0
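
The mapping from the xattr value to the hardlink path is mechanical, so it can be sketched in a few lines of Python. The function name gfid_file_path is made up for this illustration and is not part of GlusterFS; it assumes the gfid is given the way getfattr -e hex prints it.

```python
import os
import uuid


def gfid_file_path(brick_root: str, gfid_hex: str) -> str:
    """Build the .glusterfs path for a trusted.gfid value (hypothetical helper)."""
    # getfattr -e hex prints a leading 0x; drop it if present.
    if gfid_hex.lower().startswith("0x"):
        gfid_hex = gfid_hex[2:]
    # uuid.UUID validates the 32 hex digits and formats them as 8-4-4-4-12.
    gfid = str(uuid.UUID(hex=gfid_hex))
    # First two hex digits, then the next two, then the complete uuid.
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


print(gfid_file_path("/data/glusterfs/d_home", "0xc62757554baf4a33bc7690c56dac23e0"))
# -> /data/glusterfs/d_home/.glusterfs/c6/27/c6275755-4baf-4a33-bc76-90c56dac23e0
```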

So what?

The prior method was deficient in several ways, notably for deletes, renames, and hardlinks. If the connection to a replica was lost and a file was renamed, how would we know that it wasn't just deleted (or vice versa)? This caused issues where duplicate files were created, causing confusion.

Now if a file is deleted, so is its .glusterfs file. The self-heal daemon can walk the tree of the reconnected server and see a file that doesn't exist on the good server. If the gfid file is also gone from the good server, the file was deleted, so it's deleted from the stale server as well. If the gfid file of the missing file does still exist, the file has been renamed. The filename can be deleted from the stale server, but not the gfid file. Once the self-heal daemon walks to the new filename, that filename is then hardlinked with the data that's still on the server. This also reduces the need for data transfer to heal a renamed file.
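
The delete-versus-rename reasoning above can be sketched as a small decision function. This is only an illustration of the logic in that paragraph, not the self-heal daemon's actual code; the names classify_stale_name and gfid_file_path are hypothetical, and both bricks are assumed to be reachable as local directories.

```python
import os
import uuid


def gfid_file_path(brick_root: str, gfid_hex: str) -> str:
    """Build the .glusterfs path for a trusted.gfid value (hypothetical helper)."""
    if gfid_hex.lower().startswith("0x"):
        gfid_hex = gfid_hex[2:]
    gfid = str(uuid.UUID(hex=gfid_hex))
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


def classify_stale_name(good_brick: str, relpath: str, gfid_hex: str) -> str:
    """Classify a name seen on the stale brick by looking at the good brick.

    relpath and gfid_hex describe the entry as found on the stale brick.
    """
    # The name still exists on the good brick: nothing was deleted or renamed.
    if os.path.lexists(os.path.join(good_brick, relpath)):
        return "present"
    # The name is gone but the gfid file survives: the file lives on under a new name.
    if os.path.lexists(gfid_file_path(good_brick, gfid_hex)):
        return "renamed"
    # Both the name and the gfid file are gone: the file really was deleted.
    return "deleted"
```

In the "renamed" case only the old filename would go away on the stale server; the gfid file, and with it the data, stays put until the new name is linked to it.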

If a file was hardlinked, you were generally screwed. Eventually a disconnect would happen. A file would get stale. When the self-heal happened, the client had no way of knowing that there was another file with the same gfid, so it would create one. This created lots of unnecessary duplicate files. With the gfid files, each filename is hardlinked to the same gfid file so there's no waste.
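
Since every filename for a file is a hardlink to the same gfid file, you can confirm the relationship on a brick by comparing inodes. A minimal sketch, assuming you already have the gfid from getfattr; linked_to_gfid_file and gfid_file_path are names invented here, not GlusterFS tools.

```python
import os
import uuid


def gfid_file_path(brick_root: str, gfid_hex: str) -> str:
    """Build the .glusterfs path for a trusted.gfid value (hypothetical helper)."""
    if gfid_hex.lower().startswith("0x"):
        gfid_hex = gfid_hex[2:]
    gfid = str(uuid.UUID(hex=gfid_hex))
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


def linked_to_gfid_file(brick_root: str, relpath: str, gfid_hex: str) -> bool:
    """Return True if relpath on the brick is a hardlink to its gfid file."""
    # lstat so that a symlink is compared as itself, not as its target.
    name_stat = os.lstat(os.path.join(brick_root, relpath))
    gfid_stat = os.lstat(gfid_file_path(brick_root, gfid_hex))
    # Same device and inode number means both paths are the same file.
    return os.path.samestat(name_stat, gfid_stat)
```

This applies to regular files and symlinks; directories can't be hardlinked, so they need a different check.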

NFSv4

Coming soon to GlusterFS is NFSv4 support in which you can have anonymous file descriptors. gfid files allow that to happen by creating the gfid file without creating an entry in the directory tree.

What's that mean to me as an admin?

As an admin, that means you now have to manage gfid files as well as tree files with regard to self-heal and split-brain (see the article on healing split-brain). To do that, I thought it might be useful to know how it's laid out.

To begin with, the root directory of each brick has the gfid of 00000000-0000-0000-0000-000000000001. This puts it in .glusterfs/00/00. Its gfid file is a symlink that points to "../../..". If it's not, you'll get self-heal failures healing "/". I'm still not sure how I got them, but after creating a multiple split-brain scenario with my replica 3 servers, some of the root gfid files were directories instead of symlinks (bug #859581).
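
Checking the root gfid file on a brick is easy to script. The sketch below only tests the two properties described above (it's a symlink, and it points to "../../.."); the function name check_root_gfid and its return strings are invented for this example.

```python
import os

ROOT_GFID = "00000000-0000-0000-0000-000000000001"


def check_root_gfid(brick_root: str) -> str:
    """Report the state of a brick's root gfid file (hypothetical helper)."""
    path = os.path.join(brick_root, ".glusterfs", "00", "00", ROOT_GFID)
    # lexists is True even for a dangling symlink, unlike exists.
    if not os.path.lexists(path):
        return "missing"
    # A directory here instead of a symlink is the broken state described above.
    if not os.path.islink(path):
        return "not a symlink"
    target = os.readlink(path)
    if target != "../../..":
        return "wrong target: " + target
    return "ok"
```

Run it against each brick root, for example check_root_gfid("/data/glusterfs/d_home"), and look at anything that doesn't come back "ok".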

Each directory gets a symlink, named with its own gfid, that points to the directory's name within the gfid of its parent. So my home directory:

# getfattr -m . -d -e hex /data/glusterfs/d_home/jjulian
getfattr: Removing leading '/' from absolute path names
# file: data/glusterfs/d_home/jjulian
security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000
trusted.afr.home-client-10=0x000000000000000000000000
trusted.afr.home-client-11=0x000000000000000000000000
trusted.afr.home-client-9=0x000000000000000000000000
trusted.gfid=0xa0d421e0c3f249d4b2ee64e101c233af
trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff

Creates a symlink like

/data/glusterfs/d_home/.glusterfs/a0/d4/a0d421e0-c3f2-49d4-b2ee-64e101c233af -> ../../00/00/00000000-0000-0000-0000-000000000001/jjulian

The next directory down would point to ../../a0/d4/a0d421e0-c3f2-49d4-b2ee-64e101c233af/${self}, etc.
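
The expected target of a directory's gfid symlink follows directly from the parent's gfid and the directory's own name, so it can be computed and compared against what's on the brick. A minimal sketch; dir_symlink_target is a name made up for this example.

```python
import uuid


def dir_symlink_target(parent_gfid_hex: str, name: str) -> str:
    """Return the relative target a directory's gfid symlink should have."""
    # Accept the 0x-prefixed form that getfattr -e hex prints.
    if parent_gfid_hex.lower().startswith("0x"):
        parent_gfid_hex = parent_gfid_hex[2:]
    parent = str(uuid.UUID(hex=parent_gfid_hex))
    # Up two levels to .glusterfs, down into the parent's gfid entry, then the name.
    return "../../{}/{}/{}/{}".format(parent[0:2], parent[2:4], parent, name)


# My home directory sits directly under the brick root:
print(dir_symlink_target("0x00000000000000000000000000000001", "jjulian"))
# -> ../../00/00/00000000-0000-0000-0000-000000000001/jjulian
```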

Symlinks keep their original target, but under the gfid name:

# ls -l /data/glusterfs/b_home/jjulian/.fedora-upload-ca.cert
lrwxrwxrwx 2 root root 22 Sep 21 09:42 /data/glusterfs/b_home/jjulian/.fedora-upload-ca.cert -> .fedora-server-ca.cert
# getfattr -h -n trusted.gfid -e hex /data/glusterfs/b_home/jjulian/.fedora-upload-ca.cert
getfattr: Removing leading '/' from absolute path names
# file: data/glusterfs/b_home/jjulian/.fedora-upload-ca.cert
trusted.gfid=0x4bfc7da690004fe4b54eb0399984b712
# ls -l /data/glusterfs/b_home/.glusterfs/4b/fc/4bfc7da6-9000-4fe4-b54e-b0399984b712
lrwxrwxrwx 2 root root 22 Sep 21 09:42 /data/glusterfs/b_home/.glusterfs/4b/fc/4bfc7da6-9000-4fe4-b54e-b0399984b712 -> .fedora-server-ca.cert

If you delete a file from a brick without deleting its gfid hardlink, the filename will be restored as part of the self-heal process and that filename will be linked back with its gfid file. If that gfid file is broken, the filename file will be as well.
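
So if you really want a file gone from a brick, the filename and the gfid hardlink have to go together. Here's a rough sketch of that for regular files and symlinks. The function names are invented, and leaving the gfid file in place while other names still link to it is my own assumption about what's safe, so try it on a scratch directory before going anywhere near a real brick.

```python
import os
import uuid


def gfid_file_path(brick_root: str, gfid_hex: str) -> str:
    """Build the .glusterfs path for a trusted.gfid value (hypothetical helper)."""
    if gfid_hex.lower().startswith("0x"):
        gfid_hex = gfid_hex[2:]
    gfid = str(uuid.UUID(hex=gfid_hex))
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


def remove_from_brick(brick_root: str, relpath: str, gfid_hex: str) -> bool:
    """Remove a filename and, if nothing else links to it, its gfid file.

    Returns True if the gfid file was removed as well.
    """
    name = os.path.join(brick_root, relpath)
    gfid_path = gfid_file_path(brick_root, gfid_hex)
    # Refuse to act if the name and the gfid file are not the same inode.
    if not os.path.samestat(os.lstat(name), os.lstat(gfid_path)):
        raise ValueError("%s is not hardlinked to %s" % (name, gfid_path))
    os.unlink(name)
    # Link count 1 means the gfid file is now the only link left on this brick.
    if os.lstat(gfid_path).st_nlink == 1:
        os.unlink(gfid_path)
        return True
    return False
```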