My life with a geo-replicated GlusterFS setup

Recently, I had to build a geo-replicated GlusterFS setup and ran into a bunch of issues, due to out-of-date documentation and plain bad luck. A lot of time was spent reviewing things and discussing with the dev team over IRC. Here’s my (recent) life with a geo-replicated GlusterFS setup.

But why?

Before we get started, it would be good if I gave you a little context. So, this is an assignment for a software company.

They provide their customers with large amounts of data (about 11 TB). The publisher being French, and for reasons of convenience and in-house sysadmin skills at the time of implementation, all of its infrastructure was built in France.

Because they have little control over their network (provided by another service provider they cannot leave for the moment) and serve a worldwide audience (with strong growth in North America), they had to deploy edge servers (HTTP services distributing those famous 11 TB locally).

This inevitably leads to high latency and low throughput for those North American users. Not a very pleasant user experience, as you can imagine. In numbers:

  • Latency: 2 seconds compared to 0.3 seconds from France.
  • Throughput: 500 kB/s compared to 15 MB/s from France (and even 150 MB/s locally at storage)

So they decided to call me to see how we could improve things with the following constraints:

  • rely on the existing provider (which has infrastructure in Europe and North America)
  • keep any future migration as smooth as possible, to make it a quick win

After an audit, we settled on the following setup:

  • a local cluster in France, based on a distributed GlusterFS setup
  • a local cluster in North America, also based on a distributed GlusterFS setup, geo-replicated from the first one in real time

The interesting thing here is also to reuse technological building blocks the publisher’s team already knows, and to stay at a fairly high level: GlusterFS is an overlay on top of the underlying filesystem (XFS was our choice here). The GlusterFS documentation (since version 3.5) also promises that geo-replication works correctly on top of a distributed setup.

Please also note that I’ll summarize the operations here rather than describe each action minute by minute.

The setup

Like many DevOps folks, I enjoy working with Ansible to deploy my small clusters. So I started there, writing my own playbooks. Be careful: the project linked above is the result of the work described here, after some obvious cleanup, and potentially before a sequel…

The idea of this setup is to have the following elements:

  • Each country has its own GlusterFS cluster (the master being in France)
  • Each cluster has two servers in distributed mode
  • Each server has 10 disks of 4 TB dedicated to this storage, assembled in software RAID 10 and formatted in XFS (see the sketch below)
  • Each node mounts the cluster locally with these parameters: 127.0.0.1:/storage-fr /opt/self glusterfs defaults,_netdev,noatime,log-level=ERROR,log-file=/var/log/gluster.log,direct-io-mode=no,fopen-keep-cache,negative-timeout=10,attribute-timeout=30,fetch-attempts=5,use-readdirp=no,backupvolfile-server=fs-fr-02 0 0
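
As a rough sketch of what the brick preparation looks like on each server (device names are assumptions; adjust /dev/sd[b-k] and /dev/md0 to the actual hardware, and the -i size=512 inode size follows the usual Gluster recommendation for bricks):

# Assemble the 10 disks into a software RAID 10 array and format the brick in XFS
sudo mdadm --create /dev/md0 --level=10 --raid-devices=10 /dev/sd[b-k]
sudo mkfs.xfs -i size=512 /dev/md0
sudo mkdir -p /opt/brick
echo '/dev/md0 /opt/brick xfs defaults,noatime 0 0' | sudo tee -a /etc/fstab
sudo mount /opt/brick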

GlusterFS Distributed Volume
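
For reference, here is roughly what the volume creation and geo-replication setup boil down to on the CLI. This is a sketch of what the Ansible playbooks automate, reusing the hostnames and brick path from the list above, and it assumes the CA volume already exists and that root SSH from fs-fr-01 to fs-ca-01 is in place:

# On fs-fr-01: build the distributed volume from the two FR bricks
gluster peer probe fs-fr-02
gluster volume create storage-fr fs-fr-01:/opt/brick fs-fr-02:/opt/brick
gluster volume start storage-fr

# Still on fs-fr-01, once storage-ca exists on the CA side:
gluster system:: execute gsec_create
gluster volume geo-replication storage-fr fs-ca-01::storage-ca create push-pem
gluster volume geo-replication storage-fr fs-ca-01::storage-ca start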

That’s the setup at the very beginning. Now let’s live it.

And then there’s the drama

Thanks to the Ansible playbook, the entire deployment goes smoothly and without errors. So I import my 11 TB and sit back to admire the result… but not for long. Everything looks perfect, but a SysOps rule scares me:

If everything goes well on the first try, you’ve forgotten something.

Fortunately for me, not everything is so beautiful: geo-replication doesn’t start. A glance at the geo-replication status does indeed bring me back to reality:

gluster volume geo-replication storage-fr fs-ca-01::storage-ca status

MASTER NODE      MASTER VOL    MASTER BRICK        SLAVE USER    SLAVE                       SLAVE NODE       STATUS    CRAWL STATUS                   LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------------------------
fs-fr-01         storage-fr    /opt/brick           root         fs-ca-01::storage-ca        N/A              Faulty    N/A                            N/A
fs-fr-02         storage-fr    /opt/brick           root         fs-ca-02::storage-ca        N/A              Faulty    N/A                            N/A

In fact, on startup the replication goes to Active and then immediately to Faulty. So much for a successful setup. So obviously, we always start by consulting the logs first:

less `gluster volume geo-replication storage-fr fs-ca-01::storage-ca config log-file`

And here we find a beautiful, wonderfully self-explanatory log entry:

E [syncdutils(worker /gfs1-data/brick):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 322, in main
    func(args)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py", line 82, in subcmd_worker
    local.service_loop(remote)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 1277, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 599, in crawlwrap
    self.crawl()
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1555, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1455, in changelogs_batch_process
    self.process(batch)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1290, in process
    self.process_change(change, done, retry)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1229, in process_change
    st = lstat(go[0])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 564, in lstat
    return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE, EBUSY])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 546, in errno_wrap
    return call(*arg)
OSError: [Errno 22] Invalid argument: '.gfid/1ab24e67-1234-abcd-5f6g-1ab24e67'

My first thought is that the problem lies with a specific file. Checking the file matching that gfid, it seems perfectly valid. As an attempt to save the day, I get rid of it and restart the geo-replication. No success: it fails the same way. Discussing with the Gluster dev team over IRC, they wonder whether the amount of initial data might be the issue. Fair enough: I get rid of the data, 1 TB at a time, testing the geo-replication at each stage. Still no luck. Ending up with a completely empty volume and still no working replication, I wonder whether the changelog got corrupted or filled up during those tests. I decide to wipe the whole cluster and run the Ansible deployment again, this time without importing any data. To wipe the setup on the servers, I used the following commands:

sudo ansible fs-fr-01 -m shell -a "gluster volume geo-replication storage-fr fs-ca-01::storage-ca stop force"
sudo ansible fs-fr-01 -m shell -a "gluster volume geo-replication storage-fr fs-ca-01::storage-ca delete reset-sync-time"
sudo ansible fs_fr -m shell -a "echo y | gluster volume stop storage-fr"
sudo ansible fs_ca -m shell -a "echo y | gluster volume stop storage-ca"
sudo ansible fs -m shell -a "dpkg -l | grep gluster | awk '{ print \\$2 }' | xargs sudo apt -y remove"
sudo ansible fs -m shell -a "service glusterd stop"
sudo ansible fs -m shell -a "umount /opt/self"
sudo ansible fs -m shell -a "setfattr -x trusted.glusterfs.volume-id /opt/brick"
sudo ansible fs -m shell -a "setfattr -x trusted.gfid /opt/brick"
sudo ansible fs -m shell -a "rm -fR /opt/brick/.glusterfs"
sudo ansible fs -m shell -a "find /usr -name '*gluster*' -exec rm -fR {} \\;"
sudo ansible fs -m shell -a "find /var -name '*gluster*' -exec rm -fR {} \\;"
sudo ansible fs -m shell -a "find /etc -name '*gluster*' -exec rm -fR {} \\;"

And as you might have expected: it fails once again!

E [syncdutils(worker /gfs1-data/brick):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 322, in main
    func(args)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py", line 82, in subcmd_worker
    local.service_loop(remote)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 1277, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 599, in crawlwrap
    self.crawl()
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1555, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1455, in changelogs_batch_process
    self.process(batch)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1290, in process
    self.process_change(change, done, retry)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1229, in process_change
    st = lstat(go[0])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 564, in lstat
    return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE, EBUSY])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 546, in errno_wrap
    return call(*arg)
OSError: [Errno 22] Invalid argument: '.gfid/00000000-0000-0000-0000-000000000001'

This time, the gfid matches the Gluster root: /opt/self. How can it fail on an empty root? I review the complete setup with the dev guys and we cannot find any issue with it: it matches the simplest version described in the documentation, with no potential permission issue since I built the whole setup as root to begin with. One of them suggests a Python script that is supposed to do it all “by magic”. The tool’s installation and usage are pretty straightforward… but still: no luck.

After several days of digging, building and wiping things, on different hardware and VMs, I ask a simple question: is a distributed setup actually supported as the source of a geo-replication? Collegial positive answer: the feature has been there since version 3.5, we are at version 7.3 and none of them has touched that part of the code, so it should work. Right. “Should” is the key word here. None of them has actually tested this setup since the 4.x branch of GlusterFS. Fair enough, I can rebuild a setup quite quickly: let’s move to a replicated setup as the source of the geo-replication… And voilà! Now it works perfectly fine. A regression crept in somewhere since the 5.x branch and nobody noticed.

Not the perfect solution, but I can still achieve the results my customer expects. Let’s move to the new setup:

  • Each country has its own GlusterFS cluster (the master being in France)
  • Each cluster has two servers in replicated mode
  • Each server has 10 disks of 4 TB dedicated to this storage, assembled in software RAID 0 and formatted in XFS
  • Each node mounts the cluster locally with these parameters: 127.0.0.1:/storage-fr /opt/self glusterfs defaults,_netdev,noatime,log-level=ERROR,log-file=/var/log/gluster.log,direct-io-mode=no,fopen-keep-cache,negative-timeout=10,attribute-timeout=30,fetch-attempts=5,use-readdirp=no,backupvolfile-server=fs-fr-02 0 0

GlusterFS Replicated Volume
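
On the volume side, the only real change is to declare the two bricks as replicas instead of letting them be distributed. A sketch of the FR side, with the same hostnames and brick path as before (the geo-replication is then created and started exactly as before):

# On fs-fr-01: replica 2 instead of the default distributed layout
# (recent versions warn that replica 2 volumes are prone to split-brain; keep that in mind for later)
gluster volume create storage-fr replica 2 fs-fr-01:/opt/brick fs-fr-02:/opt/brick
gluster volume start storage-fr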

OK, we have a working setup, so we can import the 11 TB. It will take quite some time. I run a simple rsync command in a screen session without supervising it, and I repeat the command over and over so that the data is as up to date as possible prior to M-Day:

rsync -PWavu /mnt/old/ /mnt/new/ --delete-during --ignore-errors

So far so good: M-Day is here and we decide to proceed with the migration. We update the mount points, keeping the same options as before. The customer is happy, as the promised latency and throughput are reached. Which means I’m happy, and relieved.
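
For the North American edge servers, the change boils down to pointing their GlusterFS mount at the CA cluster instead of the FR one. A sketch of the fstab line, assuming they use the same mount point and options as the cluster nodes:

fs-ca-01:/storage-ca /opt/self glusterfs defaults,_netdev,noatime,log-level=ERROR,log-file=/var/log/gluster.log,direct-io-mode=no,fopen-keep-cache,negative-timeout=10,attribute-timeout=30,fetch-attempts=5,use-readdirp=no,backupvolfile-server=fs-ca-02 0 0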

Did you think it was over?

A couple of hours later, their biggest US customer reaches out: they can’t access any files. Checking the logs, I can see a bunch of HTTP 200 and 304 codes, so I wonder whether the issue is a PEBKAC or an intermediate cache somewhere else: I request a couple of the problematic URLs to test.

Checking the provided URLs on both the FR and CA platforms, I manage to reproduce the issue. The edge servers report that the files do not exist, which is puzzling since the platform data is up to date. Comparing the same folder on both platforms, the CA instance appears… empty. Yet df reports the same volume usage. I add a simple -a to my ls… and there they all are: the files exist, with the proper size, in the proper folder, but instead of player.json (for instance) the file is named .player.json.AbvgGY. Weird? Not so much: this is the format of rsync’s temporary files. But why doesn’t GlusterFS fix them on the CA platform, since the data and the names are valid on the FR side? Simply because it relies on a simple checksum of the content, ignoring the file names (it expects to track the rename() operation on the fly).

What’s your name? Fuck you, that’s my name.

At this stage, my guess is that something happened during the data import, leading to the wrong files being replicated. Checking each cluster’s logs, status and details (including the heal info), it appears that a split-brain occurred on the FR platform. Pretty logical with only two nodes in it. Digging a bit more into the timestamps, logs and states, it turns out to be a race condition between multiple things:

  • a PEBKAC (from me): I ran the rsync command with no specific parameters, leading to the creation of temp files; I should have added the --inplace argument to write to the final file name at once (see the corrected command below)
  • network issues on the provider side: kind of expected, but they occurred multiple times, partly during the rename() operations
  • GlusterFS geo-replication being too simple: why the heck doesn’t it include the file name in the validation of the content?
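
For the record, this is what the import command should have looked like to avoid the temporary dot files entirely (same paths as the earlier rsync; --inplace writes directly into the destination file instead of creating a hidden temp file and renaming it at the end):

rsync -PWavu --inplace /mnt/old/ /mnt/new/ --delete-during --ignore-errors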

So, while it’s simple to heal the FR cluster (both the split-brain and the invalid data on one node), fixing the names on the CA cluster is not so obvious. There is no GlusterFS mechanism to do so, and I can’t afford to wait a week for the data to sync again if I decide to wipe it. From there, I come up with a new idea that is basically an abuse of how GlusterFS works and how it fixes a split-brain situation.

I know there are no more customer connections to the CA cluster, as I moved them all to the FR cluster while solving the issue. To summarize the idea:

  • GlusterFS considers writing directly to a local brick (outside of GlusterFS) an issue, as it will lead to a split-brain
  • A split-brain can be resolved by reading the file locally through GlusterFS (the mountpoint over 127.0.0.1) if there is no concurrent access to it
  • A rename() operation does not change the gfid; it just updates the underlying link GlusterFS uses
  • We know the data is valid; only the name is an issue

Thanks to this, I write a quick script and run it over my folders:

#!/bin/bash
# Run from fs-ca-01: fix the rsync temp-file names directly on the brick,
# trigger the heal through the local mountpoint, then clean up the second node.

find /opt/brick/ -name ".*" -type f | while read -r f; do
        # Strip the rsync temp-file decoration: .player.json.AbvgGY -> player.json
        rename -f -d -v 's/\.([^.]*)\.([^.]*)\..*/$1.$2/' "$f"
        brick=$(dirname "$f")
        # Same directory, seen through the local GlusterFS mountpoint
        self=$(echo "$brick" | sed 's/brick/self/g')
        # Reading it through GlusterFS resolves the split-brain we just induced
        ls -la "$self" > /dev/null 2>&1
        # Push the fixed directory to the second node and drop the stale dot files there
        dest=$(dirname "$brick")
        sudo rsync -a --inplace "$brick" root@fs-ca-02:"$dest" --delete-after > /dev/null
done

exit 0

Not the cleanest thing, but efficient at least. It renames the files, makes sure the split-brain is resolved, then removes any remaining invalid dot files on the other node of the cluster. Neat.

One more thing…

We should be fine. “Should.” But we are not. A new issue appears on the edge servers while a full heal is running on the FR platform: some of them serve corrupted files that are still being healed, then cache those invalid files. As the files are fine on both nodes of the cluster, the issue is clearly on the edge servers. To solve it, we just need to unmount and remount the Gluster endpoint on them. Quick and efficient. We will have to do it again once the full heal is complete.
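
A minimal sketch of that workaround, assuming the edge servers are grouped as edge in the Ansible inventory and mount the volume on /opt/self like the cluster nodes:

# Check what the FR cluster still has to heal
gluster volume heal storage-fr info

# Remount the GlusterFS endpoint on the edge servers to drop the bad cache
sudo ansible edge -m shell -a "umount /opt/self && mount /opt/self"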

Despite all those issues with GlusterFS 7.3, I used to run a bunch of setups based on it, up to and including the 4.x branch. It used to be a bit more stable and reliable. I love the solution because it’s just an extra layer, allowing you to build lots of things under it or around it. When it works, it just works. But when you have issues, the logs are not your best friend. I think this is the main obstacle to wider adoption of GlusterFS at this stage: the log entries and the documentation.

Do you use GlusterFS to build reliable storage or geo-replication services? Do you struggle with it? Feel free to ping me to discuss it further.
