mdadm with bcache and btrfs

published:

This is the setup for data storage on my server. It works well for my use case. You should do your own research before blindly copying this.

Overview

The setup is as follows:

I have multiple (slow but big) HDDs in a RAID6 configuration. In one server I use 4TB WD Red drives, in the other 12TB Seagate IronWolf drives. (For a new server I would go with 8TB or 10TB drives - that would keep the cost and rebuild time down compared to the 12TB drives while still providing more usable space than the 4TB drives.)

There are also two 500GB WD Red SSDs, in RAID1 configuration, as a caching layer. The servers are connected via 2.5 Gbps ethernet, and when adding a lot of (multiple hundreds of GB of) small files, the disk speed of the HDDs becomes a bottleneck. The SSD cache helps to keep the speed up and then writes the cached files to the HDDs in the background. It also caches frequently used files for faster access times (although the RAID6 itself almost saturates the 2.5 Gbps ethernet link when no concurrent writes occur). Because I use 'writeback' as the cache mode, any data not yet written back to the HDDs would be lost (and the filesystem likely corrupted) if the caching device died - that's the reason there are two SSDs in RAID1 configuration: if one dies, the writeback to the backing device can be finished without losing or corrupting any data.
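
As a side note, once bcache is set up (see below), the amount of dirty data still waiting to be written back to the HDDs can be checked via sysfs (assuming the device shows up as bcache0, as it does in my setup):

cat /sys/block/bcache0/bcache/dirty_data
cat /sys/block/bcache0/bcache/state
# 'state' is 'dirty' while writeback is pending and 'clean' otherwise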

I also use btrfs as the filesystem because it provides snapshots: I can take automatic hourly snapshots of all the files and restore them if needed.
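
Snapper (set up further down) automates the snapshots, but for illustration, a manual read-only snapshot and a single-file restore look roughly like this (the paths are just examples):

sudo btrfs subvolume snapshot -r /srv/subvol1 /srv/subvol1/.manual-snapshot
# restore a single file by copying it back from the snapshot:
sudo cp -a /srv/subvol1/.manual-snapshot/some/file /srv/subvol1/some/file
# delete the snapshot when it is no longer needed:
sudo btrfs subvolume delete /srv/subvol1/.manual-snapshot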

This is only for the data storage on my server. Using this setup for a root partition would require additional work.

This setup results in the following structure:

[ssd1] --\
          >--- [raid1] -- [bcache cache] ----\
[ssd2] --/                                    \
                                               \                          /-- [subvolume1]
                                                >-- [bcache] -- [btrfs] -< -- [subvolume2]
[hdd1] --\                                     /                          \       ...
[hdd2] ---\                                   /                            \- [subvolumeN]
 ...       >-- [raid6] -- [bcache backing] --/
[hddN] ---/

I use the drives directly, without partitions. The drives are identical anyway, so adding more drives later is not a problem, and I haven't run into any issues from having no partitions yet. You may want to add partitions to the drives first and then use /dev/sdb1 instead of /dev/sdb (and so on) when creating the RAIDs.
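
If you go the partition route, creating a single partition that spans the whole disk could look like this (repeated for every drive; /dev/sdb is just an example):

sudo parted -s /dev/sdb mklabel gpt
sudo parted -s /dev/sdb mkpart primary 0% 100%
# then use /dev/sdb1 (and so on) in the mdadm commands below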

Creating the RAID, cache and filesystem

This is roughly how I set it up:

lsblk
# this will get us all connected drives. let's assume the following:
# /dev/sdb - /dev/sde are 4 HDDs (the minimum for RAID6)
#     --> they will become /dev/md0 for the backing device
# /dev/sdh and /dev/sdi are our SSDs
#     --> they will become /dev/md1 for the cache

# first, let's create the RAID6 device (HDDs):
sudo mdadm --create --verbose /dev/md0 --level=6 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
# this will take some time, use
watch cat /proc/mdstat
# to see when the building is finished
# we then add the configuration to the mdadm.conf:
sudo mdadm --detail --scan /dev/md0 | sudo tee -a /etc/mdadm/mdadm.conf
# we also edit the /etc/mdadm/mdadm.conf to add 'MAILADDR mail@example.com', so mdadm can
# send us warnings about degraded devices, as long as the server can send emails

# next, let's create the RAID1 device (SSDs):
sudo mdadm --create --verbose /dev/md1 --level=1 --raid-devices=2 /dev/sdh /dev/sdi
watch cat /proc/mdstat
# when it's done:
sudo mdadm --detail --scan /dev/md1 | sudo tee -a /etc/mdadm/mdadm.conf

# now, we can set up bcache on top of mdadm; first the backing device:
sudo make-bcache -B /dev/md0
# then the caching device:
sudo make-bcache -C /dev/md1
# we then need the UUID of the cache device:
sudo bcache-super-show /dev/md1 | grep cset
# and attach it to the backing device:
sudo bash -c 'echo "[UUID of caching device]" > /sys/block/bcache0/bcache/attach'
# we then change the cache mode to 'writeback' -- attention, this may result
# in data loss or corruption! make sure you know what the different cache modes
# do before setting it to 'writeback'!
sudo bash -c 'echo "writeback" > /sys/block/bcache0/bcache/cache_mode'
cat /sys/block/bcache0/bcache/cache_mode
# let's check the status of bcache:
cat /sys/block/bcache0/bcache/state
# should be 'clean'

# now, let's add btrfs on top of bcache (on top of mdadm):
sudo mkfs.btrfs -L data /dev/bcache0
# we then need the uuid of the btrfs filesystem:
sudo btrfs filesystem show
sudo mkdir /mnt/tmp
sudo mount -o compress=zstd -U [UUID of btrfs filesystem] /mnt/tmp/
# I mounted btrfs with compression. You could omit '-o compress=zstd' to not use
# compression, or change the compression algorithm - see the btrfs man page.

# I want to use different subvolumes for different mount points, so I create them now
# (subvol1 and subvol2 are just example names. use something like 'webserver'
# or 'data' in the real world)
sudo btrfs subvolume create /mnt/tmp/subvol1
sudo btrfs subvolume create /mnt/tmp/subvol2
# and so on ..
sudo btrfs subvolume list -p /mnt/tmp
sudo umount /mnt/tmp
sudo rmdir /mnt/tmp

# we then create the mountpoints for the subvolumes. this is just an example:
sudo mkdir /srv/subvol1
sudo mkdir /srv/subvol2
# and then mount them:
sudo mount -o compress=zstd -U [UUID of btrfs filesystem] -o subvol=subvol1 /srv/subvol1/
sudo mount -o compress=zstd -U [UUID of btrfs filesystem] -o subvol=subvol2 /srv/subvol2/
# if a subvolume should have no COW, I disable it now ('+C' on a directory only
# affects files created afterwards, so do this while the subvolume is still empty):
sudo chattr +C /srv/subvol2/

# I then set up snapper to automatically make hourly, daily, monthly and yearly snapshots:
sudo snapper -c subvol1 create-config /srv/subvol1
# this is repeated for every subvolume that needs automatic snapshots
# see the snapper manual for details
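
# for day-to-day use, this is roughly how listing the snapshots and reverting a
# single file to an older snapshot looks ('42' is just a placeholder number):
sudo snapper -c subvol1 list
sudo snapper -c subvol1 undochange 42..0 /srv/subvol1/some/file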

We also need to edit /etc/fstab so the subvolumes are mounted automatically on startup:

UUID=[UUID of btrfs filesystem]	/srv/subvol1	btrfs	subvol=subvol1,compress=zstd	0	0
UUID=[UUID of btrfs filesystem]	/srv/subvol2	btrfs	subvol=subvol2,compress=zstd	0	0
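
To check the fstab entries without rebooting, something like this should work ('findmnt --verify' needs a reasonably recent util-linux):

sudo findmnt --verify
sudo mount -a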

Adding more disks

If I ever run out of space, I can just add more disks to the RAID6:

cat /proc/mdstat
sudo mdadm --detail /dev/md0
# make sure that 'state' is set to 'clean'
# let's assume, the new disk is /dev/sdf:
sudo mdadm --add /dev/md0 /dev/sdf
sudo mdadm --detail /dev/md0
# 'sdf' should appear as a spare
sudo mdadm -v --grow --raid-devices=5 /dev/md0
# raid devices is the new number of disks in the raid array
# we added a 5th disk, so we set --raid-devices=5
watch cat /proc/mdstat
# this will take a long time (like in 'days') to finish
sudo mdadm --detail /dev/md0
# 'state' should be 'clean'
# we then need to expand the btrfs filesystem
# you can use any mounted subvolume for this - the resize applies to the whole filesystem:
sudo btrfs filesystem resize max /srv/subvol1
df -h
# should now show more space

The workflow for replacing a dead disk is similar.
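
For reference, a rough sketch of what that could look like (assuming /dev/sdc failed and /dev/sdf is the replacement - adjust the device names):

sudo mdadm --detail /dev/md0
# mark the failed disk as faulty (if mdadm hasn't already) and remove it from the array:
sudo mdadm --manage /dev/md0 --fail /dev/sdc
sudo mdadm --manage /dev/md0 --remove /dev/sdc
# physically swap the disk, then add the new one:
sudo mdadm --manage /dev/md0 --add /dev/sdf
watch cat /proc/mdstat
# the rebuild will take a while; 'state' should be 'clean' again afterwards
sudo mdadm --detail /dev/md0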

----------
Have a comment? Drop me an email!