Lol How Do I RAID

Version 1.4.1-alpha.5
2024, 2025 eomanis
PGP signature

How to create an encrypted software RAID 6 on Linux

For cold storage (lots of data that, once written, does not change much, but may be read now and then)

A guide that gets shit done, along with some dos and don'ts with whys and whyn'ts

Scope

This guide is about setting up a single bigass stripe-aligned XFS file system on top of a single dm-crypt volume on top of a single RAID

Nothing more, nothing less; a resilient low-complexity solution that is known to work reliably

If you do not want encryption you may leave out the respective steps, i.e. you create the file system directly on the RAID

You may also substitute XFS with a different file system type if you are so inclined

Covered in particular

Topics not covered

Notable required knowledge

The disks

Here are our 6 20TB disks that will be assembled to a RAID 6, on top of which we'll put a dm-crypt volume, on top of which we'll create an XFS file system that is aligned to the RAID geometry for optimal performance:

[root@the-server ~]# ls -l /dev/disk/by-id | grep ata-TOSHIBA_

lrwxrwxrwx 1 root root  9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXUXXXX -> ../../sde
lrwxrwxrwx 1 root root  9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXOXXXX -> ../../sdb
lrwxrwxrwx 1 root root  9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXFXXXX -> ../../sdf
lrwxrwxrwx 1 root root  9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXAXXXX -> ../../sdc
lrwxrwxrwx 1 root root  9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXIXXXX -> ../../sda
lrwxrwxrwx 1 root root  9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXX6XXXX -> ../../sdd

RAID creation

A RAID 6 uses 2 disks for parity/recovery information, and since we have 6 disks that'll get us the effective storage space of 4 disks
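Or, as a quick calculation in the same style used further below:

(6 disks - 2 parity disks) * 20TB
 = 4 * 20TB
 = 80TB of effective storage space (hence the "80tb" in the device and label names used later)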

Important: Concerning the space used by the RAID on each member disk, we must make sure to stay below 20TB so that we can be sure that any "20TB" disk can be used as replacement for a broken one

This is necessary because hard disk models' exact capacities may vary slightly above their advertised size; if a replacement disk is of a different model that is just a tiny bit smaller than these ones, it cannot join an array that was created using all available disk space, and that would just ruin our day

Create the RAID on the raw, unpartitioned disks

Some guides recommend partitioning the disks to restrict the space that the RAID uses on each member disk, but that is bad practice, and here's why (skip ahead if you don't care why):

The one – and arguably weak – argument usually made for partitioning the disks is that, if you in a bout of uncommon stupidity were to attach an unpartitioned RAID member disk to a particularly dumb operating system, that operating system might misidentify it as an empty disk and automatically partition and format it, and boom, that'd be one corrupt RAID member disk

Calculating mdadm --size=

Intention: We only want to use the first 19999GB of each disk, so that on a disk that is exactly 20TB a single GB remains unused at its end

But the --size= parameter assumes binary prefixes (KiB, MiB, GiB…), not decimal prefixes (kB, MB, GB…)

So, we need to convert those 19999GB to GiB. Full calculation as described:

((20*1000*1000*1000*1000) - (1*1000*1000*1000)) / 1024 / 1024 / 1024
((20TB in bytes         ) - (1GB in bytes    )) -> KiB -> MiB -> GiB
 = 18625GiB (rounded down to whole GiB)

Then again, who cares about another unused GiB per disk; better safe than sorry, so here we use 18624GiB
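If you want to double-check how much slack that leaves on your particular disks, blockdev prints each member disk's exact capacity in bytes; this is just a sanity check, not part of the original session, and the figures should come out around the nominal 20000000000000 bytes, typically at or slightly above it:

[root@the-server ~]# for disk in /dev/sd[abcdef]; do printf '%s: ' "$disk"; blockdev --getsize64 "$disk"; done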

[root@the-server ~]# mdadm --create /dev/md/raid --verbose --homehost=the-server --name=raid --raid-devices=6 --size=18624G --level=6 --bitmap=internal /dev/sd[abcdef]

mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/raid started.

In retrospect the order of disks could have been given according to ascending serial number

Review detailed RAID information

[root@the-server ~]# mdadm --detail /dev/md/raid

/dev/md/raid:
           Version : 1.2
     Creation Time : Fri Feb 21 11:04:42 2024
        Raid Level : raid6
        Array Size : 78114717696 (72.75 TiB 79.99 TB)
     Used Dev Size : 19528679424 (18.19 TiB 20.00 TB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Feb 21 11:06:14 2024
             State : clean, resyncing 
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

     Resync Status : 0% complete

              Name : the-server:raid
              UUID : 8b76d832:6ca64e3a:af5bc01e:1471551a
            Events : 20

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf

For the subsequent steps the chunk size of 512KiB is of interest

Wait until the RAID has synced … or don't

A newly created RAID needs to calculate and write its initial parity information and it will start doing so immediately; this is called "syncing"

Syncing happens transparently in the background; it may take a long time but is interruptible, e.g. if you reboot your PC and start the RAID again it will automatically resume syncing from where it was when it was stopped

Also while the sync is running the RAID may already be used albeit likely with reduced performance

Anyhow, you can check the sync progress by fetching the RAID status:

[root@the-server ~]# cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      78114717696 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  resync =  0.2% (54722848/19528679424) finish=1603.2min speed=202445K/sec
      bitmap: 146/146 pages [584KB], 65536KB chunk

unused devices: <none>

Here we can also see that the actual RAID block device is /dev/md127; we need this later

Increase sustained write speed

Before we start using it for real we are going to set up encryption on the RAID and then fill the whole thing with encrypted zeros, and for that (and for later use too of course) we want to ensure that the RAID can write with its maximum possible speed

We do this by increasing the size of the stripe cache, because its default size usually bottlenecks the RAID write speed

What even is a RAID stripe?

Each RAID member disk is sectioned into chunks of 512KiB; this is the "chunk size" we have seen above in the detailed RAID view

The RAID logic, i.e. the parity information calculation and data recovery, operates on groups of such chunks that consist of one chunk from each member disk, from the same chunk index, and such a group is called a "stripe"

This being a RAID 6, within each stripe 2 chunks hold parity/recovery information and the remaining 4 chunks contain effective data

Now, consider: Altering some small amount of data in a single chunk invalidates the parity information of the two parity chunks in that chunk's stripe, which must then be recalculated and updated on-disk, and on top of that, for parity recalculation the RAID logic needs the data from the other 3 data chunks, which it therefore has to read…

If you think "that sounds like it could be slow" you'd be right, which is why we want to only write whole stripes to the RAID if at all possible, because then the stripe's parity chunks can be calculated up front in memory without having to read anything from the RAID, and this is why there is a "stripe cache"

What is the stripe cache?

The stripe cache is a reserved area in RAM where writes to a specific RAID are collected before they are written to the disks, with the goal of accumulating whole stripes that can then be written faster

It is set to a default size of 256 when the RAID is assembled, and can be changed while the RAID is online

…Holup, 256 what? Chunks, stripes, MiB, bananas? How much RAM is that? Unfortunately official documentation seems scarce, but hearsay has it that it's (size * memory page size * total RAID disk count), so with the usual page size of 4KiB that would be (256 * 4KiB * 6) = 6144KiB = 6MiB, which does seem small indeed

We'll crank it to 8192 which according to that formula will use (8192 * 4KiB * 6) = 196608KiB = 192MiB of memory
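If you want to verify those two inputs on your own machine before changing anything, both are easy to read; a quick check that is not part of the original session, which on this setup should print 256 (the default) and 4096 (the page size in bytes) respectively:

[root@the-server ~]# cat /sys/class/block/md127/md/stripe_cache_size
[root@the-server ~]# getconf PAGESIZE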

[root@the-server ~]# echo 8192 > /sys/class/block/md127/md/stripe_cache_size

Important: This setting is not persisted into the RAID configuration and must be set again each time after the RAID has been assembled

Typically you write a udev rule that does this and is triggered after the RAID has been assembled

Example: Text file "/etc/udev/rules.d/60-md-stripe-cache-size.rules"

# Set the RAID stripe cache size to 8192 for any RAID that is assembled on this system
SUBSYSTEM=="block", KERNEL=="md*", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}!="8192", ATTR{md/stripe_cache_size}="8192"
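The rule only runs when udev processes the device, so to apply it right away without rebooting or reassembling the RAID you can reload the rules and re-trigger a change event for the array; this assumes the rules file path used above:

[root@the-server ~]# udevadm control --reload
[root@the-server ~]# udevadm trigger --action=change --sysname-match=md127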

Create the encryption layer on the RAID

We use dm-crypt for encryption and create a LUKS2 volume

This particular dm-crypt volume will be unlocked automatically on startup with the crypttab mechanism using a key file that we will create and add a key slot for later

But we also want to be able to unlock the volume on its own in an emergency with a regular passphrase, and we create it with this passphrase now

Use key slot 1 for the passphrase because key slot 0 will be used for the key file, so that in the regular use case (automatic unlocking at system startup) unlocking is quicker because slot 0 will be tried first

Now this here is the first step where so-called "stripe alignment" must be considered

Make the LUKS2 data segment start at a stripe boundary

Stripe alignment is basically "having stuff on the RAID start at a stripe boundary" or more generally "being smart about where to write stuff to the RAID so that partial stripe writes are avoided"

The XFS file system that we will put on top of the LUKS2 volume will be configured to do just that, but making this work hinges on the encryption layer sitting between the RAID and the file system not introducing a "sub-stripe-width shift", so to speak

So, stripe alignment: Our stripe width, as it is perceived from anything that uses the RAID's storage space, is (512KiB chunk size * 4 effective disks) = 2MiB

The dm-crypt LUKS2 default data segment offset is 16MiB, which is an exact multiple of 2MiB, so we are good as-is and do not need to specify a custom --offset to move the data segment to the next whole-stripe boundary at or beyond 16MiB
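Spelled out in the same style as the other calculations in this guide:

16MiB / (512KiB chunk size * 4 effective disks)
 = 16384KiB / 2048KiB
 = 8 (a whole number of stripes, so the default offset sits exactly on a stripe boundary)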

[root@the-server ~]# cryptsetup luksFormat --type luks2 --verify-passphrase --key-slot 1 --label srv.raid-80tb.encrypted /dev/md127

WARNING!
========
This will overwrite data on /dev/md127 irrevocably.

Are you sure? (Type 'yes' in capital letters): YES
Enter passphrase for /dev/md127:
Verify passphrase:

To give an example where we are not so lucky, let's say we had created the RAID 6 with one more disk, so 7 disks altogether, which yields the effective space of 5 disks

That would mean a stripe width of (512KiB chunk size * 5 effective disks) = 2560KiB = 2.5MiB, which 16MiB is not an exact multiple of: 16MiB / (512KiB * 5) = 6.4

Since we want the offset to be at least the default 16MiB our target offset for the data section would be at 7 times the stripe width, i.e. 7 * (512KiB * 5 effective disks) = 17920KiB

The --offset argument requires the offset to be supplied as number of 512B sectors, so we need to convert these KiB to sectors:

(17920 * 1024) / 512
(  KiB ->   B) ->  s
 = 35840s

Accordingly, we would append this to the command line: --offset 35840

Review encryption details

[root@the-server ~]# cryptsetup luksDump /dev/md127

LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           3d1b3de7-cc0c-4fe4-81e2-270d38966ef7
Label:          srv.raid-80tb.encrypted
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
    offset: 16777216 [bytes]
    length: (whole device)
    cipher: aes-xts-plain64
    sector: 4096 [bytes]

Keyslots:
  1: luks2
    Key:        512 bits
    Priority:   normal
    Cipher:     aes-xts-plain64
    Cipher key: 512 bits
    PBKDF:      argon2id
    Time cost:  12
    Memory:     1048576
    Threads:    4
    Salt:       dc 6b 2b 54 94 63 3a e7 1b f1 c4 c3 5e 43 00 f6
                fc 54 75 da f6 ba 7a 13 3e bb 72 b1 1d 7c 60 ba
    AF stripes: 4000
    AF hash:    sha256
    Area offset:32768 [bytes]
    Area length:258048 [bytes]
    Digest ID:  0
Tokens:
Digests:
  0: pbkdf2
    Hash:       sha256
    Iterations: 332998
    Salt:       ae 91 73 42 d5 d6 ed b7 83 d5 f2 43 3b 18 04 87
                e2 40 26 23 80 e7 ae 7f a3 4f 20 d8 19 1c ab 9d
    Digest:     4f e6 a3 83 40 7a d4 65 24 84 dc 69 e4 f3 43 a7
                c2 2e 28 ee e2 94 7b 9d 4d b8 4e 96 14 aa 46 6a

The data segment offset is 16777216 bytes, which is indeed 16MiB:

16777216B / 1024 / 1024
    bytes -> KiB -> MiB
 = 16MiB

Unlock the encrypted device

[root@the-server ~]# cryptsetup open /dev/md127 srv.raid-80tb

Enter passphrase for /dev/md127:

Overwrite the encrypted block device with zeros

This causes the underlying RAID device to be filled with what looks like random data (the encrypted zeros), disguising how much space is actually used

[root@the-server ~]# dd if=/dev/zero iflag=fullblock of=/dev/mapper/srv.raid-80tb oflag=direct bs=128M status=progress

79989336702976 bytes (80 TB, 73 TiB) copied, 159750 s, 501 MB/s
dd: error writing '/dev/mapper/srv.raid-80tb': No space left on device
595968+0 records in
595967+0 records out
79989454143488 bytes (80 TB, 73 TiB) copied, 159751 s, 501 MB/s

Unlike the RAID sync this step does not automatically resume when it is interrupted, so be prepared to have your PC run for a day or two on end (the run shown above took roughly 44 hours)

Create an XFS file system

XFS (and other file systems too) can apply optimizations for underlying RAID chunks and stripes if they know about them, which is what we want

Fortunately for us, mkfs.xfs correctly detects the RAID's chunk size and stripe width automatically, even "through" the dm-crypt layer

[root@the-server ~]# mkfs.xfs -L raid-80tb -m bigtime=1,rmapbt=1 /dev/mapper/srv.raid-80tb

meta-data=/dev/mapper/srv.raid-80tb isize=512    agcount=73, agsize=268435328 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=19528675328, imaxpct=1
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

In the output, review the "data" section:

Does that match the RAID geometry, which is required for good performance?

The XFS "stripe unit" (sunit) is the RAID's "chunk size" as it is listed in the detailed RAID information (512KiB); they must be the same

Here it is expressed in blocks, so (128 * 4KiB block size) = 512KiB, same as the RAID chunk size, that tracks

As mentioned above, the stripe width is the stripe unit (RAID chunk size) multiplied by 4 effective disks, (512KiB * 4) = 2MiB

The mkfs.xfs output lists the stripe width (swidth) as 512 blocks, which is (512 * 4KiB block size) = 2MiB, so yes, we are good

Mount the file system

This does not require any fancy business with mount options, the defaults work fine

For example, to mount the XFS file system at "/mnt/raid-80tb" we'd do this:

[root@the-server ~]# mount /dev/mapper/srv.raid-80tb /mnt/raid-80tb

We can see the stripe unit (sunit) and stripe width (swidth) in the mount options when we look at the mounted file system:

[root@the-server ~]# mount | grep /mnt/raid-80tb

/dev/mapper/srv.raid-80tb on /mnt/raid-80tb type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=4096,noquota)

This time, though, they are expressed in numbers of 512-byte blocks; it says so in the "MOUNT OPTIONS" section of the XFS user manual
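If you would rather see the geometry again in the same 4KiB file system blocks that mkfs.xfs used, xfs_info on the mount point prints the same summary (output omitted here, it mirrors the mkfs.xfs output above):

[root@the-server ~]# xfs_info /mnt/raid-80tb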

At this point your block device stack should look like this:

[root@the-server ~]# lsblk --merge

    NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
┌┈> sda               8:0    0  18.2T  0 disk  
├┈> sdb               8:16   0  18.2T  0 disk  
├┈> sdc               8:32   0  18.2T  0 disk  
├┈> sdd               8:48   0  18.2T  0 disk  
├┈> sde               8:64   0  18.2T  0 disk  
└┬> sdf               8:80   0  18.2T  0 disk  
 └┈┈md127             9:127  0  72.8T  0 raid6 
    └─srv.raid-80tb 254:1    0  72.7T  0 crypt /mnt/raid-80tb
    (…)

Automatic RAID assembly at system startup

So now we have our mounted file system on top of an encrypted RAID, which is nice and all, but as soon as the system reboots we'll just have some RAID member disks and we'll need to again start the RAID, unlock the encryption, and mount the file system

Ain't nobody got time for that

People came up with computers to automate stuff after all

To automatically assemble and start the RAID at system startup put this text into the file "/etc/mdadm.conf", replacing the UUID with the one of your RAID:

# RAID 6x20TB
DEVICE /dev/sd[abcdefghijklmnopqrstuvwxyz]
ARRAY /dev/md127 metadata=1.2 UUID=8b76d832:6ca64e3a:af5bc01e:1471551a

This causes the system to scan all available /dev/sdX devices for members of a RAID that has that UUID, and assemble and try to start that RAID during startup
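If you would rather not copy the UUID by hand, mdadm can print an ARRAY line for the running array that you can review, adapt and paste into "/etc/mdadm.conf" (the DEVICE line still has to be written yourself):

[root@the-server ~]# mdadm --detail --scan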

Automatic encryption unlocking at system startup

This is done with the regular crypttab mechanism, exactly the same as for any other regular non-root encrypted block device, with a key file

Create and add a key file to the LUKS2 volume

Basically, we create a small random-data file and use its contents as another "passphrase"

Important: This assumes that your system's root file system is already encrypted (a.k.a. "encrypted root", where you are prompted for a passphrase or have to provide a secret via some other means very early during startup to unlock the root file system), so that you may save the key file on the root file system without compromising the RAID's encryption

You absolutely must not save this key file to unencrypted storage

Also, you must save the key file to the root file system in a way that only root may access it, regular users must not be able to read it

As to where on the root file system it is saved, this is up to you, here the key file is put into a new restricted-permissions root-level directory "/local-secrets"

[root@the-server ~]# mkdir --mode=0700 /local-secrets

Create the key file with "/dev/random" and "dd" (1KiB of random binary data)

[root@the-server ~]# dd if=/dev/random iflag=fullblock of=/local-secrets/keyfile-raid-80tb.bin bs=1K count=1

1+0 records in
1+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 3.7677e-05 s, 27.2 MB/s

Restrict access to the file to root only

[root@the-server ~]# chmod u=rw,go= /local-secrets/keyfile-raid-80tb.bin

Assign the key file to key slot 0 of the LUKS2 volume

[root@the-server ~]# cryptsetup luksAddKey --key-slot 0 --new-keyfile /local-secrets/keyfile-raid-80tb.bin /dev/md127

WARNING: The --key-slot parameter is used for new keyslot number.
Enter any existing passphrase: 

When looking at the output of "cryptsetup luksDump /dev/md127", in the "Keyslots" section there should now be a new key slot 0
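To check just the key slot numbers without scrolling through the whole dump, a grep against the slot headers works too; the pattern assumes the indentation shown in the dump above:

[root@the-server ~]# cryptsetup luksDump /dev/md127 | grep -E '^  [0-9]+: luks2'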

Manually unlocking the encrypted RAID using the key file instead of the passphrase is done like this

[root@the-server ~]# cryptsetup --key-file=/local-secrets/keyfile-raid-80tb.bin open /dev/md127 srv.raid-80tb

Add an entry to /etc/crypttab

To automatically unlock the encrypted RAID on system startup with the key file, put this text into "/etc/crypttab"

# RAID 6x20TB
srv.raid-80tb /dev/md127 /local-secrets/keyfile-raid-80tb.bin

Automatically mount the XFS file system at system startup

This is textbook "/etc/fstab" stuff, for example like this

# RAID 6x20TB
/dev/mapper/srv.raid-80tb /mnt/raid-80tb xfs defaults 0 2
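If the file system is still mounted from the manual steps above, you can test the fstab entry without rebooting: unmount it, then mount it again by mount point only, which makes mount look the device up in "/etc/fstab":

[root@the-server ~]# umount /mnt/raid-80tb
[root@the-server ~]# mount /mnt/raid-80tb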

TODO Manual disassembly and assembly of the whole stack

TODO Top-down disassembly: umount -> cryptsetup close -> mdadm --stop

TODO Bottom-up assembly: mdadm --assemble … -> cryptsetup open … -> mount
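Until those sections are written, here is a minimal sketch of both sequences using the device and mapping names from this guide; it assumes the "/etc/mdadm.conf" entry from above exists so that --assemble --scan can find the array, otherwise the member disks have to be passed explicitly:

# Top-down disassembly
umount /mnt/raid-80tb
cryptsetup close srv.raid-80tb
mdadm --stop /dev/md127

# Bottom-up assembly
mdadm --assemble --scan
cryptsetup --key-file=/local-secrets/keyfile-raid-80tb.bin open /dev/md127 srv.raid-80tb
mount /dev/mapper/srv.raid-80tb /mnt/raid-80tb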

TODO Mention mdadm --readwrite for the case where, after system startup, the RAID is running but in read-only mode

TODO Data recovery, i.e. read-only access

TODO mdadm --readonly, cryptsetup --readonly, mount -o ro,norecovery