Version 1.4.1-alpha.5
2024, 2025 eomanis
PGP signature
How to create an encrypted software RAID 6 on Linux
For cold storage (lots of data that, once written, does not change much, but may be read now and then)
A guide that gets shit done, along with some dos and don'ts with whys and whyn'ts
This guide is about setting up a single bigass stripe-aligned XFS file system on top of a single dm-crypt volume on top of a single RAID
Nothing more, nothing less; a resilient low-complexity solution that is known to work reliably
If you do not want encryption you may leave out the respective steps, i.e. you create the file system directly on the RAID
You may also substitute XFS with a different file system type if you are so inclined
Here are our 6 20TB disks that will be assembled to a RAID 6, on top of which we'll put a dm-crypt volume, on top of which we'll create an XFS file system that is aligned to the RAID geometry for optimal performance:
[root@the-server ~]# ls -l /dev/disk/by-id | grep ata-TOSHIBA_
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXUXXXX -> ../../sde
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXOXXXX -> ../../sdb
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXFXXXX -> ../../sdf
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXAXXXX -> ../../sdc
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXIXXXX -> ../../sda
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXX6XXXX -> ../../sdd
A RAID 6 uses 2 disks for parity/recovery information, and since we have 6 disks that'll get us the effective storage space of 4 disks
Important: Concerning the space used by the RAID on each member disk, we must make sure to stay below 20TB so that we can be sure that any "20TB" disk can be used as replacement for a broken one
This is necessary because the exact capacity of a hard disk model may vary slightly above its advertised size; if the array were created using all available disk space, a replacement disk of a different model that happens to be a tiny bit smaller than these ones could not be used in it, and that would just ruin our day
Create the RAID on the raw, unpartitioned disks
Some guides recommend partitioning the disks to restrict the space that the RAID uses on each member disk, but that is bad practice, and here's why (skip over the bullet points if you don't care why):
The one – and arguably weak – argument usually made for partitioning the disks is that, if you in a bout of uncommon stupidity were to attach an unpartitioned RAID member disk to a particularly dumb operating system, that operating system might misidentify it as an empty disk and automatically partition and format it, and boom, that'd be one corrupt RAID member disk
Instead of partitioning, we restrict the space that the RAID uses on each member disk with mdadm's --size= parameter
Intention: We only want to use the first 19999GB of each disk, so that on a disk that is exactly 20TB a single GB remains unused at its end
But the --size= parameter assumes binary prefixes (KiB, MiB, GiB…), not decimal prefixes (kB, MB, GB…)
So, we need to convert those 19999GB to GiB. Full calculation as described:
((20*1000*1000*1000*1000) - (1*1000*1000*1000)) / 1024  / 1024  / 1024  = 18625GiB (rounded down to whole GiB)
((20TB in bytes         ) - (1GB in bytes    ))  -> KiB  -> MiB  -> GiB
Then again, who cares about another unused GiB per disk; better safe than sorry, so here we use 18624GiB
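If you don't feel like doing that conversion by hand, the shell's integer arithmetic gets the same result (a quick check; it prints 18625):
[root@the-server ~]# echo $(( (20*1000*1000*1000*1000 - 1*1000*1000*1000) / 1024 / 1024 / 1024 ))
18625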
[root@the-server ~]# mdadm --create /dev/md/raid --verbose --homehost=the-server --name=raid --raid-devices=6 --size=18624G --level=6 --bitmap=internal /dev/sd[abcdef]
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/raid started.
In retrospect, the disks could have been listed in ascending order of their serial numbers
[root@the-server ~]# mdadm --detail /dev/md/raid
/dev/md/raid:
           Version : 1.2
     Creation Time : Fri Feb 21 11:04:42 2024
        Raid Level : raid6
        Array Size : 78114717696 (72.75 TiB 79.99 TB)
     Used Dev Size : 19528679424 (18.19 TiB 20.00 TB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Feb 21 11:06:14 2024
             State : clean, resyncing
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

     Resync Status : 0% complete

              Name : the-server:raid
              UUID : 8b76d832:6ca64e3a:af5bc01e:1471551a
            Events : 20

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf
For the subsequent steps the chunk size of 512KiB is of interest
A newly created RAID needs to calculate and write its initial parity information and it will start doing so immediately; this is called "syncing"
Syncing happens transparently in the background; it may take a long time but is interruptible, e.g. if you reboot your PC and start the RAID again it will automatically resume syncing from where it was when it was stopped
Also, while the sync is running the RAID may already be used, albeit likely with reduced performance
Anyhow, you can check the sync progress by fetching the RAID status:
[root@the-server ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      78114717696 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  resync =  0.2% (54722848/19528679424) finish=1603.2min speed=202445K/sec
      bitmap: 146/146 pages [584KB], 65536KB chunk

unused devices: <none>
Here we can also see that the actual RAID block device is /dev/md127; we need this later
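If you would rather watch the progress tick along than re-run that command by hand, wrapping it in watch works fine (purely a convenience; Ctrl+C to quit):
[root@the-server ~]# watch -n 60 cat /proc/mdstat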
Before we start using it for real we are going to set up encryption on the RAID and then fill the whole thing with encrypted zeros, and for that (and for later use too of course) we want to ensure that the RAID can write with its maximum possible speed
We do this by increasing the size of the stripe cache, because its default size usually bottlenecks the RAID write speed
Each RAID member disk is sectioned into chunks of 512KiB; this is the "chunk size" we have seen above in the detailed RAID view
The RAID logic, i.e. the parity information calculation and data recovery, operates on groups of such chunks that consist of one chunk from each member disk, from the same chunk index, and such a group is called a "stripe"
This being a RAID 6, within each stripe 2 chunks hold parity/recovery information and the remaining 4 chunks contain effective data
Now, consider: Altering some small amount of data in a single chunk invalidates the parity information of the two parity chunks in that chunk's stripe, which must then be recalculated and updated on-disk, and on top of that, for parity recalculation the RAID logic needs the data from the other 3 data chunks, which it therefore has to read…
If you think "that sounds like it could be slow" you'd be right, which is why we want to only write whole stripes to the RAID if at all possible, because then the stripe's parity chunks can be calculated up front in memory without having to read anything from the RAID, and this is why there is a "stripe cache"
The stripe cache is a reserved area in RAM where writes to a specific RAID are collected before they are written to the disks, with the goal of accumulating whole stripes that can then be written faster
It is set to a default size of 256 when the RAID is assembled, and can be changed while the RAID is online
…Holup, 256 what? Chunks, stripes, MiB, bananas? How much RAM is that? Unfortunately official documentation seems scarce, but hearsay has it that it's (size * memory page size * total RAID disk count), so with the usual page size of 4KiB that would be (256 * 4KiB * 6) = 6144KiB = 6MiB, which does seem small indeed
We'll crank it to 8192 which according to that formula will use (8192 * 4KiB * 6) = 196608KiB = 192MiB of memory
[root@the-server ~]# echo 8192 > /sys/class/block/md127/md/stripe_cache_size
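A quick read-back of the same sysfs attribute confirms that the new value took:
[root@the-server ~]# cat /sys/class/block/md127/md/stripe_cache_size
8192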
Important: This setting is not persisted into the RAID configuration and must be set again each time after the RAID has been assembled
Typically you write a udev rule that does this and is triggered after the RAID has been assembled
Example: Text file "/etc/udev/rules.d/60-md-stripe-cache-size.rules"
# Set the RAID stripe cache size to 8192 for any RAID that is assembled on this system
SUBSYSTEM=="block", KERNEL=="md*", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}!="8192", ATTR{md/stripe_cache_size}="8192"
We use dm-crypt for encryption and create a LUKS2 volume
This particular dm-crypt volume will be unlocked automatically on startup with the crypttab mechanism using a key file that we will create and add a key slot for later
But we also want to be able to unlock the volume on its own in an emergency with a regular passphrase, and we create it with this passphrase now
Use key slot 1 for the passphrase, because key slot 0 will be used for the key file; that way, in the regular use case (automatic unlocking at system startup) unlocking is quicker, since slot 0 is tried first
Now this here is the first step where so-called "stripe alignment" must be considered
Stripe alignment is basically "having stuff on the RAID start at a stripe boundary" or more generally "being smart about where to write stuff to the RAID so that partial stripe writes are avoided"
The XFS file system that we will put on top of the LUKS2 volume will be configured to do just that, but making this work hinges on the encryption layer sitting between the RAID and the file system not introducing a "sub-stripe-width shift", so to speak
So, stripe alignment: Our stripe width, as it is perceived from anything that uses the RAID's storage space, is (512KiB chunk size * 4 effective disks) = 2MiB
The dm-crypt LUKS2 default data segment offset is 16MiB, which is an exact multiple of 2MiB, so we are good as-is and do not need to specify a custom --offset (which would otherwise have to go at the next whole-stripe boundary beyond 16MiB)
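If you want to double-check that claim rather than take it on faith, a shell one-liner does the modulo arithmetic (it prints 0, i.e. 16MiB divides evenly into 2MiB stripes):
[root@the-server ~]# echo $(( (16*1024*1024) % (4*512*1024) ))
0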
[root@the-server ~]# cryptsetup luksFormat --type luks2 --verify-passphrase --key-slot 1 --label srv.raid-80tb.encrypted /dev/md127
WARNING!
========
This will overwrite data on /dev/md127 irrevocably.

Are you sure? (Type 'yes' in capital letters): YES
Enter passphrase for /dev/md127:
Verify passphrase:
To give an example where we are not so lucky, let's say we had created the RAID 6 with one more disk, so 7 disks altogether, which yields the effective space of 5 disks
That would mean a stripe width of (512KiB chunk size * 5 effective disks) = 2560KiB = 2.5MiB, which 16MiB is not an exact multiple of: 16MiB / (512KiB * 5) = 6.4
Since we want the offset to be at least the default 16MiB, our target offset for the data segment would be at 7 times the stripe width, i.e. 7 * (512KiB * 5 effective disks) = 17920KiB
The --offset argument requires the offset to be supplied as number of 512B sectors, so we need to convert these KiB to sectors:
(17920 * 1024) / 512  = 35840s
(  KiB  ->  B ) -> s
Accordingly, we would add --offset 35840 to the luksFormat command line
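Again, the shell can verify the sector math (it prints 35840; the numbers are those of the hypothetical 7-disk layout above):
[root@the-server ~]# echo $(( (7 * 512*5 * 1024) / 512 ))
35840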
Back to our actual 6-disk RAID: after luksFormat, review the LUKS header
[root@the-server ~]# cryptsetup luksDump /dev/md127
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           3d1b3de7-cc0c-4fe4-81e2-270d38966ef7
Label:          srv.raid-80tb.encrypted
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
        offset: 16777216 [bytes]
        length: (whole device)
        cipher: aes-xts-plain64
        sector: 4096 [bytes]

Keyslots:
  1: luks2
        Key:        512 bits
        Priority:   normal
        Cipher:     aes-xts-plain64
        Cipher key: 512 bits
        PBKDF:      argon2id
        Time cost:  12
        Memory:     1048576
        Threads:    4
        Salt:       dc 6b 2b 54 94 63 3a e7 1b f1 c4 c3 5e 43 00 f6
                    fc 54 75 da f6 ba 7a 13 3e bb 72 b1 1d 7c 60 ba
        AF stripes: 4000
        AF hash:    sha256
        Area offset:32768 [bytes]
        Area length:258048 [bytes]
        Digest ID:  0
Tokens:
Digests:
  0: pbkdf2
        Hash:       sha256
        Iterations: 332998
        Salt:       ae 91 73 42 d5 d6 ed b7 83 d5 f2 43 3b 18 04 87
                    e2 40 26 23 80 e7 ae 7f a3 4f 20 d8 19 1c ab 9d
        Digest:     4f e6 a3 83 40 7a d4 65 24 84 dc 69 e4 f3 43 a7
                    c2 2e 28 ee e2 94 7b 9d 4d b8 4e 96 14 aa 46 6a
The data segment offset is 16777216 bytes, which is indeed 16MiB:
16777216B / 1024  / 1024  = 16MiB
  bytes  ->  KiB  -> MiB
[root@the-server ~]# cryptsetup open /dev/md127 srv.raid-80tb
Enter passphrase for /dev/md127:
Next, fill the entire opened dm-crypt volume with zeros; this causes the underlying RAID device to be filled with what looks like random data (the encrypted zeros), disguising how much space is actually used
[root@the-server ~]# dd if=/dev/zero iflag=fullblock of=/dev/mapper/srv.raid-80tb oflag=direct bs=128M status=progress
79989336702976 bytes (80 TB, 73 TiB) copied, 159750 s, 501 MB/s
dd: error writing '/dev/mapper/srv.raid-80tb': No space left on device
595968+0 records in
595967+0 records out
79989454143488 bytes (80 TB, 73 TiB) copied, 159751 s, 501 MB/s
Unlike the RAID sync this step does not automatically resume when it is interrupted, so be prepared to have your PC run for a day or two on end (the run shown above took a bit over 44 hours)
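That said, if the zero-fill does get interrupted you can resume it by hand: note the full "records out" count that dd prints when it stops, and restart it with seek= set to that many output blocks of the same size (a sketch with a made-up count of 100000, i.e. assuming 100000 * 128MiB had already been written):
[root@the-server ~]# dd if=/dev/zero iflag=fullblock of=/dev/mapper/srv.raid-80tb oflag=direct seek=100000 bs=128M status=progress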
XFS (and other file systems too) can apply optimizations for underlying RAID chunks and stripes if they know about them, which is what we want
Fortunately for us, mkfs.xfs correctly detects the RAID's chunk size and stripe width automatically, even "through" the dm-crypt layer
[root@the-server ~]# mkfs.xfs -L raid-80tb -m bigtime=1,rmapbt=1 /dev/mapper/srv.raid-80tb
meta-data=/dev/mapper/srv.raid-80tb isize=512    agcount=73, agsize=268435328 blks
         =                          sectsz=4096  attr=2, projid32bit=1
         =                          crc=1        finobt=1, sparse=1, rmapbt=1
         =                          reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                          bsize=4096   blocks=19528675328, imaxpct=1
         =                          sunit=128    swidth=512 blks
naming   =version 2                 bsize=4096   ascii-ci=0, ftype=1
log      =internal log              bsize=4096   blocks=521728, version=2
         =                          sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                      extsz=4096   blocks=0, rtextents=0
In the output, review the "data" section:
Does that match the RAID geometry, which is required for good performance?
The XFS "stripe unit" (sunit) is the RAID's "chunk size" as it is listed in the detailed RAID information (512KiB); they must be the same
Here it is expressed in blocks, so (128 * 4KiB block size) = 512KiB, same as the RAID chunk size, that tracks
As mentioned above, the stripe width is the stripe unit (RAID chunk size) multiplied by 4 effective disks, (512KiB * 4) = 2MiB
The mkfs.xfs output lists the stripe width (swidth) as 512 blocks, which is (512 * 4KiB block size) = 2MiB, so yes, we are good
Mounting does not require any fancy business with mount options; the defaults work fine
For example, to mount the XFS file system at "/mnt/raid-80tb" we'd do this:
[root@the-server ~]# mount /dev/mapper/srv.raid-80tb /mnt/raid-80tb
We can see the stripe unit (sunit) and stripe width (swidth) in the mount options when we look at the mounted file system:
[root@the-server ~]# mount | grep /mnt/raid-80tb
/dev/mapper/srv.raid-80tb on /mnt/raid-80tb type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=4096,noquota)
This time, though, they are expressed as a number of 512-byte blocks; it says so in the "MOUNT OPTIONS" section of the XFS user manual
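Should you ever want to re-check the geometry later, xfs_info on the mount point reports the same values, formatted like the mkfs.xfs output above (i.e. sunit/swidth in file system blocks again):
[root@the-server ~]# xfs_info /mnt/raid-80tb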
At this point your block device stack should look like this:
[root@the-server ~]# lsblk --merge
NAME                MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
┌┈> sda               8:0    0 18.2T  0 disk
├┈> sdb               8:16   0 18.2T  0 disk
├┈> sdc               8:32   0 18.2T  0 disk
├┈> sdd               8:48   0 18.2T  0 disk
├┈> sde               8:64   0 18.2T  0 disk
└┬> sdf               8:80   0 18.2T  0 disk
 └┈┈md127             9:127  0 72.8T  0 raid6
    └─srv.raid-80tb 254:1    0 72.7T  0 crypt /mnt/raid-80tb
(…)
So now we have our mounted file system on top of an encrypted RAID, which is nice and all, but as soon as the system reboots we'll just have some RAID member disks, and we'll need to start the RAID, unlock the encryption, and mount the file system all over again
Ain't nobody got time for that
People came up with computers to automate stuff after all
To automatically assemble and start the RAID at system startup put this text into the file "/etc/mdadm.conf", replacing the UUID with the one of your RAID:
# RAID 6x20TB
DEVICE /dev/sd[abcdefghijklmnopqrstuvwxyz]
ARRAY /dev/md127 metadata=1.2 UUID=8b76d832:6ca64e3a:af5bc01e:1471551a
This causes the system to scan all available /dev/sdX devices for members of a RAID that has that UUID, and assemble and try to start that RAID during startup
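If you would rather not type the ARRAY line by hand, mdadm will print a suitable one for you; review it before pasting it into "/etc/mdadm.conf":
[root@the-server ~]# mdadm --detail --scan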
Automatic unlocking of the encrypted volume at system startup is done with the regular crypttab mechanism, exactly the same as for any other regular non-root encrypted block device, with a key file
Basically, we create a small random-data file and use its contents as another "passphrase"
Important: This assumes that your system's root file system is already encrypted (a.k.a. "encrypted root", where you are prompted for a passphrase or have to provide a secret via some other means very early during startup to unlock the root file system), so that you may save the key file on the root file system without compromising the RAID's encryption
You absolutely must not save this key file to unencrypted storage
Also, you must save the key file to the root file system in a way that only root may access it, regular users must not be able to read it
As to where on the root file system it is saved, this is up to you; here the key file is put into a new restricted-permissions root-level directory "/local-secrets"
[root@the-server ~]# mkdir --mode=0700 /local-secrets
Create the key file with "/dev/random" and "dd" (1KiB of random binary data)
[root@the-server ~]# dd if=/dev/random iflag=fullblock of=/local-secrets/keyfile-raid-80tb.bin bs=1K count=1
1+0 records in
1+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 3.7677e-05 s, 27.2 MB/s
Restrict access to the file to root only
[root@the-server ~]# chmod u=rw,go= /local-secrets/keyfile-raid-80tb.bin
Assign the key file to key slot 0 of the LUKS2 volume
[root@the-server ~]# cryptsetup luksAddKey --key-slot 0 --new-keyfile /local-secrets/keyfile-raid-80tb.bin /dev/md127
WARNING: The --key-slot parameter is used for new keyslot number.
Enter any existing passphrase:
When looking at the output of "cryptsetup luksDump /dev/md127", in the "Keyslots" section there should now be a new key slot 0
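To check that the key file actually unlocks a key slot without mapping the device, cryptsetup can do a dry run (it exits with status 0 on success and prints nothing):
[root@the-server ~]# cryptsetup open --test-passphrase --key-file=/local-secrets/keyfile-raid-80tb.bin /dev/md127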
Manually unlocking the encrypted RAID using the key file instead of the passphrase can be done like this
[root@the-server ~]# cryptsetup --key-file=/local-secrets/keyfile-raid-80tb.bin open /dev/md127 srv.raid-80tb
To automatically unlock the encrypted RAID on system startup with the key file, put this text into "/etc/crypttab"
# RAID 6x20TB
srv.raid-80tb /dev/md127 /local-secrets/keyfile-raid-80tb.bin
Automatically mounting the file system at startup is textbook "/etc/fstab" stuff, for example like this
# RAID 6x20TB
/dev/mapper/srv.raid-80tb /mnt/raid-80tb xfs defaults 0 2
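Before relying on it at the next boot, you can exercise the new fstab entry right away: unmount the file system if it is still mounted, then mount it again by its mount point only, which forces mount to look up the rest in "/etc/fstab":
[root@the-server ~]# umount /mnt/raid-80tb
[root@the-server ~]# mount /mnt/raid-80tb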
TODO Top-down disassembly: umount -> cryptsetup close -> mdadm --stop
TODO Bottom-up assembly: mdadm --start … -> cryptsetup open … -> mount
TODO Mention mdadm --readwrite for if after system startup the RAID is running but is in read-only mode
TODO mdadm --readonly, cryptsetup --readonly, mount -o ro,norecovery