
ZFS tweaking

Posted: Wed May 24, 2023 5:49 pm
by zemerdon
I'm going to take another stab at helping you understand what's going on. I'm really sorry my post is so long, but I'm sick today and lack the time & attention span to really shrink its length.

Let's cover the easy case first: L2ARC. It's designed to accelerate small random reads. In fact, there's a governor on L2ARC unless you tweak it: it won't cache anything larger than 32k. What you're doing above is writing a very long stream of data 4k at a time; ZFS is going to store that as 128k records (the default recordsize) or up to 1024k records on versions with large-block support. Which basically means L2ARC is going to give you two middle fingers and nope the heck out of the workload; it sees big, sequential-looking blocks and won't help you out at all.
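If you want to see how a dataset will actually chop up that data, and whether L2ARC is even allowed to touch it, the properties are easy to check. The pool/dataset name below is just a placeholder for yours:

    # what record size this dataset writes, and what its L2ARC policy is
    zfs get recordsize,secondarycache tank/mydata

    # you can also restrict L2ARC to metadata (or turn it off) per dataset
    zfs set secondarycache=metadata tank/mydata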

L2ARC also has feed rate limitations, and by default won't feed faster than -- if I recall correctly -- around 40mbytes/sec. So if your ARC eviction rate is faster than that, stuff will expire before being fully committed to L2ARC. The ideal L2ARC workload is a long, slow feed of random latency-sensitive data, followed by very intense re-read of that data. In truth, I'm not very impressed with L2ARC for most workloads; just add RAM.
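If you're on OpenZFS on Linux and want to look at the feed governor yourself, the tunables live under /sys/module/zfs/parameters; the exact defaults vary by version, and the echo below is just an illustration, not a recommendation:

    # how many bytes L2ARC will feed per interval (plus the post-boot boost)
    cat /sys/module/zfs/parameters/l2arc_write_max
    cat /sys/module/zfs/parameters/l2arc_write_boost

    # whether prefetched (sequential-looking) buffers are skipped entirely
    cat /sys/module/zfs/parameters/l2arc_noprefetch

    # example only: raise the feed cap to 64MB per interval
    echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max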

Now a little harder case: SLOG. A lot of people think, "Hey! Add a SLOG SSD, it makes things FASTER!" but that's not really true. The SLOG gives synchronous writes a fast place to land (the ZIL), so ZFS can acknowledge them immediately and then flush lots of little random writes out to the main pool as large, sequential transaction-group IO. In essence, it converts an IOPS-oriented operation into a throughput-oriented operation. Since most hard drives have decent write throughput (50mbyte/sec or so on the slow end, 70-90mbyte/sec for 7200RPM) but fairly shitty random write IOPS (~80-something for a 7200RPM drive), this is usually a good trade to make: make most of the writes sequential, and your performance goes up.
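For what it's worth, adding or removing a SLOG is cheap to experiment with. The device names here are placeholders, and mirroring the log only matters if you care about in-flight sync writes surviving a SLOG death:

    # add a mirrored SLOG to an existing pool
    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

    # it shows up under "logs" in the status output
    zpool status tank

    # pull it back out later (use the log vdev name shown by zpool status)
    zpool remove tank mirror-1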

But wait! There's more!

ZFS by default will create your new zpool with "ashift=9" (512-byte sectors) rather than the "ashift=12" (4k sectors) that would match your devices. What this means in practice is that many single writes turn into a read-modify-write: the drive has to read the surrounding 4k physical sector, wait for the platter to come back around, and rewrite it. It's painful.
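You can check what you actually got, and force the right value at creation time; ashift is baked in per vdev, so there's no fixing it after the fact. Pool and disk names below are placeholders:

    # see what ashift the existing pool/vdevs were created with
    zdb -C tank | grep ashift

    # newer OpenZFS also exposes it as a pool property
    zpool get ashift tank

    # force 4k sectors when you rebuild
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd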

To make matters worse, you chose RAIDz2 for a 4-disk array. With 4k blocks but an 8k stripe width, the write pattern is going to be "4k data, 4k parity, 4k parity". You're crippling the potential throughput of your writes because you're writing 2 parity blocks for every single 4k data block. In essence, your test is limiting the throughput of your zpool to a single disk. Seriously, you may as well set up just one disk with no parity; you'll get the same performance out of it for this test.
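To put numbers on it (assuming ashift=12, so 4k sectors on disk):

    one 4k logical block on 4-disk RAIDz2:
        1 data sector    =  4k
      + 2 parity sectors =  8k
      = 12k hitting the platters for every 4k of payload (~3x write amplification)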

So OK. You have a zpool maybe capable of 50-70mbyte/sec with 4k writes, assuming it's a run-of-the-mill 7200RPM drive and 512e mode isn't biting you very hard. You've crippled your 4-drive array down to the effective throughput of 1 drive (or less, really). Now the uberblock updates come into play: you have 2 uberblock updates per disk with each txg_sync. These are in a fixed position on the platter, so there are at least 3 head seeks involved every 5 seconds for your write test.
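If you'd rather watch this happen than take my word for it, leave a per-vdev iostat running in another terminal while your test runs:

    # per-vdev bandwidth and IOPS, refreshed every 5 seconds
    zpool iostat -v tank 5

    # on Linux, the raw disks tell the same story; watch %util and await
    iostat -x 5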

Now you throw a SLOG in front of this, and say "I COMMAND YOU, ACCELERATE MY DATA!" But there's no way it can do very much. The test you're running right now, in fact, more or less just queues to RAM and streams to the zpool as fast as the zpool can go because you're not syncing every 4k anyway (it's not a synchronous workload).
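You can prove that to yourself with the sync property on the dataset (the name is a placeholder). sync=always shoves every write through the ZIL/SLOG, and sync=disabled takes the log out of the picture entirely; don't leave either set in production without understanding the consequences:

    # how the dataset currently treats synchronous writes
    zfs get sync tank/mydata

    # force everything through the ZIL/SLOG, for testing only
    zfs set sync=always tank/mydata

    # back to normal when you're done
    zfs set sync=standard tank/mydata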

The heart of ZFS is the zpool vdevs. They are the engine that allows everything else to work. And in your case, you took a good 4-cylinder engine for your vehicle, and by the nature of your test and format you crippled it down to a single cylinder, then strapped a turbocharger onto it and wonder why it won't perform.

You will always give up some performance strapping drives into redundant arrays: you're trading performance for reliability. That's the whole game with RAID anyway.

So start over. To give yourself a chance of success, I'd suggest this:

Reformat your pool with the expectation that you're going to use 4k blocks (if that's your goal here). Set ashift=12 at format time, or even -- as /u/mercenary_sysadmin has been pointing out lately -- ashift=13 for 8k. 4-disk RAIDz2 is even worse than mirror for a 4k workload; it's like mirror4. Use RAIDz1 or mirrors with this few disks (and in fact, I don't recommend 4-disk RAIDz1 anymore; 5-disk minimum IMHO), or buy two more disks for a RAIDz2 test with a reasonable chance of success. (There's a sketch of this whole do-over after the list.)
When you're LOADING your data, set the recordsize property to 4k. This carves the data up into 4k records, rather than a pile of big records in one big-ass file that doesn't exercise the SLOG at all, but just your RAIDz2 pool. It also means L2ARC can actually, you know, get used instead of ignored if you're testing that too. I think fio can get a similar effect if you force an fsync every 4k rather than once at the end of the file.
To really exercise your SLOG, what you'd want to do is have a bunch of overwrites to the same sets of blocks over & over instead of a big asynchronous write to new blocks.
To really exercise your L2ARC & ARC, what you'd want to do is have a bunch of reads to the same blocks over & over, where the quantity of the blocks is just slightly larger than your total L2ARC size.
To really hammer the SLOG, set up a tar xvf on a big archive with lots of small files over NFS. You'll usually see tremendous acceleration for this kind of load that ends up being a bunch of tiny syncs.
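Pulling all of that together, here's a rough sketch of the do-over I'm describing. Pool, dataset, and device names are placeholders, and the fio sizes/runtimes are purely illustrative; adjust them for your hardware:

    # start over (this destroys the pool and everything on it)
    zpool destroy tank

    # two mirrored pairs, 4k sectors forced
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

    # a benchmark dataset with 4k records
    zfs create -o recordsize=4k tank/bench

    # bolt on the SLOG you want to test
    zpool add tank log /dev/nvme0n1

    # sync-heavy 4k random writes -- this actually exercises the SLOG
    fio --name=slogtest --directory=/tank/bench --rw=randwrite --bs=4k \
        --size=4g --numjobs=4 --fsync=1 --group_reporting

    # random re-reads over a working set a bit bigger than your L2ARC
    fio --name=l2test --directory=/tank/bench --rw=randread --bs=4k \
        --size=300g --time_based --runtime=600 --group_reporting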
More reading material:

SLOG basics. You need to understand what it's doing to understand what it accelerates. https://blogs.oracle.com/brendan/entry/slog_screenshots
ZIL pipelining. I think there are some OpenZFS changes landing that are similar to this, but until recently ZFS could not fully utilize extreme-write-performance SSDs. Until around 2012, mainstream SSD write throughput was typically even worse than HDD writes; they only beat HDDs on IOPS. That's changed, and storage vendors are changing to take advantage of the shift. https://blogs.oracle.com/roch/entry/zil_pipelinening
Improvements to L2ARC, feed rates, and more: https://blogs.oracle.com/roch/entry/it_ ... dawning_of
SLOG is actually a hindrance if your goal is maximum throughput at the expense of everything else, like your test above seems to be. Chuck it in the bin and throw more vdevs at your load. SLOG and L2ARC are about improving random IO, not sequential IO. http://www.slideshare.net/albertspijker ... rage-zs3-2
And finally, an obligatory shill link to my blog: https://blogs.oracle.com/storageops/ent ... e_zfs_when

I hope that helps.