https://zhuanlan.zhihu.com/p/138888453
EXT3 has been infamous for its slowness in our product for a long time, and we have heard many times that we should upgrade to a more modern filesystem such as EXT4. Since the filesystem is the users' interface to the I/O stack, it is natural to blame EXT3 when we encounter slow responses from the filesystem. However, without a thorough understanding of the problem, it is too soon to draw a conclusion. After all, there are many layers in the I/O stack, and the filesystem is just one of them, so we had better take a thorough look at the whole stack, from the application layer down to the disk. The simplified I/O stack is as follows:
The I/O stack has the following layers, from the bottom up:
The data service on the right is not our focus here. The userspace applications on the left fall into two categories: we should try our best to satisfy the ones with high I/O priority, or the customer might be impacted; the others, such as tarring the log files, have low impact.
One thing not mentioned yet is that an NVRAM is used for performance; the HDDs by themselves cannot deliver the required performance.
The RAID layout for the APP is as follows:
The access time of an HDD is a measure of the time it takes before the drive can actually transfer data. It typically consists of two parts: the seek time and the rotational latency. The I/O scheduler usually attempts to merge the I/O requests and dispatch them in a specific order so as to reduce the seek time and rotational latency. However, even with sequential I/O, the performance of the inner and outer tracks is not the same nowadays: the bit density is constant, so the longer outer tracks hold more bits than the inner tracks, and therefore the outer tracks deliver higher throughput.
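As a rough formula (the 7200 RPM figure below is only an illustrative assumption, not a spec of our drives):

$$ T_{access} = T_{seek} + T_{rot}, \qquad \overline{T}_{rot} = \frac{1}{2} \cdot \frac{60\,\mathrm{s}}{\mathrm{RPM}} $$

For example, a 7200 RPM drive has an average rotational latency of about 0.5 x 60/7200 ≈ 4.2 ms.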
From the test results, it is clear that the throughput varies a lot at different offsets; we should take this into consideration if throughput matters.
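The effect is easy to reproduce. Below is a minimal sketch, not the fio jobs used for the tests above: it reads a fixed amount of data at a few offsets of a raw disk with O_DIRECT and prints the throughput. The device path is a placeholder and root privileges are assumed.

    /* read_offsets.c -- rough sequential-read throughput at a few offsets.
     * Build: gcc -O2 -o read_offsets read_offsets.c
     * Usage: ./read_offsets /dev/sdX   (placeholder device, run as root)   */
    #define _GNU_SOURCE
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)          /* 1 MiB per read                   */
    #define TOTAL (256LL << 20)      /* 256 MiB sampled per offset       */

    static double read_mbps(int fd, off_t offset, void *buf)
    {
        struct timespec t0, t1;
        long long done = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (done < TOTAL) {
            ssize_t n = pread(fd, buf, CHUNK, offset + done);
            if (n <= 0) { perror("pread"); exit(1); }
            done += n;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return (double)done / (1 << 20) / secs;
    }

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/sdX";   /* placeholder */
        int fd = open(dev, O_RDONLY | O_DIRECT);   /* bypass the page cache */
        if (fd < 0) { perror("open"); return 1; }

        off_t dev_size = lseek(fd, 0, SEEK_END);
        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK)) return 1;  /* O_DIRECT alignment */

        /* low LBAs sit on the long outer tracks, high LBAs on the inner ones */
        off_t offsets[3] = { 0, dev_size / 2, dev_size - TOTAL - CHUNK };
        for (int i = 0; i < 3; i++) {
            offsets[i] &= ~(off_t)(CHUNK - 1);            /* keep offsets aligned */
            printf("offset %4lld GiB: %.1f MiB/s\n",
                   (long long)(offsets[i] >> 30), read_mbps(fd, offsets[i], buf));
        }
        close(fd);
        return 0;
    }

The number printed at offset 0 should be noticeably higher than the one near the end of the disk.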
Now that we know the disk performance characteristics, the next step is to understand them across the whole I/O stack. As we are testing sequential I/O, the ideal throughput of the filesystem can be calculated with this formula:
We call each raid1 above a subarray, and the raid0 on top of them the array. In our RAID, there is an extra sector for each stripe unit, so for a 4KB (8 sectors) stripe unit it actually has to access 9 sectors.
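A sketch of that calculation, assuming the raid0 stripes across N subarrays, each subarray sustains roughly the sequential throughput T_disk of a single drive at the given offset, and every 8 sectors of data cost 9 sectors on disk as described above:

$$ T_{ideal} \approx N \cdot T_{disk} \cdot \frac{8}{9} $$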
We can draw the following conclusions from the test results above:
The stripe unit size of the raid1 is 4KB. In order to service requests larger than the stripe unit size, the raid1 breaks them into 4KB pieces and submits them individually. Unless the block layer merges them again, these small pieces reach the disks as independent small requests, so the plugging mechanism in the block layer is likely to be the answer. A patch was made; briefly speaking, it submits the lower-layer requests together if they come from the same upper-layer request, giving the I/O scheduler a chance to merge them back. Plugging is not meant to make things better by itself; its purpose is to keep the split from making things worse.
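A minimal, kernel-style sketch of the idea (this is not the actual patch; the function below and the single-argument form of submit_bio(), which varies across kernel versions, are assumptions for illustration):

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /* Submit all the 4KB pieces split from one upper-layer request inside a
     * single plug, so the I/O scheduler can merge them back into a few large
     * requests before they are dispatched to the disks. */
    static void submit_split_pieces(struct bio **pieces, int nr_pieces)
    {
            struct blk_plug plug;
            int i;

            blk_start_plug(&plug);          /* hold requests in a per-task list */
            for (i = 0; i < nr_pieces; i++)
                    submit_bio(pieces[i]);  /* queued, not dispatched yet       */
            blk_finish_plug(&plug);         /* unplug: merged requests go out   */
    }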
After applying the plugging patch, we can see that the throughput is improved for sequential I/O; for reads especially, it is very close to optimal.
However, for writes there is still huge room to optimize. If we use iostat or any other I/O monitor, we can see that the requests sent to the underlying layer are not large enough for sequential writes. It should be impossible to have such small requests after enabling the plugging mechanism. After some debugging, I found that there is a constraint in the RAID which limits the amount of data in flight to disk to at most 400 * 4KB = 1600KB. That is a pretty small value, especially in the case of 4 subarrays; please note that all the subarrays compete for this same pool. That is why we see the drop on the super array but not on a single subarray. After setting the limit to a larger value, the sequential write throughput improves as well.
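As a side note, confirming the small request sizes does not require full iostat. Below is a minimal sketch that samples /proc/diskstats twice and derives the average size of the write requests completed in between (roughly what iostat reports as avgrq-sz, restricted to writes); the device name is whatever appears in /proc/diskstats.

    /* avgwrsz.c -- average write request size of a block device over an interval.
     * Build: gcc -O2 -o avgwrsz avgwrsz.c
     * Usage: ./avgwrsz <device> [seconds]      e.g. ./avgwrsz md0 5             */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    struct ds { unsigned long long wr_ios, wr_sectors; };

    static int sample(const char *dev, struct ds *out)
    {
        char line[512], name[64];
        unsigned maj, min;
        unsigned long long rd_ios, rd_merges, rd_sec, rd_ms, wr_ios, wr_merges, wr_sec;
        FILE *f = fopen("/proc/diskstats", "r");
        int found = -1;

        if (!f) return -1;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu %llu %llu",
                       &maj, &min, name, &rd_ios, &rd_merges, &rd_sec, &rd_ms,
                       &wr_ios, &wr_merges, &wr_sec) == 10 && !strcmp(name, dev)) {
                out->wr_ios = wr_ios;      /* writes completed so far         */
                out->wr_sectors = wr_sec;  /* 512-byte sectors written so far */
                found = 0;
                break;
            }
        }
        fclose(f);
        return found;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <device> [seconds]\n", argv[0]); return 1; }
        int secs = argc > 2 ? atoi(argv[2]) : 5;
        struct ds a, b;

        if (sample(argv[1], &a)) { fprintf(stderr, "%s not in /proc/diskstats\n", argv[1]); return 1; }
        sleep(secs);
        if (sample(argv[1], &b)) { fprintf(stderr, "%s not in /proc/diskstats\n", argv[1]); return 1; }

        unsigned long long ios = b.wr_ios - a.wr_ios;
        unsigned long long sectors = b.wr_sectors - a.wr_sectors;
        if (!ios) { printf("no writes completed in %d s\n", secs); return 0; }
        printf("%llu writes, avg request size %.1f KB\n", ios, sectors * 512.0 / 1024.0 / ios);
        return 0;
    }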
I also did a basic comparison between EXT3, EXT4 and BTRFS. From the test results, we can see that the EXT3 filesystem might not be the best choice, especially for the write tests.
The comparison between EXT3, EXT4 and BTRFS deserves a separate analysis, so I will not go further here. However, we can learn a little from the plots of the fio results: EXT3 shows more deviation, while EXT4 and BTRFS are more consistent, although consistency has no strict relation with throughput.
In order to measure the latency of the write/fsync operations, strace is used to capture the time spent in the syscalls; strace is good enough here in terms of both accuracy and overhead. In this test, another fio process runs in the background generating as much I/O as possible, to simulate I/O pressure. As write() is almost always short in this test (buffered writes), I will plot the time of fsync() only. The latency is unacceptable: it takes about 20 seconds to complete one 4KB fsync().
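The same measurement can also be made without strace. Here is a minimal sketch that appends 4KB, calls fsync(), and prints how long each fsync() took; the test file path is a placeholder, and it is meant to run while the background load is applied.

    /* fsync_lat.c -- time each fsync() of a small append under background load.
     * Build: gcc -O2 -o fsync_lat fsync_lat.c                                  */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/ext3/fsync_probe";   /* placeholder test file */
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < 60; i++) {
            if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); return 1; }

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (fsync(fd)) { perror("fsync"); return 1; }      /* the slow part */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
            printf("fsync %2d: %.1f ms\n", i, ms);
            sleep(1);   /* pace the probe: we measure latency, not throughput */
        }
        close(fd);
        return 0;
    }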
What if the I/O scheduler is switched to cfq for the RAID devices? The time spent in each fsync() drops from 20s to about 2s, which is a huge improvement, although 2s is still not good enough. The reason for this improvement is that cfq gives synchronous requests higher priority than asynchronous requests.
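For completeness, the scheduler can be switched per device through sysfs. The tiny sketch below does the same thing as echoing cfq into /sys/block/<dev>/queue/scheduler; the device name is a placeholder and root privileges are required.

    /* set_cfq.c -- select the cfq I/O scheduler for a block device via sysfs.
     * Build: gcc -O2 -o set_cfq set_cfq.c      Usage: ./set_cfq sdX            */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "sdX";        /* placeholder */
        char path[256];
        snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);

        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return 1; }
        if (fputs("cfq\n", f) == EOF) { perror("write"); return 1; }
        fclose(f);   /* buffered output is flushed here; the kernel switches schedulers */
        return 0;
    }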
Now it is time to think about this 2s latency. As we know, EXT3 supports several data integrity options: journal, ordered and writeback.
Ordered mode is currently used. The following is a very simplified description of EXT3 and JBD, which leaves out many details. In order to maintain internal filesystem integrity, EXT3/JBD uses a "handle" to track a series of low-level operations that have to be atomic. For the sake of performance, these handles are grouped into "transactions" and committed to disk together. For each filesystem, there is only one kernel thread handling the transactions, in FIFO order. Now consider the filesystem mounted with data=ordered; there are two restrictions: (1) transactions commit in FIFO order, so a transaction can only commit after all earlier transactions have committed; (2) before a transaction commits, the data blocks it covers must already be flushed to disk.
So in order to fsync Data 6 (D6), we not only have to flush M6, we also have to flush all of D{1..6} and M{1..6}. With data=writeback, however, we only have to flush D6 and M{1..6}.
This is the result of the data=writeback test. Most of the requests complete within 0.4s; it would be a big win if we could switch the mount option to data=writeback.
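For reference, a minimal sketch of mounting with that option through the mount(2) syscall (the device and mount point are placeholders; the same thing can be done with the mount command or in fstab):

    /* mount_wb.c -- mount an EXT3 filesystem with data=writeback.
     * Build: gcc -O2 -o mount_wb mount_wb.c      (run as root)               */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* the last argument carries the filesystem-specific option string */
        if (mount("/dev/sdX", "/mnt/point", "ext3", 0, "data=writeback")) {
            perror("mount");
            return 1;
        }
        return 0;
    }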