Understanding Modern Storage APIs: A systematic study of libaio, SPDK, and io_uring

ABSTRACT

Recent high-performance storage devices have exposed software inefficiencies in existing storage stacks, leading to a new breed of I/O stacks. The newest storage API of the Linux kernel is io_uring. We perform one of the first in-depth studies of io_uring, and compare its performance and dis/advantages with the established libaio and SPDK APIs. Our key findings reveal that (i) polling design significantly impacts performance; (ii) with enough CPU cores io_uring can deliver performance close to that of SPDK; and (iii) performance scalability over multiple CPU cores and devices requires careful consideration and necessitates a hybrid approach. Last, we provide design guidelines for developers of storage intensive applications.

INTRODUCTION

Modern non-volatile memory (NVM) storage technologies, like Flash and Optane SSDs, can support down to single digit 𝜇second latencies, and up to multi GB/s bandwidth with millions of I/O operations per second (IOPS). CPU performance improvements have stalled over the past years due to various manufacturing and technical limitations [8].

As a result, researchers have put considerable effort into identifying new CPU-efficient storage APIs, abstractions, designs, and optimizations [2, 3, 11, 13, 15, 19, 22, 25, 26, 30, 31]. One specific API, io_uring, has drawn much attention from the community due to its versatile and high performance interface [5, 15, 16, 18, 27, 34]. io_uring was introduced in 2019 and has been merged in Linux v5.1. It brings together many well established ideas from the high performance storage and networking communities, such as asynchronous I/O, shared memory-mapped queues, and polling (Section 2) [9, 10, 31, 32].

With the addition of io_uring, Linux now has multiple ways of accessing a storage device. In this paper, we look at Linux Asynchronous I/O (libaio) [6, 24], the Storage Performance Development Kit (SPDK) from Intel® [13], and io_uring [15, 17, 18]. These APIs have different parameters, deployment models, and characteristics, which make understanding their performance and limitations a challenging task. The use of the io_uring API and its performance has been the focus of recent studies [7, 28, 33, 36]. However, to the best of our knowledge, there is no systematic study of these APIs that provides design guidelines for the developers of I/O intensive applications. There has also been an extensive body of work in studying system call overhead [29], implementing better interrupt management for I/O devices [30], leveraging polling for fast storage devices [38], using I/O speculation for 𝜇second-scale devices such as NVMe drives [35], and improving the performance of the Linux block layer in general [3, 39, 40]. These works are orthogonal to ours, since they explore designing new storage stacks, while we focus on the performance characteristics of state-of-the-art APIs that are readily available in Linux.

Our main contributions include (i) a systematic comparison of libaio, io_uring, and SPDK, that evaluates their latency, IOPS, and scalability behaviors; (ii) a first-of-its-kind detailed evaluation of the different io_uring configurations; and (iii) design guidelines for high-performance applications using modern storage APIs. Our key findings reveal that:
Not all polling methods are created equal. We evaluate different polling mechanisms. SPDK offers a single userspace polling mechanism for both submission and completion, while io_uring offers two options that can be enabled independently: polling for completion, and kernel-thread polling for submission. We observe that polling can be both the key to achieving high performance and the cause of order-of-magnitude performance losses (Section 3.1).
io_uring is close to SPDK. io_uring with kernel polling can deliver performance close to SPDK (within 10%), thanks to the elimination of system calls from the I/O path. However, this performance needs twice as many CPU cores as SPDK (Section 3.2).
Performance scalability warrants careful consideration. When not enough CPU cores are available, io_uring with kernel polling can lead to a collapse of performance. Hence, a hybrid and scale-aware approach needs to be taken for the selection of the API and its parameters (Section 3.3).


LIBAIO, SPDK, IO_URING: A PRIMER

libaio. The Linux asynchronous API allows applications to interact with any block device (HDD, SATA SSDs, and NVMe SSDs) in an asynchronous fashion [6, 24]. The main benefits of libaio are its ease of use, flexibility, and high performance compared to the traditional blocking I/O APIs. The design of libaio revolves around two main system calls: io_submit to submit I/O requests to the kernel, and io_getevents to retrieve the completed I/O requests. The main limitation of libaio is its per I/O performance overhead [20, 33], which stems from relying on two system calls per I/O operation, using interrupt-based completion notifications, and copying meta-data [15, 33]. Moreover, libaio only supports unbuffered accesses (i.e., with O_DIRECT) [15, 18].
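
To make the two-system-call flow concrete, the sketch below issues a single 4 KiB read with libaio. It is a minimal illustration, not taken from the paper's benchmark code: the device path, offset, and queue depth are placeholders, error handling is reduced to asserts, and it is compiled against the libaio userspace library (-laio).

    /* Minimal libaio read: one io_submit and one io_getevents per I/O.
       Sketch only; device path, offset, and queue depth are placeholders. */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <assert.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* libaio requires O_DIRECT */
        assert(fd >= 0);

        void *buf;
        assert(posix_memalign(&buf, 4096, 4096) == 0);        /* O_DIRECT needs aligned buffers */

        io_context_t ctx = 0;
        assert(io_setup(128, &ctx) == 0);                      /* per-context queue depth */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);                  /* 4 KiB read at offset 0 */
        assert(io_submit(ctx, 1, cbs) == 1);                   /* system call 1: submit */

        struct io_event ev;
        assert(io_getevents(ctx, 1, 1, &ev, NULL) == 1);       /* system call 2: wait (interrupt-driven) */

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }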

SPDK. Introduced by Intel in 2010, SPDK [13] is the de facto high-performance API in Linux, used by many projects [14, 20, 23, 37, 41]. SPDK implements a zero-interrupt, zero-copy, poll-driven NVMe driver in user space. PCIe registers are mapped to user space to configure submission (SQs) and completion (CQs) queues shared between a device and an application. I/O requests are submitted to the SQs and completions are polled from the CQs without the need for interrupts or system calls. The downsides of SPDK are its increased complexity and reduced scope of usability with respect to libaio: SPDK does not support Linux file system integration, and cannot benefit from many kernel storage services such as access control, QoS, scheduling, and quota management.
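
As a contrast to libaio's system-call path, the fragment below sketches SPDK's polled hot path: the read is queued directly into the user-space submission queue and the completion queue is reaped by busy-polling, with no system calls or interrupts. It assumes that controller probing and namespace/qpair setup (boilerplate omitted here) have already produced ns and qpair; the buffer size and LBA are placeholders.

    /* Sketch of SPDK's polled I/O path; assumes spdk_nvme_probe() has already
       attached the controller and produced ns and qpair (setup omitted). */
    #include <stdbool.h>
    #include <spdk/env.h>
    #include <spdk/nvme.h>

    static volatile bool done;

    static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
    {
        (void)arg; (void)cpl;
        done = true;                           /* runs from within the polling call below */
    }

    static void read_one_block(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair)
    {
        /* DMA-able buffer from SPDK's hugepage-backed allocator */
        void *buf = spdk_zmalloc(4096, 4096, NULL, SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);

        /* Enqueue the read in the user-space SQ: no system call is issued */
        spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* LBA count */,
                              read_done, NULL, 0);

        /* Busy-poll the CQ until the callback fires: no interrupts either */
        while (!done)
            spdk_nvme_qpair_process_completions(qpair, 0);

        spdk_free(buf);
    }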

io_uring. io_uring aims to bridge the gap between the ease of use and flexibility of libaio and the high performance of SPDK. io_uring (i) implements a shared memory-mapped, queue-driven request/response processing framework; (ii) supports POSIX asynchronous data accesses on both direct and buffered I/O; and (iii) works with different block devices (e.g., HDDs, SATA SSDs, and NVMe SSDs) and with any file system (and files). io_uring achieves low meta-data copy and system call overhead by implementing two ring data structures that are mapped into user space and shared with the kernel. The submission ring contains the I/O requests posted by the application. The completion ring contains the results of completed I/O requests. The application can insert and retrieve I/O entries by updating the head/tail pointers of the rings atomically, without using system calls.
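
In application code, the rings are usually manipulated through the liburing helper library rather than through raw io_uring_enter calls. The sketch below shows the default, interrupt-driven completion mode (Figure 1a, described below): an SQE is filled in user space, io_uring_submit issues io_uring_enter, and io_uring_wait_cqe blocks for the completion. The device path and offset are placeholders.

    /* Minimal io_uring read via liburing, default (interrupt-driven) mode. */
    #define _GNU_SOURCE
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <assert.h>

    int main(void)
    {
        struct io_uring ring;
        assert(io_uring_queue_init(128, &ring, 0) == 0);      /* 128-entry rings, no special flags */

        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        assert(fd >= 0);
        void *buf;
        assert(posix_memalign(&buf, 4096, 4096) == 0);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);    /* grab a free SQE from the shared ring */
        io_uring_prep_read(sqe, fd, buf, 4096, 0);             /* 4 KiB read at offset 0 */
        io_uring_submit(&ring);                                /* one io_uring_enter call */

        struct io_uring_cqe *cqe;
        assert(io_uring_wait_cqe(&ring, &cqe) == 0);           /* blocks until the CQE arrives */
        assert(cqe->res == 4096);
        io_uring_cqe_seen(&ring, cqe);                         /* advance the CQ head */

        io_uring_queue_exit(&ring);
        close(fd);
        free(buf);
        return 0;
    }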

io_uring can perform I/O in different ways. Figure 1 provides a visual representation of the different io_uring I/O modes, which we describe below and evaluate in Section 3. By default, the application notifies the kernel about new requests in the submission ring using the io_uring_enter system call. As the completion ring is mapped in user space, the application can check for completed I/O by polling, without issuing any system calls. Alternatively, the same io_uring_enter system call can be used to wait for completed I/O requests. io_uring_enter supports both an interrupt-driven (Figure 1a) and a polling-based (Figure 1b) I/O completion. The io_uring_enter system call can be used to submit new I/O requests and at the same time wait for completed I/O requests. This allows reducing the number of system calls per I/O. io_uring further supports an operational mode that requires no system calls. In this mode, io_uring spawns a kernel thread (one per io_uring context) that continuously polls the submission ring for new I/O requests (Figure 1c).

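With liburing, the completion-polling and kernel-thread-polling modes of Figures 1b and 1c are selected when the rings are created. The fragment below is a sketch of the relevant setup flags; note that IORING_SETUP_IOPOLL requires O_DIRECT files on a device whose driver supports polled I/O, and the sq_thread_idle value is illustrative.

    /* Selecting the io_uring I/O modes of Figure 1 at ring-creation time (sketch). */
    #include <liburing.h>
    #include <string.h>

    static int setup_ring(struct io_uring *ring, int mode)
    {
        struct io_uring_params p;
        memset(&p, 0, sizeof(p));

        if (mode == 1)
            /* Figure 1b: io_uring_enter busy-polls the device completion queue
               instead of sleeping on an interrupt. */
            p.flags = IORING_SETUP_IOPOLL;
        else if (mode == 2) {
            /* Figure 1c: a kernel thread polls the submission ring, so submitting
               I/O requires no system call while the thread is awake. */
            p.flags = IORING_SETUP_SQPOLL;
            p.sq_thread_idle = 2000;   /* ms of idleness before the poller sleeps (illustrative) */
        }
        return io_uring_queue_init_params(128, ring, &p);
    }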

[Figure 1: io_uring I/O modes: (a) interrupt-driven completion, (b) completion polling, (c) kernel-thread submission polling.]

PERFORMANCE EVALUATION



In this section, we compare the performance of libaio, io_uring, and SPDK, using fio [1] as the workload generator. We used fio because of (i) its flexibility in generating I/O workloads; (ii) its low-overhead I/O path; and (iii) its full support for SPDK and different io_uring configurations. We configured fio to perform random data reads at the granularity of 4KiB using unbuffered I/O. We chose a read-only workload because it allows higher IOPS on our drives with respect to a mixed workload or a write-only one [12]. The higher IOPS allowed a better evaluation of the scalability trends of the different APIs and of the effects of the overhead per I/O operation (e.g., system calls). We used the default values for the workload generation parameters. We also used the default configuration parameters for each API, except for io_uring, which we evaluated under three different configurations: (i) iou: uses io_uring_enter to submit new I/O, and uses io_uring_enter with interrupts to wait for completed requests if none are available upon submission (default in fio, Figure 1a); (ii) iou+p: same as iou except that it uses polling instead of interrupts in io_uring_enter (hipri parameter in fio, Figure 1b); (iii) iou+k: uses the kernel poller thread for I/O submission, and uses polling to become aware of completed I/O (sqthread_poll parameter in fio, Figure 1c) – iou+k has zero system call overhead per I/O. Table 1 describes our benchmarking setup.

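For reference, a fio job along the lines of the sketch below selects the three io_uring configurations; the device path, runtime, and queue depth are illustrative, and the libaio and SPDK runs simply switch the ioengine (the latter via SPDK's external fio plugin).

    # Sketch of the fio job used for the io_uring configurations (values illustrative).
    [global]
    filename=/dev/nvme0n1     # raw block device (placeholder path)
    ioengine=io_uring
    rw=randread
    bs=4k
    direct=1                  # unbuffered I/O
    iodepth=128
    time_based=1
    runtime=60

    [iou]                     # default: io_uring_enter + interrupts (Figure 1a)

    # For iou+p, add:  hipri=1           (completion polling, Figure 1b)
    # For iou+k, add:  sqthread_poll=1   (kernel submission polling, Figure 1c)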

Understanding Polling

We first measured the performance of the three libraries when running a single fio job that targets a single NVMe drive, using multiple queue depths (from 1 to 128). We ran this experiment in two variants, with one or two CPU cores (the other cores were disabled via /sys/devices/system/cpu/cpuN/online), placed on the same NUMA node as the drive. Figure 2a and Figure 2b report the IOPS obtained for different queue depth values with one core and two cores, respectively. Figure 2c and Figure 2d report the median latencies corresponding to the IOPS values obtained on one core and two cores, respectively. We also ran the tests on three cores without observing any significant difference. We note that some libraries exhibit latency functions with the shape of a ‘hook’. This is due to the fact that, for these libraries, increasing the queue depth past the saturation point leads to a latency increase and a slight decrease in IOPS, due to increased overhead. Figure 3 reports the average number of system calls per I/O operation (which is the same on one and two cores). The results lead to three primary observations.

[Figures 2 and 3: IOPS, median latency, and system calls per I/O for libaio, the io_uring variants, and SPDK on one and two cores.]

First, with a single core, iou+k suffers a catastrophic performance loss delivering only 13 KIOPS, i.e., one order of magnitude less than the other APIs. In this configuration, the fio thread and the kernel poller thread share the single CPU in mutual exclusion. The kernel thread takes up a significant share of the CPU cycles (50% according to our perf tracing), which leads to delays in processing the I/O requests in fio. The median latency of iou+k is 8 msec, i.e., one to two orders of magnitude worse than that of the other two APIs. The median latency of iou+k does not vary with throughput, because it is determined by the interleaving dynamics described earlier, rather than by queueing effects as is the case for the other libraries. With two cores (one for fio and one for the kernel thread), the performance of iou+k recovers completely, being second only to that of SPDK: the maximum throughput of iou+k is 18% lower than SPDK’s, and the median latency of iou+k is, up to 200 KIOPS, equal to or within 10% of SPDK’s.

Second, SPDK delivers the best performance in every configuration. With just one core, SPDK achieves 305 KIOPS versus the 171 KIOPS and 145 KIOPS of the best io_uring alternative and libaio, respectively. With two cores, SPDK achieves 313 KIOPS, versus the 260 KIOPS and 150 KIOPS of iou+k and libaio, respectively. Moreover, SPDK is the only library capable of saturating the bandwidth of the drive, while all other approaches are CPU-bound. Part of this efficiency can be traced back to SPDK’s optimized software stack with zero system call overhead, zero-copy and polling-based I/O (see Figure 3). Despite embracing the same polling-based approach, iou+k cannot achieve the same performance as SPDK. iou+k, in fact, runs polling on two threads, the application one and the kernel one, both accessing the same shared variables and data structures, which incur overhead from atomic accesses, memory fences and cache invalidations. SPDK, instead, runs polling in the same thread as the application, allowing higher resource efficiency. As an example, on two cores and with a queue depth of 16 (where iou+p and SPDK have similar throughput), iou+p experiences a cache miss rate of 5% versus 0.6% for SPDK (cache-misses counter in perf).

Third, regardless of the number of cores, iou+p achieves performance that is comparable with SPDK for low to medium throughput values (up to ≈ 150 KIOPS). This result is explained by the fact that, at low queue depths, the system call overhead in iou+p is not yet so high as to be a bottleneck, and hence the polling implemented by iou+p is as effective as the polling implemented by SPDK. At higher queue depths, however, the system call overhead becomes the bottleneck for iou+p, leading to performance that is worse than SPDK’s. A similar dynamic can also be observed with iou and libaio. Up to a queue depth of 16, they achieve very similar throughput (79 KIOPS and 72 KIOPS, respectively) and median latency (185 𝜇sec and 190 𝜇sec, respectively). However, as the depth increases (> 16 in Figure 2a and Figure 2b), the higher CPU efficiency of iou, which incurs fewer system calls per I/O operation than libaio, helps to deliver better performance (182 KIOPS of peak throughput versus 151 KIOPS on two cores).

Interestingly, up to a queue depth of 16, iou incurs fewer system calls per I/O on average than iou+p, despite achieving worse latency than iou+p. This happens because, at low queue depths, there is a higher probability that after submitting all the I/O requests, fio has to wait for at least one I/O request to be completed. Then, the delay caused by the interrupt handler in iou allows for processing more completions at once, at the expense of latency. iou+p, instead, reaps a completed request as soon as it is available, thus potentially missing out on opportunities to batch. As the queue depth increases, both approaches converge to an average of one system call per I/O operation, and iou+p achieves a higher throughput by eschewing the interrupt handling overhead. We note that the results that we have reported differ from other experimental results reported online with a single physical core [16]. This discrepancy is due to the fact that such previous results have been obtained with more powerful SSDs and CPUs, an optimized benchmarking tool, and an experimental Linux kernel version [21].

Different CPU-to-drive ratios

In light of the results discussed so far, we studied the performance of the APIs using more than one drive. In particular, we aimed to observe how many CPUs per drive iou+k needs, in the general case, to obtain the best performance and avoid the performance degradation of the one-core scenario described earlier. We ran a test in which fio runs J = 5 jobs, each accessing a distinct drive, with a queue depth of 128, and we enabled a different number of cores C on the machine. We set C = J, J + 1, J + 2, and 2J. We used J = 5 because it corresponds to the largest setting such that all the experiments could run in a single NUMA domain (10 cores per domain). We remark that iou+k spawns one kernel poller thread per fio job. Figure 4 reports the results of our experiments.

The results indicate that iou+k is the only library that benefits significantly from higher CPU-to-drive ratios. The other libraries only marginally benefit from additional available CPUs. iou+k, in particular, needs twice as many CPUs as drives to achieve the highest throughput, indicating that each polling kernel thread needs a dedicated CPU to achieve the best performance. When iou+k can run with a dedicated core per kernel polling thread, it achieves throughput that is only ≈ 15% lower than SPDK, ≈ 45% higher than iou+p and iou, and ≈ 80% higher than libaio. These values suggest that iou+k can achieve remarkable performance without needing a complete rewrite of an application, as is the case with SPDK. iou+k, however, can be the worst-performing solution if the number of extra cores is not optimal. In our case with 5 jobs, allocating just two extra cores to iou+k leads to throughput that is ≈ 20% lower than libaio, and allocating no extra cores leads iou+k’s throughput to plummet to less than half the throughput of libaio.
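
One way to enforce the dedicated-core provisioning that iou+k needs is to pin each kernel poller to its own CPU at ring-setup time and keep the application threads off those CPUs. The fragment below is a sketch of that configuration; the CPU number and idle timeout are illustrative.

    /* Pinning the io_uring kernel poller thread to a dedicated core (sketch). */
    #include <liburing.h>
    #include <string.h>

    static int setup_sqpoll_ring(struct io_uring *ring, int poller_cpu)
    {
        struct io_uring_params p;
        memset(&p, 0, sizeof(p));

        p.flags = IORING_SETUP_SQPOLL          /* kernel thread polls the SQ (iou+k) */
                | IORING_SETUP_SQ_AFF;         /* bind that thread to sq_thread_cpu */
        p.sq_thread_cpu = poller_cpu;          /* one dedicated core per poller thread */
        p.sq_thread_idle = 2000;               /* ms before the poller sleeps (illustrative) */

        return io_uring_queue_init_params(128, ring, &p);
    }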

These results shed light on the inherent provisioning costs incurred by iou+k to achieve high performance, which were overlooked by previous experimental results that did not take into account the CPU-to-drive ratio when analyzing the performance achievable by iou+k [16].

Scalability

We now present the results obtained when running the different libraries on a number of drives varying from 1 to 20 to study their scalability. We configured fio to run J jobs, with J ranging from 1 to 20, each accessing a different drive. In light of the results presented so far, we ran the experiments with C = 2J cores (up to a maximum of C = 20, which is the number of physical cores available on our machine). We uniformly spread the drives and cores across the two NUMA domains of the machine when J > 1. We ran the tests with a queue depth of 128 to measure close to the peak throughput achievable by the APIs. Figure 5 reports the results of the experiments.

[Figure 5: Throughput scalability of libaio, the io_uring variants, and SPDK from 1 to 20 drives.]


The results showcase the implications for scalability of the dynamics that we have described in the previous sections. SPDK achieves the best performance across the board, and the second best performing library depends on the number of fio jobs executing and the number of cores available.

As long as J ≤ 10, iou+k can allocate a separate core to each kernel polling thread, achieving linear scalability and throughput that is between 9% and 16% lower than SPDK’s, between 27% and 45% higher than iou and iou+p (which perform very similarly), and between 50% and 76% higher than libaio. As soon as the number of jobs J is such that 2J > 20, however, the kernel polling threads and the application threads start to interleave their executions on the limited number of cores, leading to a gradual performance degradation of iou+k. In our setting, J = 12 is the point where the performance of iou+k crosses those of iou+p and iou. With J = 14, iou+k becomes the worst-performing library, with throughput that is 44% lower than SPDK’s, 18% lower than iou’s and iou+p’s, and even 5% lower than libaio’s. When J = 20, iou+k’s throughput is less than one third of that achieved by SPDK, and roughly half of that achieved by libaio and the other two iou variants.

In contrast, libaio, iou and iou+p maintain a rather steady scalability trend, and the latter two achieve near-identical performance, as already discussed in the previous sections. From J = 14 to J = 20, iou and iou+p are the second-best libraries, with throughput that is 33% lower than SPDK’s. For those cases, notably, libaio achieves throughput that is only 10% lower than iou+p and iou.

LESSONS AND FUTURE DIRECTIONS

Lesson 1: Not all polling methods are created equal. The unified user space polling of SPDK achieves the highest performance across all APIs, by eliminating data copies and system call overhead, but also by performing all I/O operations through a single thread context. iou+k also uses no system calls and minimizes data copies, but can suffer from catastrophic performance loss if not enough extra cores are available for the kernel poller threads (Figure 2). iou+p uses a system call-aided polling scheme and eschews the need for such extra cores. iou+p can achieve similar latencies as SPDK at low throughput (Figure 2b), but cannot match SPDK’s peak performance due to its higher system call overhead (Figure 3).

Lesson 2: io_uring can get close to SPDK. The performance and scalability of iou+k can be similar to SPDK’s, with the crucial caveat that more cores than drives must be available on the machine to efficiently support kernel space polling. Our results recommend using twice as many CPU cores as the number of drives (Figure 4). iou+p can achieve latencies similar to SPDK under low to medium load (Figure 2b), but ultimately it cannot match the throughput and scalability of SPDK (Figure 5). Finally, iou is consistently the worst-performing configuration of io_uring, suggesting that polling is one of the key ingredients to unleashing the full potential of io_uring.

Lesson 3: Performance scalability needs careful consideration. In our largest experiment (20 drives), SPDK outperforms the second best approach (iou+p) in throughput by as much as 50%. The price to pay for this higher performance is giving up out-of-the-box Linux file support, as well as writing application logic amenable to SPDK’s polling API. If support for a file system is necessary, which is the case for most applications, then iou+k can deliver roughly 90% of SPDK’s performance, but it utilizes twice as many cores (20 vs 10). For better performance scalability when not enough cores are available, developers can use iou+p, which can match SPDK’s performance at low to medium queue depths (Figure 2b).

Research directions. Our study has focused on the performance of the fio microbenchmark on raw block devices. An interesting research direction is assessing the implications of different storage APIs on the end-to-end performance of more realistic I/O-intensive applications, like databases. Such applications are often built on top of file systems, incur extra overhead (e.g., synchronization) that can mask I/O path bottlenecks, and use optimizations such as I/O batching. Another open research avenue is identifying more efficient application designs with iou+k, for example, by means of a better interleaving between the application and kernel poller threads, or by sharing kernel poller threads across application threads. Finally, we note that io_uring supports I/O over sockets as well, hence its performance should be studied also in the context of networked applications.

CONCLUSIONS

We present the first systematic study and comparison between SPDK, libaio and the emerging io_uring storage APIs on top of raw block devices. Our main findings are that polling and a low system call overhead are crucial to performance, and that io_uring can achieve performance that is close to SPDK’s, but obtaining io_uring’s best performance requires understanding its design and applying careful tuning.

ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their feedback. Special thanks to our shepherd, Geoff Kuenning, for his careful reading and his many insightful comments and suggestions, which greatly improved the paper. Animesh Trivedi is supported by the NWO grant number OCENW.XS3.030, Project Zero: Imagining a Brave CPU-free World!

REFERENCES
[1] Jens Axboe. Accessed: 2021-12-20. The Flexible I/O tester. https://fio.readthedocs.io/.
[2] Matias Bjørling, Abutalib Aghayev, Hans Holmberg, Aravind Ramesh, Damien Le Moal, Gregory R. Ganger, and George Amvrosiadis. 2021. ZNS: Avoiding the Block Interface Tax for Flash-based SSDs. In USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 689–703.
[3] Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. 2013. Linux Block IO: Introducing Multi-Queue SSD Access on Multi-Core Systems. In 6th International Systems and Storage Conference (SYSTOR 13). ACM, Article 22, 10 pages.
[4] Diego Didona, Nikolas Ioannou, Radu Stoica, and Kornilios Kourtis. 2020. Toward a Better Understanding and Evaluation of Tree Structures on Flash SSDs. Proc. VLDB Endow. 14, 3 (2020), 364–377.
[5] Rust docs. Accessed: 2021-12-20. Crate io_uring. https://docs.rs/iouring/latest/io_uring/.
[6] Daniel Ehrenberg. Accessed: 2021-12-20. The Asynchronous Input/Output (AIO) interface. https://github.com/littledan/linux-aio.
[7] Gabriel Haas, Michael Haubenschild, and Viktor Leis. 2020. Exploiting Directly-Attached NVMe Arrays in DBMS. In 10th Conference on Innovative Data Systems Research (CIDR 20). www.cidrdb.org, Online Proceedings. http://cidrdb.org/cidr2020/papers/p16-haas-cidr20.pdf
[8] John L. Hennessy and David A. Patterson. 2019. A New Golden Age for Computer Architecture. Commun. ACM 62, 2 (Jan 2019), 48–60.
[9] Michio Honda, Giuseppe Lettieri, Lars Eggert, and Douglas Santry. 2018. PASTE: A Network Programming Interface for Non-Volatile Main Memory. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, 17–33.
[10] Jaehyun Hwang, Qizhe Cai, Ao Tang, and Rachit Agarwal. 2020. TCP == RDMA: CPU-efficient Remote Storage Access with i10. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). USENIX Association, 127–140.
[11] Jaehyun Hwang, Midhul Vuppalapati, Simon Peter, and Rachit Agarwal. 2021. Rearchitecting Linux Storage Stack for Microsecond Latency and High Throughput. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 113–128.
[12] Intel®. Accessed: 2022-05-02. Intel® SSD DC P3600 400GB NVMe SSDs. https://ark.intel.com/content/www/us/en/ark/products/80997/intel-ssd-dc-p3600-series-400gb-2-5in-pcie-3-0-20nm-mlc.html.
[13] Intel®. Accessed: 2021-12-20. The Storage Performance Development Kit (SPDK). https://spdk.io/.
[14] Intel®. Accessed: 2022-04-26. SPDK In The News. https://spdk.io/news/.
[15] Jens Axboe. Accessed: 2021-12-20. Efficient IO with io_uring. https://kernel.dk/io_uring.pdf.
[16] Jens Axboe. Accessed: 2021-12-20. That’s it. 10M IOPS, one physical core. https://twitter.com/axboe/status/1452689372395053062.
[17] Jonathan Corbet. Accessed: 2021-12-20. Ringing in a new asynchronous I/O API. https://lwn.net/Articles/776703/.
[18] Jonathan Corbet. Accessed: 2021-12-20. The rapid growth of io_uring. https://lwn.net/Articles/810414/.
[19] Jeong-Uk Kang, Jeeseok Hyun, Hyunjoo Maeng, and Sangyeun Cho. 2014. The Multi-streamed Solid-State Drive. In 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 14). USENIX Association.
[20] Kornilios Kourtis, Nikolas Ioannou, and Ioannis Koltsidas. 2019. Reaping the performance of fast NVM storage with uDepot. In 17th USENIX Conference on File and Storage Technologies (FAST 19). USENIX Association, 1–15.
[21] Michael Larabel. Accessed: 2022-05-02. Axboe Achieves 8M IOPS Per Core With Newest Linux Optimization Patches. https://www.phoronix.com/scan.php?page=news_item&px=8M-IOPS-Per-Core-Linux.
[22] Gyusun Lee, Seokha Shin, Wonsuk Song, Tae Jun Ham, Jae W. Lee, and Jinkyu Jeong. 2019. Asynchronous I/O Stack: A Low-Latency Kernel I/O Stack for Ultra-Low Latency SSDs. In USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, 603–616.
[23] Jing Liu, Anthony Rebello, Yifan Dai, Chenhao Ye, Sudarsun Kannan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2021. Scale and Performance in a Filesystem Semi-Microkernel. In 28th Symposium on Operating Systems Principles (SOSP 21). ACM, 819–835.
[24] Linux Programmer’s Manual. Accessed: 2021-12-20. io_submit - submit asynchronous I/O blocks for processing. https://man7.org/linux/man-pages/man2/io_submit.2.html.
[25] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. 2014. SDF: Software-Defined Flash for Web-Scale Internet Storage Systems. SIGPLAN Not. 49, 4 (Feb 2014), 471–484.
[26] Anastasios Papagiannis, Giorgos Saloustros, Manolis Marazakis, and Angelos Bilas. 2017. Iris: An Optimized I/O Stack for Low-Latency Storage Devices. SIGOPS Oper. Syst. Rev. 50, 2 (Jan 2017), 3–11.
[27] PingCAP-Hackthon2019-Team17. Accessed: 2021-12-20. IO-uring speed the RocksDB & TiKV. http://openinx.github.io/ppt/io-uring.pdf.
[28] Ruslan Savchenko. 2021. Reading from External Memory. arXiv:cs.DC/2102.11198
[29] Livio Soares and Michael Stumm. 2010. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10). USENIX Association, 33–46.
[30] Amy Tai, Igor Smolyar, Michael Wei, and Dan Tsafrir. 2021. Optimizing Storage Performance with Calibrated Interrupts. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 129–145.
[31] Animesh Trivedi, Nikolas Ioannou, Bernard Metzler, Patrick Stuedi, Jonas Pfefferle, Kornilios Kourtis, Ioannis Koltsidas, and Thomas R. Gross. 2018. FlashNet: Flash/Network Stack Co-Design. ACM Trans. Storage 14, 4, Article 30 (Dec 2018), 29 pages.
[32] Animesh Trivedi, Patrick Stuedi, Bernard Metzler, Roman Pletka, Blake G. Fitch, and Thomas R. Gross. 2013. Unified High-Performance I/O: One Stack to Rule Them All. In 14th Workshop on Hot Topics in Operating Systems (HotOS 14). USENIX Association.
[33] Vishal Verma, John Kariuki. Accessed: 2021-12-20. Improved Storage Performance Using the New Linux Kernel I/O Interface. https://www.snia.org/educational-library/improved-storageperformance-using-new-linux-kernel-io-interface-2019.
[34] Wander Hillen. Accessed: 2021-12-20. Preliminary benchmarking results for a Haskell I/O manager backend based on io_uring. http://wjwh.eu/posts/2020-07-26-haskell-iouring-manager.html.
[35] Michael Wei, Matias Bjørling, Philippe Bonnet, and Steven Swanson. 2014. I/O Speculation for the Microsecond Era. In USENIX Annual Technical Conference (USENIX ATC 14). USENIX Association, 475–481.
[36] WiredTiger. Accessed: 2021-12-20. Implement asynchronous IO using io_uring API. https://jira.mongodb.org/browse/WT-6833.
[37] Shuai Xue, Shang Zhao, Quan Chen, Gang Deng, Zheng Liu, Jie Zhang, Zhuo Song, Tao Ma, Yong Yang, Yanbo Zhou, Keqiang Niu, Sijie Sun, and Minyi Guo. 2020. Spool: Reliable Virtualized NVMe Storage Pool in Public Cloud Infrastructure. In USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 97–110.
[38] Jisoo Yang, Dave B. Minturn, and Frank Hady. 2012. When Poll Is Better than Interrupt. In 10th USENIX Conference on File and Storage Technologies (FAST 12). USENIX Association.
[39] Young Jin Yu, Dong In Shin, Woong Shin, Nae Young Song, Jae Woo Choi, Hyeong Seog Kim, Hyeonsang Eom, and Heon Young Yeom. 2014. Optimizing the Block I/O Subsystem for Fast Storage Devices. ACM Trans. Comput. Syst. 32, 2, Article 6 (Jun 2014), 48 pages.
[40] Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut Taylan Kandemir, Nam Sung Kim, Jihong Kim, and Myoungsoo Jung. 2018. FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, 477–492.
[41] Xiantao Zhang, Xiao Zheng, Zhi Wang, Hang Yang, Yibin Shen, and Xin Long. 2020. High-density Multi-tenant Bare-metal Cloud. In 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 20). ACM, 483–495.

Notes: IBM is a trademark of International Business Machines Corporation, registered in many jurisdictions worldwide. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other products and service names might be trademarks of IBM or other companies.