IO调度器的总体目标是希望让磁头能够总是往一个方向移动,移动到底了再往反方向走,这恰恰就是现实生活中的电梯模型,所以IO调度器也被叫做电梯.(elevator)而相应的算法也就被叫做电梯算法.而Linux中IO调度的电梯算法有好几种,一个叫做as(Anticipatory),一个叫做cfq(Complete Fairness Queueing),一个叫做deadline,还有一个叫做noop(No Operation).具体使用哪种算法我们可以在启动的时候通过内核参数elevator来指定.
另一方面我们也可以单独的为某个设备指定它所采用的IO调度算法,这就通过修改在/sys/block/sda/queue/目录下面的scheduler文件.比如我们可以先看一下我的这块硬盘:
[root@localhost ~]# cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
可以看到我们这里采用的是cfq.
2.6内核的四种调度算法
In the 2.6 kernel series, there are four interchangeable schedulers, as follows:
cfq- “Completely Fair Queuing” makes a good default for most workloads on general-purpose servers.
as - “Anticipatory Scheduler” is best for workstations and other systems with slow, single-spindle storage.
deadline - “Deadline” is a relatively simple scheduler which tries to minimize I/O latency by re-ordering requests to improve performance.
noop- “NOOP” is the most simple scheduler of all, and is really just a single FIFO queue.
修改默认的IO调度算法
There are two ways to change the I/O scheduler - at boot time, or with new kernels at runtime. For all Linux kernels, appending ‘elevator={noop|deadline}’ to the kernel boot string sets the I/O elevator.
With LILO, you can use the ‘append’ keyword:
image=/boot/vmlinuz-2.6.14.2
label=14.2
append=”elevator=deadline”
read-only
optional
With GRUB, append the string to the end of the kernel command:
title Fedora Core (2.6.9-5.0.3.EL_lustre.1.4.2custom)
root (hd0,0)
kernel /vmlinuz-2.6.9-5.0.3.EL_lustre.1.4.2custom ro
root=/dev/VolGroup00/LogVol00 rhgb noapic quiet elevator=deadline
With newer Linux kernels (Red Hat Enterprise Linux v3 Update 3 does not have this feature. It is present in the main Linux tree as of 2.6.15), one can change the scheduler while running. If the file /sys/block/<DEVICE>/queue/scheduler exists (where DEVICE is the block device you wish to affect), it will contain a list of available schedulers and can be used to switch the schedulers.
(hda is the <disk>):
[root@cfs2]# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline cfq
[root@cfs2 ~]# echo deadline > /sys/block/hda/queue/scheduler
[root@cfs2 ~]# cat /sys/block/hda/queue/scheduler
noop anticipatory [deadline] cfq
The other schedulers (anticipatory and cfq) are better suited for desktop use.
查看当前系统支持的IO调度算法
dmesg | grep -i scheduler
cat /sys/block/sda/queue/scheduler ##查看系统使用的IO调度机制
noop anticipatory deadline [cfq]
其中用[ ]起来的,表示目前系统正使用的IO调弃机制,当前系统支持noop anticipatory deadline 和cfq
echo cfq >/sys/block/sda/queue/scheduler 来让更改及时生效
在2.6的内核中共计有4种IO调度机制:
1.Deadline scheduler Deadline scheduler 用 deadline 算法保证对于既定的 IO 请求以最小的延迟时间,从这一点理解,对于 DSS 应用应该会是很适合的。
2.Anticipatory scheduler(as) 曾经一度是 Linux 2.6 Kernel 的 IO scheduler 。Anticipatory 的中文含义是”预料的, 预想的”, 这个词的确揭示了这个算法的特点,简单的说,有个 IO 发生的时候,如果又有进程请求 IO 操作,则将产生一个默认的 6 毫秒猜测时间,猜测下一个 进程请求 IO 是要干什么的。这对于随即读取会造成比较大的延时,对数据库应用很糟糕,而对于 Web Server 等则会表现的不错。这个算法也可以简单理解为面向低速磁盘的,因为那个”猜测”实际上的目的是为了减少磁头移动时间。
3.Completely Fair Queuing 虽然这世界上没有完全公平的事情,但是并不妨碍开源爱好者们设计一个完全公平的 IO 调度算法。Completely Fair Queuing (cfq, 完全公平队列) 在 2.6.18 取代了 Anticipatory scheduler 成为 Linux Kernel 默认的 IO scheduler 。cfq 对每个进程维护一个 IO 队列,各个进程发来的 IO 请求会被 cfq 以轮循方式处理。也就是对每一个 IO 请求都是公平的。这使得 cfq 很适合离散读的应用(eg: OLTP DB)。我所知道的企业级 Linux 发行版中,SuSE Linux 好像是最先默认用 cfq 的.
4.NOOP Noop 对于 IO 不那么操心,对所有的 IO请求都用 FIFO 队列形式处理,默认认为 IO 不会存在性能问题。这也使得 CPU 也不用那么操心。当然,对于复杂一点的应用类型,使用这个调度器,用户自己就会非常操心。

The Linux kernel, the core of the operating system, is responsible for controlling disk access by using kernel I/O scheduling. Red Hat Enterprise Linux 3 with a 2.4 kernel base uses a single, robust, general purpose I/O elevator. The 2.4 I/O scheduler has a reasonable number of tuning options by controlling the amount of time a request remains in an I/O queue before being serviced using the elvtunecommand. While Red Hat Enterprise Linux 3 offers most workloads excellent performance, it does not always provide the best I/O characteristics for the wide range of applications in use by Linux users these days. The I/O schedulers provided in Red Hat Enterprise Linux 4, embedded in the 2.6 kernel, have advanced the I/O capabilities of Linux significantly. With Red Hat Enterprise Linux 4, applications can now optimize the kernel I/O at boot time, by selecting one of four different I/O schedulers to accommodate different I/O usage patterns:

  • Completely Fair Queuing—elevator=cfq (default)
  • Deadline—elevator=deadline
  • NOOP—elevator=noop
  • Anticipatory—elevator=as

Add the elevator options from Table 1 to your kernel command in the GRUB boot loader configuration file (/boot/grub/grub.conf) or the eLILO command line. Red Hat Enterprise Linux 4 has all four elevators built-in; no need to rebuild your kernel.

The 2.6 kernel incorporates the best I/O algorithms that developers and researchers have shared with the open-source community as of mid-2004. These schedulers have been available in Fedora Core 3 and will continue to be used in Fedora Core 4. There have been several good characterization papers on using evaluating Linux 2.6 I/O schedulers. A few are referenced at the end of this article. This article details our own study based on running Oracle 10G in both OLTP and DSS workloads with EXT3 file systems.

Red Hat Enterprise Linux 4 I/O schedulers

Included in Red Hat Enterprise Linux 4 are four custom configured schedulers from which to choose. They each offer a different combination of optimizations.

The Completely Fair Queuing (CFQ) scheduler is the default algorthim in Red Hat Enterprise Linux 4. As the name implies, CFQ maintains a scalable per-process I/O queue and attempts to distribute the available I/O bandwidth equally among all I/O requests. CFQ is well suited for mid-to-large multi-processor systems and for systems which require balanced I/O performance over multiple LUNs and I/O controllers.

The Deadline elevator uses a deadline algorithm to minimize I/O latency for a given I/O request. The scheduler provides near real-time behavior and uses a round robin policy to attempt to be fair among multiple I/O requests and to avoid process starvation. Using five I/O queues, this scheduler will aggressively re-order requests to improve I/O performance.

The NOOP scheduler is a simple FIFO queue and uses the minimal amount of CPU/instructions per I/O to accomplish the basic merging and sorting functionality to complete the I/O. It assumes performance of the I/O has been or will be optimized at the block device (memory-disk) or with an intelligent HBA or externally attached controller.

The Anticipatory elevator introduces a controlled delay before dispatching the I/O to attempt to aggregate and/or re-order requests improving locality and reducing disk seek operations. This algorithm is intended to optimize systems with small or slow disk subsystems. One artifact of using the AS scheduler can be higher I/O latency.

Choosing an I/O elevator

The definitions above may give enough information to make a choice for your I/O scheduler. The other extreme is to actually test and tune your workload on each I/O scheduler by simply rebooting your system and measuring your exact environment. We have done just that for Red Hat Enterprise Linux 3 and all four Red Hat Enterprise Linux 4 I/O schedulers using an Oracle 10G I/O workloads.

Figure 1 shows the results of running an Oracle 10G OLTP workload running on a 2-CPU/2-HT Xeon with 4 GB of memory across 8 LUNs on an LSIlogic megraraid controller. The OLTP load ran mostly 4k random I/O with a 50% read/write ratio. The DSS workload consists of 100% sequential read queries using large 32k-256k byte transfer sizes.

修改LINUX <wbr>IO调度算法 <wbr>提高数据库吞吐
Figure 1. Red Hat Enterprise Linux 4 IO schedulers vs. Red Hat Enterprise Linux 3 for database Oracle 10G oltp/dss (relative performance)

The CFQ scheduler was chosen as the default since it offers the highest performance for the widest range of applications and I/O system designs. We have seen CFQ excel in both throughput and latency on multi-processor systems with up to 16-CPUs and for systems with 2 to 64 LUNs for both UltraSCSI and Fiber Channel disk farms. In addition, CFQ is easy to tune by adjusting the nr_requests parameter in /proc/sys/scsi subsystem to match the capabilities of any given I/O subsystem.

The Deadline scheduler excelled at attempting to reduce the latency of any given single I/O for real-time like environments. A problem which depends on an even balance of transactions across multiple HBA, drives or multiple file systems may not always do best with the Deadline scheduler. The Oracle 10G OLTP load using 10 simultaneous users spread over eight LUNs showed improvement using Deadline relative to Red Hat Enterprise Linux 3's I/O elevator, but was still 12.5% lower than CFQ.

The NOOP scheduler indeed freed up CPU cycles but performed 23% fewer transactions per minute when using the same number of clients driving the Oracle 10G database. The reduction in CPU cycles was proportional to the drop in performance, so perhaps this scheduler may work well for systems which drive their databases into CPU saturation. But CFQ or Deadline yield better throughput for the same client load than the NOOP scheduler.

The AS scheduler excels on small systems which have limited I/O configurations and have only one or two LUNs. By design, the AS scheduler is a nice choice for client and workstation machines where interactive response time is a higher priority than I/O latency.

Summary: Have it your way!

The short summary of our study indicates that there is no SINGLE answer to which I/O scheduler is best. The good news is that with Red Hat Enterprise Linux 4 an end-user can customize their scheduler with a simple boot option. Our data suggests the default Red Hat Enterprise Linux 4 I/O scheduler, CFQ, provides the most scalable algorithm for the widest range of systems, configurations, and commercial database users. However, we have also measured other workloads whereby the Deadline scheduler out-performed CFQ for large sequential read-mostly DSS queries. Other studies referenced in the section "References" explored using the AS scheduler to help interactive response times. In addition, noop has proven to free up CPU cycles and provide adequate I/O performance for systems with intelligent I/O controller which provide their own I/O ordering capabilities.

In conclusion, we recommend baselining an application with the default CFQ. Use this article and its references to match your application to one of the studies. Then adjust the I/O scheduler via the simple command line re-boot option if seeking additional performance. Make only one change at a time, and use performance tools to validate the results.

References

Axboe, J., "Deadline I/O Scheduler Tunables, SuSE, EDF R&D, 2003.

Braswell, B., Ciliendo, E., "Tuning Red Hat Enterprise Linux on IBMeServer xSeries Servers", ibm.com/redbooks.

Corbet, J., "The Continuing Development of I/O Scheduling", http://lwn.net/Articles/21274.

Heger, D., Pratt, S., "Workload Dependent Performance Evaluation of the Linux 2.6 I/O Schedulers", Linux Symposium, Ottawa, Canada, July 2004.

Likins, Adrian. "System Tuning Info for Linux Servers", http://people.redhat.com/alikins/system_tuning.html

About the author

D. John Shakshober is a Consulting Engineer for Red Hat in Westford, MA focusing on kernel and benchmark performance. Prior to Red Hat, John was Technical Director of Performance Engineering at HP, Compaq, and Digital, working on Linux and Tru64 Unix benchmark performance engineering in Nashua NH. He has an M.S. in Electrical Engineering from Cornell University and a B.S. in Computer Engineering from Rochester Institute of Technology.