Case Study: Load Average and CPU Utilization

Original post by wx5bcd2f496a1cf, 2022-08-17 01:42:30. Category: Linux Performance Optimization

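© Copyright belongs to the author. This is an original work by 51CTO blog author wx5bcd2f496a1cf; contact the author for permission to repost, otherwise legal liability will be pursued.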


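Load Average vs. CPU Utilization

In day-to-day work, load average and CPU utilization are easy to confuse, so I want to draw a clear line between them here.

You may wonder: if load average counts active processes, doesn't a high load average simply mean high CPU utilization?

To answer that, go back to what load average means: the average number of processes, per unit of time, that are in the runnable or uninterruptible state. It therefore covers not only processes that are using a CPU, but also processes waiting for a CPU and processes waiting for I/O. CPU utilization, by contrast, measures how busy the CPUs were per unit of time, and it does not necessarily track load average. For example: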

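• CPU-bound processes: heavy CPU use raises the load average, and in this case the two metrics agree.

• I/O-bound processes: waiting for I/O also raises the load average, but CPU utilization is not necessarily high.

• Many processes waiting to be scheduled on a CPU: this raises the load average too, and CPU utilization is also fairly high.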
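The definition above can be checked against a live system. The following is a minimal sketch, assuming a procps-style ps where `ps -eo state=` prints one state letter per process; count_load_states is a hypothetical helper name, not part of any tool. It counts processes in the R (runnable) and D (uninterruptible) states, the two states that contribute to load average.

```shell
# count_load_states: read one process state per line on stdin (as printed by
# `ps -eo state=`) and count runnable (R) and uninterruptible (D) entries.
# Hypothetical helper for illustration only.
count_load_states() {
    awk '
        /^R/ { r++ }
        /^D/ { d++ }
        END  { printf "R=%d D=%d\n", r, d }
    '
}

# Live usage (the ps process itself usually shows up as one R entry):
#   ps -eo state= | count_load_states

# Demo on fixed sample input:
printf 'R\nS\nD\nS\nR\n' | count_load_states   # prints R=2 D=1
```

This is only an instantaneous snapshot, whereas load average is averaged over time, so the two numbers will rarely match exactly.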

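Load Average Case Studies

Below, three examples cover these three situations in turn, using tools such as iostat, mpstat, and pidstat to find the root cause of a rising load average.

The examples were run on CentOS 7.4, but they apply equally to other Linux distributions. My test environment is as follows:

• Machine: 2 CPUs, 1 GB of memory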

# 2 physical CPUs
[root@localhost ~]# cat /proc/cpuinfo | grep 'physical id'
physical id : 0
physical id : 2

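• Install the stress and sysstat packages beforehand. On CentOS, use yum as shown here; on Debian or Ubuntu the equivalent is apt install stress sysstat.

  yum install epel-release -y
  yum install stress -y
  yum install sysstat -y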

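First, a brief introduction to stress and sysstat.

stress is a Linux stress-testing tool. Here it plays the role of a misbehaving process that drives the load average up.

sysstat is a collection of common Linux performance tools for monitoring and analyzing system performance. The examples use two commands from this package, mpstat and pidstat.

mpstat is a common multi-core CPU analysis tool. It shows per-CPU performance metrics in real time, as well as the average across all CPUs.
pidstat is a common per-process analysis tool. It shows per-process metrics in real time, including CPU, memory, I/O, and context switches.

Each scenario also requires three terminals logged in to the same Linux machine.

Once all of that is in place, use uptime to check the load average before the test: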

[root@localhost ~]# uptime
 06:47:44 up 13 min,  1 user,  load average: 0.00, 0.01, 0.02
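If you want to record the baseline rather than read it by eye, the three numbers can be pulled out of the uptime line. This is a small sketch, assuming the default English (C locale) output format shown above; parse_load_averages is a hypothetical helper name.

```shell
# parse_load_averages: read `uptime` output on stdin and print the 1-, 5- and
# 15-minute load averages separated by spaces. Hypothetical helper; assumes
# the "load average: a, b, c" format with "." as the decimal separator.
parse_load_averages() {
    sed -n 's/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}

# Live usage:
#   uptime | parse_load_averages

# Demo on the line captured before the test:
echo ' 06:47:44 up 13 min,  1 user,  load average: 0.00, 0.01, 0.02' | parse_load_averages
# prints 0.00 0.01 0.02
```

On Linux the same three values are also the first three fields of /proc/loadavg.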

Scenario 1: A CPU-Bound Process

First, run stress in the first terminal to simulate 100% utilization of one CPU:

[root@localhost ~]# stress --cpu 1 --timeout 600
stress: info: [1209] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

Next, run uptime in the second terminal to watch the load average change:

# -d highlights the parts of the output that change between updates
[root@localhost ~]# watch -d uptime
Every 2.0s: uptime                                                                                                     Thu Jun 25 06:57:24 2020

 06:57:24 up 23 min,  3 users,  load average: 0.95, 0.63, 0.30

Finally, run mpstat in the third terminal to watch CPU utilization:

[root@localhost ~]# mpstat  -P ALL 1  2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)   06/25/2020  _x86_64_  (2 CPU)

06:56:22 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:56:23 AM  all   49.75    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   50.25
06:56:23 AM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:56:23 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

06:56:23 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:56:24 AM  all   49.49    0.00    0.51    0.00    0.00    0.00    0.00    0.00    0.00   50.00
06:56:24 AM    0   43.43    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   56.57
06:56:24 AM    1   55.56    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   43.43

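CPU  Processor number. The keyword all means the statistics are averaged across all processors.
%usr  Percentage of CPU utilization while executing at the user level (applications).
%nice  Percentage of CPU utilization while executing at the user level with nice priority.
%sys  Percentage of CPU utilization while executing at the system (kernel) level. Note that this does not include time spent servicing hardware and software interrupts.
%iowait  Percentage of the sampling interval during which the CPU was idle while the system had an outstanding disk I/O request (iowait / total * 100).
%irq  Percentage of time the CPU or CPUs spent servicing hardware interrupts.
%soft  Percentage of time the CPU or CPUs spent servicing software interrupts.
%steal  Percentage of time the virtual CPU or CPUs spent in involuntary wait while the hypervisor was servicing another virtual processor.
%guest  Percentage of time the CPU or CPUs spent running a virtual processor.
%gnice  Percentage of time the CPU or CPUs spent running a niced guest.
%idle  Percentage of time the CPU or CPUs were idle and the system had no outstanding disk I/O request.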

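Terminal 2 shows the 1-minute load average gradually climbing toward 1.00, while terminal 3 shows exactly one CPU at 100% utilization with an iowait of 0. This indicates that the load average is rising because of the 100% CPU utilization.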
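Spotting the saturated CPU by eye works on 2 CPUs but gets tedious on larger machines. The sketch below, assuming mpstat -P ALL output in the layout shown above, prints the CPU with the lowest %idle for each sample. busiest_cpu is a hypothetical helper name; the column positions are read from the header line rather than hard-coded.

```shell
# busiest_cpu: read `mpstat -P ALL` output on stdin and print, for each
# sample, the CPU with the lowest %idle value. Hypothetical helper.
busiest_cpu() {
    awk '
        function report() {
            if (best != "") printf "cpu=%s busy=%.2f%%\n", best, 100 - low
            best = ""
        }
        /%idle/ {                           # header line: a new sample starts
            report()
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")   c = i
                if ($i == "%idle") d = i
            }
            next
        }
        c && NF >= d && $c ~ /^[0-9]+$/ {   # per-CPU row; the "all" row is skipped
            if (best == "" || $d + 0 < low) { best = $c; low = $d + 0 }
        }
        END { report() }
    '
}

# Live usage:
#   mpstat -P ALL 1 2 | busiest_cpu

# Demo on the first sample captured above:
busiest_cpu <<'EOF'
06:56:22 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:56:23 AM  all   49.75    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   50.25
06:56:23 AM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:56:23 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
EOF
# prints cpu=0 busy=100.00%
```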

So which process is driving the CPU to 100%? pidstat can answer that:

[root@localhost ~]# pidstat 1 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)   06/25/2020  _x86_64_  (2 CPU)

06:59:29 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
06:59:30 AM     0      1210   98.02    0.00    0.00   98.02     0  stress

06:59:30 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
06:59:31 AM     0      1210  100.00    0.00    0.00  100.00     1  stress
06:59:31 AM     0      1678    0.00    0.99    0.00    0.99     0  pidstat

Average:      UID       PID    %usr %system  %guest    %CPU   CPU  Command
Average:        0      1210   99.01    0.00    0.00   99.01     -  stress
Average:        0      1678    0.00    0.50    0.00    0.50     -  pidstat

The output clearly shows the stress process using close to 100% of a CPU.
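On a busy host the pidstat report can run to many lines. As a rough sketch, assuming the summary rows start with "Average:" as in the output above, the following picks the process with the highest average %CPU. top_cpu_process is a hypothetical helper name, and it reports only the first word of the command.

```shell
# top_cpu_process: read `pidstat` output on stdin and print the PID, %CPU and
# command of the busiest process in the "Average:" summary. Hypothetical helper.
top_cpu_process() {
    awk '
        $1 == "Average:" && $2 == "UID" {      # header of the summary section
            for (i = 1; i <= NF; i++) {
                if ($i == "PID")     p = i
                if ($i == "%CPU")    u = i
                if ($i == "Command") m = i
            }
            next
        }
        $1 == "Average:" && p {                # one summary row per process
            if ($u + 0 > max) { max = $u + 0; pid = $p; cmd = $m }
        }
        END { if (pid != "") printf "pid=%s cpu=%.2f%% cmd=%s\n", pid, max, cmd }
    '
}

# Live usage:
#   pidstat 1 2 | top_cpu_process

# Demo on the summary captured above:
top_cpu_process <<'EOF'
Average:      UID       PID    %usr %system  %guest    %CPU   CPU  Command
Average:        0      1210   99.01    0.00    0.00   99.01     -  stress
Average:        0      1678    0.00    0.50    0.00    0.50     -  pidstat
EOF
# prints pid=1210 cpu=99.01% cmd=stress
```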

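Scenario 2: An I/O-Bound Process

Again start with stress, but this time simulate I/O pressure by calling sync() in a loop:

# -i, --io N: spawn N workers, each calling sync() in a loop; sync() flushes data buffered in memory to disk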
[root@localhost ~]# stress --io  1 --timeout 600
stress: info: [1935] dispatching hogs: 0 cpu, 1 io, 0 vm, 0 hdd

In the second terminal, again run uptime to watch the load average:

[root@localhost ~]# watch -d uptime
Every 2.0s: uptime                                                                                                                                                                             Thu Jun 25 07:06:58 2020
 07:06:58 up 32 min,  3 users,  load average: 1.03, 0.75, 0.50

Then run mpstat in the third terminal to watch CPU utilization:

# Show metrics for all CPUs, printing one report every 5 seconds, 2 reports in total
[root@localhost ~]# mpstat -P ALL 5 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)   06/25/2020  _x86_64_  (2 CPU)

13:41:28  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:41:33  all    0.21    0.00   12.07   32.67    0.00    0.21    0.00    0.00    0.00   54.84
13:41:33    0    0.43    0.00   23.87   67.53    0.00    0.43    0.00    0.00    0.00    7.74
13:41:33    1    0.00    0.00    0.81    0.20    0.00    0.00    0.00    0.00    0.00   98.99

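Here the 1-minute load average gradually climbs to 1.03. On one CPU the system CPU utilization (%sys) rises to 23.87%, while iowait reaches 67.53%. This indicates that the load average is rising because of iowait.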
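The reasoning used in the first two scenarios (compare %usr plus %sys against %iowait) can be written down as a tiny rule of thumb. This is only a sketch: classify_cpu is a hypothetical helper, and the 50% cut-off is an arbitrary value chosen for illustration, not a standard threshold.

```shell
# classify_cpu USR SYS IOWAIT: print a rough label for one mpstat row.
# Hypothetical helper; the 50% cut-off is an arbitrary illustration value.
classify_cpu() {
    awk -v u="$1" -v s="$2" -v w="$3" 'BEGIN {
        if (w >= 50)          print "io-wait"
        else if (u + s >= 50) print "cpu-bound"
        else                  print "mostly-idle"
    }'
}

classify_cpu 0.43 23.87 67.53    # CPU 0 in scenario 2, prints io-wait
classify_cpu 100.00 0.00 0.00    # CPU 0 in scenario 1, prints cpu-bound
```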

So which process is causing such high iowait? Use pidstat again:

[root@localhost ~]# pidstat 2 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)   06/25/2020  _x86_64_  (2 CPU)

07:20:43 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:20:45 AM     0      2838  100.00    0.00    0.00  100.00     0  stress
07:20:45 AM     0      2991    0.00    0.50    0.00    0.50     1  pidstat

07:20:45 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:20:47 AM     0       409    0.00    0.50    0.00    0.50     0  xfsaild/dm-0
07:20:47 AM     0      1099    0.00    0.50    0.00    0.50     0  sshd
07:20:47 AM     0      2019    0.50    0.00    0.00    0.50     0  watch
07:20:47 AM     0      2838   98.50    0.00    0.00   98.50     1  stress
07:20:47 AM     0      2951    0.00    0.50    0.00    0.50     0  kworker/0:0

Once again, the stress process is responsible.

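Scenario 3: Many Processes

When more processes are runnable than the CPUs can handle, some of them end up waiting for a CPU. To show this, use stress again, this time with 10 processes: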

[root@localhost ~]# stress -c  10 --timeout 600
stress: info: [3356] dispatching hogs: 10 cpu, 0 io, 0 vm, 0 hdd

The system has only 2 CPUs, far fewer than the 10 processes, so the CPUs are severely overloaded and the load average reaches 9.78:

[root@localhost ~]# watch -d uptime

Every 2.0s: uptime                                                                                                                                                                             Thu Jun 25 07:30:02 2020

 07:30:02 up 55 min,  3 users,  load average: 9.78, 5.52, 2.91
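To put 9.78 in perspective, divide it by the number of CPUs. A minimal sketch follows; load_per_cpu is a hypothetical helper name, and reading a result above 1.00 as "more active tasks than CPUs" is a rule of thumb rather than a hard limit.

```shell
# load_per_cpu LOAD NCPU: print LOAD divided by NCPU with two decimals.
# Hypothetical helper for illustration only.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# Live usage (first field of /proc/loadavg is the 1-minute load average,
# nproc prints the number of available CPUs):
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"

load_per_cpu 9.78 2    # this scenario, prints 4.89
load_per_cpu 1.00 2    # scenario 1, prints 0.50
```

At 4.89 per CPU, each CPU has on average almost five tasks competing for it.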

Then run pidstat to look at the processes:

[root@localhost ~]# pidstat  2 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)   06/25/2020  _x86_64_  (2 CPU)

07:32:06 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:32:08 AM     0      3357   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3358   19.40    0.50    0.00   19.90     0  stress
07:32:08 AM     0      3359   19.90    0.00    0.00   19.90     0  stress
07:32:08 AM     0      3360   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3361   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3362   19.90    0.00    0.00   19.90     0  stress
07:32:08 AM     0      3363   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3364   19.90    0.00    0.00   19.90     0  stress
07:32:08 AM     0      3365   19.90    0.00    0.00   19.90     0  stress
07:32:08 AM     0      3366   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3671    0.00    0.50    0.00    0.50     1  pidstat

07:32:08 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:32:10 AM     0         9    0.00    0.50    0.00    0.50     1  rcu_sched
07:32:10 AM     0      3357   19.90    0.00    0.00   19.90     1  stress
07:32:10 AM     0      3358   20.40    0.00    0.00   20.40     0  stress
07:32:10 AM     0      3359   19.90    0.00    0.00   19.90     0  stress
07:32:10 AM     0      3360   19.90    0.00    0.00   19.90     1  stress
07:32:10 AM     0      3361   19.90    0.00    0.00   19.90     1  stress
07:32:10 AM     0      3362   19.90    0.00    0.00   19.90     0  stress
07:32:10 AM     0      3363   19.90    0.00    0.00   19.90     1  stress
07:32:10 AM     0      3364   19.90    0.00    0.00   19.90     0  stress
07:32:10 AM     0      3365   19.40    0.00    0.00   19.40     0  stress
07:32:10 AM     0      3366   19.40    0.00    0.00   19.40     1  stress
07:32:10 AM     0      3548    0.50    0.00    0.00    0.50     0  watch

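Field reference:
UID: user ID of the process owner
PID: process ID
%usr: percentage of CPU the process used in user space
%system: percentage of CPU the process used in kernel space
%guest: percentage of CPU the process spent running a virtual machine (virtual processor)
%CPU: total percentage of CPU used by the process
CPU: number of the CPU the process was running on
Command: command name of the process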
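In the pidstat output for scenario 3, each of the 10 stress processes gets roughly 20% of a CPU, which is what you would expect when 10 processes share 2 CPUs. The sketch below adds up the %CPU column for one command to make that visible. sum_cpu_by_command is a hypothetical helper name; it skips the "Average:" rows so that a single sample is not counted twice.

```shell
# sum_cpu_by_command NAME: read `pidstat` output on stdin and print how many
# rows match command NAME and their combined %CPU. Hypothetical helper; feed
# it a single sample, for example `pidstat 2 1`.
sum_cpu_by_command() {
    awk -v want="$1" '
        $1 == "Average:" { next }              # ignore the summary section
        /%CPU/ {                               # header line: locate the columns
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU")    u = i
                if ($i == "Command") m = i
            }
            next
        }
        u && $m == want { total += $u; n++ }
        END { printf "%d processes, %.2f%% total\n", n, total }
    '
}

# Live usage:
#   pidstat 2 1 | sum_cpu_by_command stress

# Demo on the first sample captured in scenario 3:
sum_cpu_by_command stress <<'EOF'
07:32:06 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:32:08 AM     0      3357   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3358   19.40    0.50    0.00   19.90     0  stress
07:32:08 AM     0      3359   19.90    0.00    0.00   19.90     0  stress
07:32:08 AM     0      3360   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3361   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3362   19.90    0.00    0.00   19.90     0  stress
07:32:08 AM     0      3363   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3364   19.90    0.00    0.00   19.90     0  stress
07:32:08 AM     0      3365   19.90    0.00    0.00   19.90     0  stress
07:32:08 AM     0      3366   19.90    0.00    0.00   19.90     1  stress
07:32:08 AM     0      3671    0.00    0.50    0.00    0.50     1  pidstat
EOF
# prints 10 processes, 199.00% total
```

The total of about 199% matches two fully busy CPUs, so the processes are competing for CPU time rather than waiting on I/O.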