Linux System Performance Analysis: CPU Load Average

Original post by 入九天, 2020-06-21 12:25:08

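© Copyright belongs to the author. This is an original work by 入九天 on the 51CTO blog; contact the author for permission before reposting, otherwise legal action may be taken.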

Note: whether or not this solves your problem, feel free to get in touch so we can learn from each other. QQ: 931389630

1. Load average

The system load average is the average number of processes that are in either a runnable or an uninterruptible state.

A process in a runnable state is either using the CPU or waiting to use the CPU.

A process in an uninterruptible state is waiting for some I/O access, e.g. waiting for disk.

The averages are taken over three time intervals (1, 5, and 15 minutes). Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single-CPU system is loaded all the time, while on a 4-CPU system it means the system was idle 75% of the time.
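Because the load average is not normalized, it can help to divide it by the CPU count yourself. The following is a minimal sketch of that arithmetic; the helper name load_per_cpu is made up for this example and is not part of any standard tool.

```shell
# load_per_cpu LOAD NCPU
# Divide a load average by the number of CPUs to get the load per CPU.
# (Hypothetical helper for illustration; NCPU must be greater than 0.)
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# On a live Linux system the 1-minute average is the first field of
# /proc/loadavg, so a call could look like this:
#   load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

With a load average of 1.00, this prints 1.00 for a single CPU and 0.25 for 4 CPUs, which is the "idle 75% of the time" case described above.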

Ideally, the load average equals the number of CPUs.

To check the number of CPUs (logical processors), either of these works:

grep -c "model name" /proc/cpuinfo

grep -c "^processor" /proc/cpuinfo

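In real production environments, in my view, once the load average goes above 70% of the number of CPUs it is time to investigate why the load is high. When the load gets too high, processes respond more slowly, which in turn affects the normal functioning of the service. The 70% figure is not absolute, though. The approach I recommend most is to monitor the system's load average and judge the trend from a longer history of data. When the load shows a clear upward trend, for example when it doubles, that is when you should start analyzing and investigating.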
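The 70% rule of thumb can be written as a small check. This is only a sketch: the function name load_over_threshold and its optional third argument are assumptions made for the example, and as noted above the threshold itself is not absolute.

```shell
# load_over_threshold LOAD NCPU [PERCENT]
# Exit status 0 when LOAD is above PERCENT% of NCPU, 1 otherwise.
# PERCENT defaults to 70, the rule of thumb from the text.
# (Hypothetical helper for illustration.)
load_over_threshold() {
    awk -v load="$1" -v ncpu="$2" -v pct="${3:-70}" \
        'BEGIN { exit !(load > ncpu * pct / 100) }'
}

# Possible use on a live Linux system:
#   if load_over_threshold "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"; then
#       echo "load average is above 70% of the CPU count"
#   fi
```

On a 4-CPU machine the cut-off is 2.8, so a load of 3.00 triggers the check and 2.50 does not. For trend-based alerting, the same comparison would be run against historical values rather than a fixed percentage.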

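Install stress and sysstat on the machine beforehand (sysstat provides mpstat and pidstat).

Scenario 1: CPU-intensive process

In the first terminal, use stress to simulate one CPU running at 100% utilization: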

stress --cpu 1 --timeout 60

Then, in a second terminal, watch how the load average changes with uptime (the -d option of watch highlights what changed between updates):

watch -d uptime

Next, in a third terminal, use mpstat to watch CPU utilization (-P ALL reports on all CPUs; the trailing 2 3 means one report every 2 seconds, 3 reports in total):

root@andy:~# mpstat -P ALL 2 3
Linux 4.18.0-12-generic (andy)  2020-05-17  _x86_64_  (4 CPU)
...
14:51:50  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:51:52  all   24.97    0.00    0.12    0.00    0.00    0.12    0.00    0.00    0.00   74.78
14:51:52    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
14:51:52    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
14:51:52    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
14:51:52    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
...
Average:  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:  all   25.05    0.00    0.29    0.00    0.00    0.25    0.00    0.00    0.00   74.41
Average:    0    0.00    0.00    0.34    0.00    0.00    0.34    0.00    0.00    0.00   99.33
Average:    1    0.00    0.00    0.17    0.00    0.00    0.50    0.00    0.00    0.00   99.33
Average:    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:    3    0.00    0.00    0.33    0.00    0.00    0.17    0.00    0.00    0.00   99.50

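In terminal 2 you can see the 1-minute load average slowly climb to 1.00. In terminal 3 you can see that one CPU is at 100% utilization, with %idle at 0 and %iowait at 0. This shows that the load average rose because a CPU is 100% busy. Next, use pidstat to find out which process is responsible: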

root@andy:~# pidstat -u 2 3
Linux 4.18.0-12-generic (andy)  2020-05-17  _x86_64_  (4 CPU)
...
15:15:50  UID    PID    %usr %system  %guest   %wait    %CPU  CPU  Command
15:15:52    0   2890    0.00    0.50    0.00    0.00    0.50    3  sshd
15:15:52    0   4730    0.00    0.50    0.00    0.00    0.50    1  kworker/1:1-mm_percpu_wq
15:15:52    0   6489   99.50    0.00    0.00    0.00   99.50    2  stress
15:15:52    0   6523    0.50    0.00    0.00    0.00    0.50    0  pidstat
...
Average:  UID    PID    %usr %system  %guest   %wait    %CPU  CPU  Command
Average:    0   2890    0.00    0.10    0.00    0.00    0.10    -  sshd
Average:    0   4730    0.00    0.10    0.00    0.00    0.10    -  kworker/1:1-events
Average:    0   6445    0.00    0.10    0.00    0.00    0.10    -  kworker/u256:1-events_freezable_power_
Average:    0   6489   99.80    0.00    0.00    0.00   99.80    -  stress
Average:    0   6523    0.30    0.60    0.00    0.00    0.90    -  pidstat

You can see that the CPU utilization of the stress process is close to 100%.
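Reading the summary rows by eye works for a handful of processes; for longer output the same lookup can be scripted. The sketch below assumes pidstat is run in the C locale (LC_ALL=C), so that summary rows start with "Average:", and that the column layout matches the one shown in this article (UID, PID, %usr, %system, %guest, %wait, %CPU, CPU, Command). Other sysstat versions may lay out the columns differently, and top_cpu_command is a made-up helper name.

```shell
# top_cpu_command: read `pidstat -u` output on stdin and print the PID and
# command name of the process with the highest average %CPU.
# Assumed layout of summary rows (C locale):
#   Average: UID PID %usr %system %guest %wait %CPU CPU Command
# so %CPU is field 8 and Command is field 10. (Hypothetical helper.)
top_cpu_command() {
    awk 'BEGIN { max = -1 }
         # Skip the header row: only rows with a numeric PID are processes.
         $1 == "Average:" && $3 ~ /^[0-9]+$/ {
             if ($8 + 0 > max) { max = $8 + 0; pid = $3; cmd = $10 }
         }
         END { if (pid != "") print pid, cmd }'
}

# Possible use:
#   LC_ALL=C pidstat -u 2 3 | top_cpu_command
```

For the run above, the summary rows would yield PID 6489 and the command stress.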

Scenario 2: I/O-intensive process

In the first terminal, use stress to simulate I/O pressure, that is, by calling sync() continuously:

stress -i 1 --timeout 60

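Then, in a second terminal, watch how the load average changes with uptime (the -d option of watch highlights what changed between updates):

watch -d uptime

Next, in a third terminal, use mpstat to watch CPU utilization (-P ALL reports on all CPUs; the trailing 2 3 means one report every 2 seconds, 3 reports in total):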

root@andy:~# mpstat -P ALL 2 3
Linux 4.18.0-12-generic (andy)  2020-05-17  _x86_64_  (4 CPU)
...
16:47:09  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
16:47:11  all    0.75    0.00   24.62    0.00    0.00    0.12    0.00    0.00    0.00   74.50
16:47:11    0    1.00    0.00    0.00    0.00    0.00    0.50    0.00    0.00    0.00   98.51
16:47:11    1    2.00    0.00   98.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
16:47:11    2    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.50
16:47:11    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
...
Average:  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:  all    0.71    0.00   24.55    0.00    0.00    1.29    0.00    0.00    0.00   73.45
Average:    0    0.33    0.00    0.83    0.00    0.00    0.17    0.00    0.00    0.00   98.66
Average:    1    2.34    0.00   97.66    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:    2    0.00    0.00    0.34    0.00    0.00    0.00    0.00    0.00    0.00   99.66
Average:    3    0.00    0.00    0.00    0.00    0.00    4.85    0.00    0.00    0.00   95.15

Use pidstat to see which process is driving the load up:

root@andy:~# pidstat -u 2 3
Linux 4.18.0-12-generic (andy)  2020-05-17  _x86_64_  (4 CPU)
...
16:48:12  UID    PID    %usr %system  %guest   %wait    %CPU  CPU  Command
16:48:14    0   3196    0.00    1.50    0.00    0.00    1.50    0  kworker/0:3-events
16:48:14    0   3405    0.00    1.50    0.00    0.00    1.50    2  kworker/2:2-mpt_poll_0
16:48:14    0  14984    2.00   98.00    0.00    0.00  100.00    1  stress
16:48:14    0  15000    0.00    1.00    0.00    0.00    1.00    2  pidstat
...
Average:  UID    PID    %usr %system  %guest   %wait    %CPU  CPU  Command
Average:  125   1439    0.00    0.17    0.00    0.00    0.17    -  mysqld
Average:  123   1714    0.33    0.00    0.00    0.00    0.33    -  gnome-shell
Average:    0   3196    0.00    0.50    0.00    0.00    0.50    -  kworker/0:3-events
Average:    0   3405    0.00    0.50    0.00    0.00    0.50    -  kworker/2:2-mpt_poll_0
Average:    0   5895    0.00    0.17    0.00    0.00    0.17    -  watch
Average:    0  14984    1.32   98.34    0.00    0.00   99.67    -  stress
Average:    0  15000    0.00    0.99    0.00    0.00    0.99    -  pidstat

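Taken together, this is still a CPU problem: %iowait stays at 0.00, while the stress process keeps one CPU about 98% busy in kernel mode (%sys in mpstat, %system in pidstat).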
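The judgment above boils down to asking which of %usr, %sys and %iowait dominates on the busy CPU. Here is a rough sketch of that comparison; dominant_cpu_state is a hypothetical helper, and picking the single largest column is a simplification rather than a complete diagnosis.

```shell
# dominant_cpu_state USR SYS IOWAIT
# Print which of the three mpstat columns is largest for one CPU row:
# "usr" (user-mode work), "sys" (kernel-mode work) or "iowait"
# (idle while waiting for I/O). Ties go to the earlier column.
# (Hypothetical helper for illustration.)
dominant_cpu_state() {
    awk -v usr="$1" -v sys="$2" -v iowait="$3" 'BEGIN {
        name = "usr"; max = usr + 0
        if (sys + 0 > max)    { name = "sys";    max = sys + 0 }
        if (iowait + 0 > max) { name = "iowait"; max = iowait + 0 }
        print name
    }'
}
```

Fed with the average row of CPU 1 from this scenario (2.34, 97.66, 0.00) it prints sys, while the busy CPU in scenario 1 (100.00, 0.00, 0.00) gives usr. Only a row where %iowait dominates would point at the disk rather than the CPU.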

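Scenario 3: many processes

When the system runs more processes than its CPUs can handle, some processes end up waiting for a CPU. In the first terminal, start 6 CPU-bound stress workers: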

stress -c 6 --timeout 60

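Then, in a second terminal, watch how the load average changes with uptime (the -d option of watch highlights what changed between updates):

watch -d uptime

16:57:41 up 3:02, 3 users, load average: 6.03, 3.05, 1.32

The system has 4 CPUs, so a 1-minute load average of 6.03 means the CPUs are overloaded.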

Next, in a third terminal, use mpstat to watch CPU utilization (-P ALL reports on all CPUs; the trailing 2 3 means one report every 2 seconds, 3 reports in total):

root@andy:~# mpstat -P ALL 2 3
Linux 4.18.0-12-generic (andy)  2020-05-17  _x86_64_  (4 CPU)
...
17:00:45  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
17:00:47  all   99.75    0.00    0.13    0.00    0.00    0.13    0.00    0.00    0.00    0.00
17:00:47    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
17:00:47    1   99.49    0.00    0.00    0.00    0.00    0.51    0.00    0.00    0.00    0.00
17:00:47    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
17:00:47    3   98.98    0.00    1.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00
...
Average:  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:  all   99.66    0.00    0.30    0.00    0.00    0.04    0.00    0.00    0.00    0.00
Average:    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:    1   99.48    0.00    0.34    0.00    0.00    0.17    0.00    0.00    0.00    0.00
Average:    2   99.67    0.00    0.33    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:    3   99.32    0.00    0.68    0.00    0.00    0.00    0.00    0.00    0.00    0.00

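All four CPUs are saturated. pidstat then clearly shows the stress processes waiting for CPU time (the %wait column, roughly 30% to 40% each):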

root@andy:~# pidstat -u 2 3
Linux 4.18.0-12-generic (andy)  2020-05-17  _x86_64_  (4 CPU)
...
17:02:33  UID    PID    %usr %system  %guest   %wait    %CPU  CPU  Command
17:02:35    0   2890    0.00    0.50    0.00    0.00    0.50    1  sshd
17:02:35    0  15605   59.50    0.00    0.00   40.50   59.50    2  stress
17:02:35    0  15606   66.00    0.00    0.00   34.00   66.00    1  stress
17:02:35    0  15607   67.00    0.50    0.00   33.00   67.50    0  stress
17:02:35    0  15608   67.00    0.00    0.00   33.00   67.00    3  stress
17:02:35    0  15609   71.00    0.00    0.00   29.50   71.00    2  stress
17:02:35    0  15610   66.00    0.50    0.00   34.50   66.50    1  stress
17:02:35    0  16325    0.50    2.50    0.00    0.50    3.00    2  pidstat
...
Average:  UID    PID    %usr %system  %guest   %wait    %CPU  CPU  Command
Average:    0   2890    0.00    0.16    0.00    0.00    0.16    -  sshd
Average:    0   5895    0.00    0.16    0.00    0.00    0.16    -  watch
Average:    0  15296    0.00    0.16    0.00    0.33    0.16    -  kworker/u256:1-events_unbound
Average:    0  15605   61.12    0.00    0.00   38.55   61.12    -  stress
Average:    0  15606   66.56    0.16    0.00   33.28   66.72    -  stress
Average:    0  15607   69.19    0.16    0.00   30.48   69.36    -  stress
Average:    0  15608   67.22    0.16    0.00   32.29   67.38    -  stress
Average:    0  15609   68.04    0.00    0.00   31.96   68.04    -  stress
Average:    0  15610   62.60    0.16    0.00   37.73   62.77    -  stress
Average:    0  16325    0.33    2.47    0.00    0.49    2.80    -  pidstat
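The numbers in this scenario line up with simple arithmetic. If 6 always-runnable workers share 4 CPUs equally, each can run about 4/6 of the time and waits for the rest. The sketch below computes that idealized split; it assumes perfectly fair scheduling and no other load, and fair_share is a made-up helper name.

```shell
# fair_share WORKERS NCPU
# Print the ideal share of time each CPU-bound worker spends running and
# waiting when WORKERS processes share NCPU CPUs equally.
# Idealized model: perfectly fair scheduling, nothing else running.
# (Hypothetical helper for illustration.)
fair_share() {
    awk -v w="$1" -v n="$2" 'BEGIN {
        # With no more workers than CPUs, nobody has to wait.
        run = (w > n) ? 100 * n / w : 100
        printf "run=%.1f%% wait=%.1f%%\n", run, 100 - run
    }'
}
```

For 6 workers on 4 CPUs this prints run=66.7% wait=33.3%, close to the %usr (about 61% to 69%) and %wait (about 30% to 39%) averages that pidstat reports for the stress processes.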

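Summary:

The load average offers a quick view of overall system performance and reflects the overall load on the system. But looking at the load average alone does not tell you where the bottleneck is. So, when interpreting the load average, keep the following in mind:

1. A high load average may be caused by CPU-intensive processes.

2. A high load average does not necessarily mean high CPU utilization; it can also be caused by heavy I/O.

3. When the load average is high, tools such as mpstat and pidstat can help you track down where the load comes from.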
