perf 是Linux的一款性能分析工具,能够进行函数级和指令级的热点查找,可以用来分析程序中热点函数的CPU占用率,从而定位性能瓶颈。

Performance analysis tools for Linux.

系统性能优化通常可以分为两个阶段:性能分析和性能优化。

  • 性能分析的目的是查找性能瓶颈、热点代码,分析引发性能问题的原因,包括评估程序对硬件资源的使用情况,例如各级cache的访问次数,各级cache的丢失次数、流水线停顿周期、前端总线访问次数等。评估程序对操作系统资源的使用情况,系统调用次数、上下文切换次数、任务迁移次数等。
  • 基于性能分析,可以进行性能优化,包括:算法优化(空间复杂度和时间复杂度的权衡)和代码优化(提高执行速度、减少内存占用)。

1.perf的原理

Linux性能计数器是一个基于内核的子系统,它提供一个性能分析框架,比如硬件(CPU、PMU(Performance Monitoring Unit))功能和软件(软件计数器、tracepoint)功能,tracepoints是预埋在内核源码中的一些hook,它们能够在特定的代码被执行到时触发,这一特定能够被各类trace/debug工具所使用。perf将tracepoint产生的时间记录下来,生成报告,这些tracepint的对应的sysfs节点在/sys/kernel/debug/tracing/events目录下。

通过perf,应用程序可以利用PMU、tracepoint和内核中的计数器来进行性能统计。

Perf 可以对众多软硬件事件(程序进行到函数级别)采样,还能采集出跟踪点(trace points)的信息(比如系统调用、TCP/IP事件和文件系统操作,从而了解程序的性能瓶颈在哪里。其基本原理是:每隔一个固定时间,就是CPU上产生一个中断,看当前是哪个进程、哪个函数,然后给对应的进程和函数加一个统计值,这样就知道CPU有多少时间在某个进程或某个函数上了。perf的代码和Linux内核代码放在一起,是内核级的工具。perf是在Linux上做剖析分析的首选工具。

2.perf命令介绍

perf支持的命令如下

root: /5g_build/5g_Main/$ perf -h

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

 

annotate:读取 perf.data(由 perf record记录)并显示带注释的代码,需要在编译应用程序时加入-g选项

archive:用perf.data文件中找到的build-ids的对象文件创建档案。

bench:对系统调度、内存访问、epoll、Futex等进行压力测试。

buildid-cache: 管理build-id缓存

buildid-list:列出perf.data文件中的buildids。

c2c:共享数据C2C/HITM分析仪。

config:读取或设置配置文件中的变量

data:数据文件相关处理

diff: 读取perf.data文件并显示差分曲线

ftrace:内核的ftrace功能的简单封装器

inject:用额外的信息来增加事件流的过滤器

kallsyms:搜索运行中的内核中的符号

kmem:追踪/测量内核内存属性的工具

kvm: 追踪/测量kvm客户操作系统的工具

list:列出所有象征性的事件类型

lock:分析锁事件

mem:分析内存访问

record:将所有的分析记录进perf.data

report:读取perf.data(由perf记录创建)并显示概况

sched:跟踪/测量调度器属性(延迟)的工具

script: 读取perf.data(由perf记录创建)并显示跟踪输出

stat:运行一个命令并收集性能计数器的统计数据

test:测试系统内核支持的功能

timechart:在工作负载期间可视化整个系统行为的工具

top:系统分析工具

probe:定义新的动态跟踪点

trace:strace启发的工具

perf list

列出所有能够触发perf采样点的事件:

List of pre-defined events (to be used in -e):

  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  msr/pperf/                                         [Kernel PMU event]
  msr/smi/                                           [Kernel PMU event]
  msr/tsc/                                           [Kernel PMU event]
  power/energy-cores/                                [Kernel PMU event]
  power/energy-pkg/                                  [Kernel PMU event]
  power/energy-ram/                                  [Kernel PMU event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[/len][:access]                          [Hardware breakpoint]

  block:block_bio_backmerge                          [Tracepoint event]
  block:block_bio_bounce                             [Tracepoint event]
  block:block_bio_complete                           [Tracepoint event]
  block:block_bio_frontmerge                         [Tracepoint event]
  block:block_bio_queue                              [Tracepoint event]
  block:block_bio_remap                              [Tracepoint event]
  block:block_dirty_buffer                           [Tracepoint event]
  block:block_getrq                                  [Tracepoint event]
  block:block_plug                                   [Tracepoint event]
  block:block_rq_abort                               [Tracepoint event]

主要有三类:

  • Hardware Event: 是由PMU硬件产生的事件,比如 cache 命中
  • Software Event:是内核软件产生的事件,比如进程切换,tick 数等
  • Tracepoint event:是内核中的静态 tracepoint 所触发的事件

perf stat

采集程序的运行时间和CPU开销,perf stat支持的主要参数:

-a, --all-cpus        system-wide collection from all CPUs
    -A, --no-aggr         disable CPU count aggregation
    -B, --big-num         print large numbers with thousands' separators
    -C, --cpu <cpu>       list of cpus to monitor in system-wide
    -D, --delay <n>       ms to wait before starting measurement after program start (-1: start with events disabled)
    -d, --detailed        detailed run - start a lot of events
    -e, --event <event>   event selector. use 'perf list' to list available events
    -G, --cgroup <name>   monitor event in cgroup name only
    -g, --group           put the counters into a counter group
    -I, --interval-print <n>
                          print counts at regular interval in ms (overhead is possible for values <= 100ms)
    -i, --no-inherit      child tasks do not inherit counters
    -M, --metrics <metric/metric group list>
                          monitor specified metrics or metric groups (separated by ,)
    -n, --null            null run - dont start any counters
    -o, --output <file>   output file name
    -p, --pid <pid>       stat events on existing process id
    -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
    -S, --sync            call sync() before starting a run
    -t, --tid <tid>       stat events on existing thread id
    -T, --transaction     hardware transaction statistics
    -v, --verbose         be more verbose (show counter open errors, etc)

使用命令perf stat -p <pid>可以采集某一个进程的运行信息,<pid>可以通过top工具获得:

top - 16:51:29 up 24 days, 1 min,  2 users,  load average: 0.48, 0.68, 1.00
Tasks:  26 total,   2 running,  24 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.6 us,  1.2 sy,  0.0 ni, 85.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32763512 total, 14078456 used, 18685056 free,   842136 buffers
KiB Swap:        0 total,        0 used,        0 free. 10791840 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                      
31404 root      20   0   19988    892    684 R 100.0 0.003   0:08.62 test                                                                                                                         
16506 root      20   0  636592  26840  16040 S 1.333 0.082   2:44.31 node                                                                                                                         
 1103 root      20   0 1010144  42180  15692 S 0.667 0.129   1:37.42 node                                                                                                                         
 1072 root      20   0  998896 104500  18832 S 0.333 0.319   1:19.19 node                                                                                                                         
    1 root      20   0   26864   2044   1636 S 0.000 0.006   0:00.05 code-server-x86                                                                                                              
 1030 root      20   0   73888   3640   2872 S 0.000 0.011   0:00.00 sshd                                                                                                                         
 1033 root      20   0   35836   2216   1732 S 0.000 0.007   0:00.00 rsyncd                                                                                                                       
 1035 root      20   0   26864    932    528 S 0.000 0.003   0:00.00 code-server-x86                                                                                                              
 1036 root      20   0  205680  32388   6180 S 0.000 0.099   0:11.50 python3                                                                                                                      
 1038 root      20   0  129800  31800   5612 S 0.000 0.097   0:11.43 python3                                                                                                                      
 1042 root      20   0  723032  36572  16052 S 0.000 0.112   0:01.49 node                                                                                                                         
16465 root      20   0 1072336 163008  23064 S 0.000 0.498   0:48.81 node                                                                                                                         
16476 root      20   0  896504  98148  15460 S 0.000 0.300   0:13.43 node                                                                                                                         
16512 root      20   0  659364  84108  18608 S 0.000 0.257   0:03.60 node                                                                                                                         
16514 root      20   0  616684  23924  14844 S 0.000 0.073   0:01.01 node                                                                                                                         
16537 root      20   0  607840  29452  17456 S 0.000 0.090   0:00.97 node                                                                                                                         
16543 root      20   0  605776  25336  14788 S 0.000 0.077   0:00.91 node                                                                                                                         
16550 root      20   0  605728  27440  14792 S 0.000 0.084   0:00.91 node                                                                                                                         
16557 root      20   0  606312  25904  14848 S 0.000 0.079   0:00.94 node                                                                                                                         
27352 root      20   0 1595104 483140   7972 S 0.000 1.475   5:00.23 cpptools                                                                                                                     
29068 root      20   0 4902488   8660   3988 S 0.000 0.026   0:00.21 cpptools-srv                                                                                                                 
29529 root      20   0   30560   6060   2032 S 0.000 0.018   0:00.04 bash                                                                                                                         
31411 root      20   0   30456   5968   1940 S 0.000 0.018   0:00.03 bash                                                                                                                         
31455 root      20   0   28764   1740   1280 R 0.000 0.005   0:00.00 top                                                                                                                          
31475 root      20   0   26860   1972   1588 S 0.000 0.006   0:00.00 cpuUsage.sh                                                                                                                  
31479 root      20   0   21564    964    764 S 0.000 0.003   0:00.00 sleep                                                                                                                        





root: /5g_build/5g_Main/$ perf stat -p 31404
^C
 Performance counter stats for process id '31404':

      18958.321499      task-clock (msec)         #    1.000 CPUs utilized          
                28      context-switches          #    0.001 K/sec                  
                 6      cpu-migrations            #    0.000 K/sec                  
                 0      page-faults               #    0.000 K/sec                  
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               

      18.959020937 seconds time elapsed
  • Task-clock(msec):CPU利用率,该值高,说明程序的多数时间花费在 CPU 计算上而非 IO
  • Context-switches:进程切换次数,记录了程序运行过程中发生了多少次进程切换,频繁的进程切换是应该避免的
  • Cache-misses:程序运行过程中总体的 cache 利用情况,如果该值过高,说明程序的 cache 利用不好
  • CPU-migrations:表示进程31404运行过程中发生了多少次 CPU 迁移,即被调度器从一个 CPU 转移到另外一个 CPU 上运行
  • Cycles:处理器时钟,一条机器指令可能需要多个 cycles
  • Instructions: 机器指令数目
  • IPC:是 Instructions/Cycles 的比值,该值越大越好,说明程序充分利用了处理器的特性
  • Cache-references: cache命中的次数
  • Cache-misses: cache失效的次数
  • Page-faults:缺页异常的次数。当应用程序请求的页面尚未建立、请求的页面不在内存中,或者请求的页面虽然在内存中,但物理地址和虚拟地址的映射关系尚未建立时,都会触发一次缺页异常。另外TLB不命中,页面访问权限不匹配等情况也会触发缺页异常。
  • Branches:遇到的分支指令数
  • Branch-misses:预测错误的分支指令数

perf record

剖析采样可以帮助我们采集到程序运行的特征,而且剖析精度非常高,可以定位到具体的代码行和指令块,perf record支持的参数如下:

-a, --all-cpus        system-wide collection from all CPUs
    -b, --branch-any      sample any taken branches
    -B, --no-buildid      do not collect buildids in perf.data
    -c, --count <n>       event period to sample
    -C, --cpu <cpu>       list of cpus to monitor
    -d, --data            Record the sample addresses
    -D, --delay <n>       ms to wait before starting measurement after program start (-1: start with events disabled)
    -e, --event <event>   event selector. use 'perf list' to list available events
    -F, --freq <freq or 'max'>
                          profile at this frequency
    -g                    enables call-graph recording
    -G, --cgroup <name>   monitor event in cgroup name only
    -I, --intr-regs[=<any register>]
                          sample selected machine registers on interrupt, use '-I?' to list register names
    -i, --no-inherit      child tasks do not inherit counters
    -j, --branch-filter <branch filter mask>
                          branch stack filter modes
    -k, --clockid <clockid>
                          clockid to use for events, see clock_gettime()
    -m, --mmap-pages <pages[,pages]>
                          number of mmap data pages and AUX area tracing mmap pages
    -N, --no-buildid-cache
                          do not update the buildid cache
    -n, --no-samples      don't sample
    -o, --output <file>   output file name
    -P, --period          Record the sample period
    -p, --pid <pid>       record events on existing process id
    -q, --quiet           don't print any message
    -R, --raw-samples     collect raw sample records from all opened counters
    -r, --realtime <n>    collect data with this RT SCHED_FIFO priority
    -S, --snapshot[=<opts>]
                          AUX area tracing Snapshot Mode
    -s, --stat            per thread counts
    -t, --tid <tid>       record events on existing thread id
    -T, --timestamp       Record the sample timestamps
    -u, --uid <user>      user to profile
    -v, --verbose         be more verbose (show counter open errors, etc)

通过record命令perf会把记录的数据存储在perf.data中,然后用report命令查看:

root: /5g_build/5g_Main/$ perf record -F 1000 -p 31404

以1kHz的采样率对进程31404进行采样

如果想关心perf的具体某个事件,希望按某个事件来进行统计则可以通过-e来指定某个事件

使用tracepoint的基本需求是对内核的运行时行为的关心

perf record -e raw_syscalls:sys_enter ./test列出系统调用次数,查看系统调用主要发生在哪里:

Samples: 144  of event 'raw_syscalls:sys_enter', Event count (approx.): 144
Overhead  Trace output
   5.56%  NR 9 (0, 1000, 3, 22, ffffffff, 0)
   2.08%  NR 3 (3, 1a39e, 1, 2, 3, 0)
   2.08%  NR 9 (0, 1a39e, 1, 2, 3, 0)
   1.39%  NR 0 (3, 7f661ec89000, 1000, 22, ffffffff, 0)
   0.69%  NR 0 (3, 7f661ec89000, 1000, 0, 3d7467786b746674, 7471766b70716f3d)
   0.69%  NR 0 (3, 7f661ec89000, 1000, 6874206461657220, 7f661ec894a8, 7f661ea64698)
   0.69%  NR 0 (3, 7ffe58453830, 340, 10eddc0, 80000000, 7ffe58453817)
   0.69%  NR 0 (3, 7ffe58453840, 340, 10ec700, 80000000, 7ffe58453827)
   0.69%  NR 0 (3, 7ffe58453ae0, 340, 10ec4f0, 90000001, 7ffe58453ac7)
   0.69%  NR 0 (3, 7ffe58453b00, 340, 10ed8d0, 90000001, 7ffe58453ae7)
   0.69%  NR 0 (3, 7ffe58453cc0, 340, 7f661ec8a4b8, 0, 7ffe58453ca7)
   0.69%  NR 0 (3, 7ffe58453cf0, 340, 7f661ec6ee58, 0, 7ffe58453cd7)
   0.69%  NR 0 (3, 7ffe58453d20, 340, 7f661ec6e988, 0, 7ffe58453d07)
   0.69%  NR 0 (3, 7ffe58453e40, 340, 7f661ec6e4b8, 0, 7ffe58453e27)
   0.69%  NR 0 (3, 7ffe58453f70, 340, 7f661ec8dc90, 4000000, 7ffe58453f57)
   0.69%  NR 0 (5, 7f661ec89000, 1000, 22, ffffffff, 0)
   0.69%  NR 0 (5, 7f661ec89000, 1000, 656d616e79622e77, 7f661ec896dd, 6f64616873206568)
   0.69%  NR 10 (600000, 1000, 1, 7f661ec8d148, 0, 0)
   0.69%  NR 10 (7f661d868000, 1ff000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661da67000, 1000, 1, 10eddf0, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 10 (7f661da79000, 200000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661dc79000, 1000, 1, 10ed8f0, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 10 (7f661dc91000, 1ff000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661de90000, 1000, 1, 10ed2c0, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 10 (7f661de9b000, 1ff000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661e09a000, 1000, 1, 10ece10, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 10 (7f661e0a3000, 1ff000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661e2a2000, 1000, 1, 7f661ec8a4d0, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 10 (7f661e2bc000, 1ff000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661e4bb000, 1000, 1, 7f661ec8a000, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 10 (7f661e4c3000, 200000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661e6c3000, 1000, 1, 7f661ec6e9a0, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 10 (7f661e861000, 1ff000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661ea60000, 4000, 1, 7f661ec6e4d0, 0, 7f661ec8d6e0)
   0.69%  NR 10 (7f661eb6e000, ff000, 0, 802, 3, 0)
   0.69%  NR 10 (7f661ec8b000, 1000, 1, 7f661ec8c998, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 102 (7f661eb6c842, 7f661ec6d821, 1, ff87a407, f8f0000000000000, 10f62b0)
   0.69%  NR 11 (7f661ec6f000, 1a39e, 2a6c, 7f661ec8c998, 7f661ec6e4d0, 7f661ec6e4d0)
   0.69%  NR 11 (7f661ec6f000, 1a39e, 7ffe58454057, 10ece10, 0, 7f661ec6e4d0)
   0.69%  NR 11 (7f661ec6f000, 1a39e, 7ffe58454077, 10ed8f0, 0, 7f661ec6e4d0)
   0.69%  NR 11 (7f661ec89000, 1000, 0, 6874206461657220, 7f661ec89400, 10ec6d0)
   0.69%  NR 11 (7f661ec89000, 1000, 0, fd09cf21, 10f6160, 7f661eb66700)
   0.69%  NR 11 (7f661ec89000, 1000, 0, fd09cf21, 7f661ec89600, 10eeab0)
   0.69%  NR 11 (7f661ec8a000, b, 1ce6e, 3a, 0, 70000021)
   0.69%  NR 12 (0, 0, 0, 0, 37f, 64)

使用-g选项可以查看到具体某个函数所花费的时间以及函数的调用路径:

Samples: 54K of event 'cpu-clock', Event count (approx.): 13565000000
  Children      Self  Command  Shared Object      Symbol
+   99.99%     0.00%  test     libc-2.22.so       [.] __libc_start_main
+   99.99%     0.00%  test     test               [.] main
+   99.99%    99.99%  test     test               [.] print
     0.00%     0.00%  test     [kernel.kallsyms]  [k] __do_softirq
     0.00%     0.00%  test     [kernel.kallsyms]  [k] ret_from_intr
     0.00%     0.00%  test     [kernel.kallsyms]  [k] system_call_fastpath
     0.00%     0.00%  test     [kernel.kallsyms]  [k] __irqentry_text_start
     0.00%     0.00%  test     [kernel.kallsyms]  [k] irq_exit
     0.00%     0.00%  test     [kernel.kallsyms]  [k] do_softirq
     0.00%     0.00%  test     [kernel.kallsyms]  [k] call_softirq
     0.00%     0.00%  test     [kernel.kallsyms]  [k] scsi_softirq_done
     0.00%     0.00%  test     [kernel.kallsyms]  [k] pagevec_lru_move_fn
     0.00%     0.00%  test     ld-2.22.so         [.] _dl_sysdep_start
     0.00%     0.00%  test     ld-2.22.so         [.] munmap
     0.00%     0.00%  test     [kernel.kallsyms]  [k] __call_rcu
     0.00%     0.00%  test     [kernel.kallsyms]  [k] _raw_qspin_lock
     0.00%     0.00%  test     [kernel.kallsyms]  [k] get_empty_filp
     0.00%     0.00%  test     ld-2.22.so         [.] _dl_map_object
     0.00%     0.00%  test     libc-2.22.so       [.] _IO_file_close
     0.00%     0.00%  test     [unknown]          [k] 0000000000000000
     0.00%     0.00%  test     [kernel.kallsyms]  [k] int_signal
     0.00%     0.00%  test     [kernel.kallsyms]  [k] sys_munmap
     0.00%     0.00%  test     ld-2.22.so         [.] _dl_setup_hash
     0.00%     0.00%  test     libc-2.22.so       [.] _IO_file_open
     0.00%     0.00%  test     [kernel.kallsyms]  [k] async_page_fault
     0.00%     0.00%  test     [kernel.kallsyms]  [k] do_notify_resume
     0.00%     0.00%  test     [kernel.kallsyms]  [k] vm_munmap
     0.00%     0.00%  test     [kernel.kallsyms]  [k] do_async_page_fault
     0.00%     0.00%  test     [kernel.kallsyms]  [k] do_munmap
     0.00%     0.00%  test     [kernel.kallsyms]  [k] sys_open
     0.00%     0.00%  test     [kernel.kallsyms]  [k] task_work_run
     0.00%     0.00%  test     [kernel.kallsyms]  [k] ____fput
     0.00%     0.00%  test     [kernel.kallsyms]  [k] do_sys_open
     0.00%     0.00%  test     [kernel.kallsyms]  [k] trace_do_page_fault
     0.00%     0.00%  test     [kernel.kallsyms]  [k] unmap_region
     0.00%     0.00%  test     [kernel.kallsyms]  [k] __do_page_fault
     0.00%     0.00%  test     [kernel.kallsyms]  [k] __fput
     0.00%     0.00%  test     [kernel.kallsyms]  [k] do_filp_open
     0.00%     0.00%  test     [kernel.kallsyms]  [k] lru_add_drain
     0.00%     0.00%  test     [kernel.kallsyms]  [k] blk_done_softirq
     0.00%     0.00%  test     [kernel.kallsyms]  [k] call_rcu_sched
     0.00%     0.00%  test     [kernel.kallsyms]  [k] handle_mm_fault
     0.00%     0.00%  test     [kernel.kallsyms]  [k] lru_add_drain_cpu
     0.00%     0.00%  test     [kernel.kallsyms]  [k] path_openat

perf report

Usage: perf report [<options>]

    -b, --branch-stack    use branch records for per branch histogram filling
    -c, --comms <comm[,comm...]>
                          only consider symbols in these comms
    -C, --cpu <cpu>       list of cpus to profile
    -d, --dsos <dso[,dso...]>
                          only consider symbols in these dsos
    -D, --dump-raw-trace  dump raw trace in ASCII
    -F, --fields <key[,keys...]>
                          output field(s): overhead, period, sample plus all of sort keys
    -f, --force           don't complain, do it
    -g, --call-graph <print_type,threshold[,print_limit],order,sort_key[,branch],value>
                          Display call graph (stack chain/backtrace):

                                print_type:     call graph printing style (graph|flat|fractal|folded|none)
                                threshold:      minimum call graph inclusion threshold (<percent>)
                                print_limit:    maximum number of call graph entry (<number>)
                                order:          call graph order (caller|callee)
                                sort_key:       call graph sort key (function|address)
                                branch:         include last branch info to call graph (branch)
                                value:          call graph value (percent|period|count)

                                Default: graph,0.5,caller,function,percent
    -G, --inverted        alias for inverted call graph
    -i, --input <file>    input file name
    -I, --show-info       Display extended information about perf.data file
    -k, --vmlinux <file>  vmlinux pathname
    -M, --disassembler-style <disassembler style>
                          Specify disassembler style (e.g. -M intel for intel syntax)
    -m, --modules         load module symbols - WARNING: use only with -k and LIVE kernel
    -n, --show-nr-samples
                          Show a column with the number of samples
    -p, --parent <regex>  regex filter to identify parent, see: '--sort parent'
    -q, --quiet           Do not show any message
    -s, --sort <key[,key2...]>
                          sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... Please refer the man page for the complete list.
    -S, --symbols <symbol[,symbol...]>
                          only consider these symbols
    -t, --field-separator <separator>
                          separator for columns, no spaces will be added between columns '.' is reserved.
    -T, --threads         Show per-thread event counters
    -U, --hide-unresolved
                          Only display entries resolved to a symbol
    -v, --verbose         be more verbose (show symbol address, etc)
    -w, --column-widths <width[,width...]>
                          don't try to adjust column width, use these fixed values
    -x, --exclude-other   Only display entries with parent-match
        --asm-raw         Display raw encoding of assembly instructions (default)
        --branch-history  add last branch records to call history
        --children        Accumulate callchains of children and show total overhead as well
        --demangle        Disable symbol demangling
        --demangle-kernel
                          Enable kernel symbol demangling
        --full-source-path
                          Show full source file name path for source lines
        --group           Show event group information together
        --gtk             Use the GTK2 interface
        --header          Show data header.
        --header-only     Show only data header.
        --hierarchy       Show entries in a hierarchy
        --ignore-callees <regex>
                          ignore callees of these functions in call graphs
        --inline          Show inline function
        --itrace[=<opts>]
                          Instruction Tracing options
        --kallsyms <file>
                          kallsyms pathname
        --max-stack <n>   Set the maximum stack depth when parsing the callchain, anything beyond the specified depth will be ignored. Default: kernel.perf_event_max_stack or 127
        --mem-mode        mem access profile
        --objdump <path>  objdump binary to use for disassembly and annotations
        --percent-limit <percent>
                          Don't show entries under that percent
        --percentage <relative|absolute>
                          how to display percentage of filtered entries
        --pid <pid[,pid...]>
                          only consider symbols in these pids
        --pretty <key>    pretty printing style key: normal raw
        --raw-trace       Show raw trace event output (do not use print fmt or plugins)
        --show-cpu-utilization
                          Show sample percentage for different cpu modes
        --show-ref-call-graph
                          Show callgraph from reference event
        --show-total-period
                          Show a column with the sum of periods
        --socket-filter <n>
                          only show processor socket that match with this filter
        --source          Interleave source code with assembly code (default)
        --stdio           Use the stdio interface
        --stdio-color <mode>
                          'always' (default), 'never' or 'auto' only applicable to --stdio mode
        --symbol-filter <filter>
                          only show symbols that (partially) match with this filter
        --symfs <directory>
                          Look for files with symbols relative to this directory
        --tid <tid[,tid...]>
                          only consider symbols in these tids
        --time <str>      Time span of interest (start,stop)
        --tui             Use the TUI interface

使用命令perf report -i perf.data > perf.txt把结果写入文本文件:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 21K of event 'cpu-clock'
# Event count (approx.): 21067000000
#
# Overhead  Command  Shared Object      Symbol                                    
# ........  .......  .................  ..........................................
#
    99.87%  test     test               [.] print
     0.04%  test     [kernel.kallsyms]  [k] __do_softirq
     0.02%  test     [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.01%  test     [kernel.kallsyms]  [k] find_busiest_group
     0.01%  test     [kernel.kallsyms]  [k] run_timer_softirq
     0.00%  test     [kernel.kallsyms]  [k] _raw_qspin_lock
     0.00%  test     [kernel.kallsyms]  [k] check_for_new_grace_period.isra.26
     0.00%  test     [kernel.kallsyms]  [k] iowrite16
     0.00%  test     [kernel.kallsyms]  [k] kmem_cache_free
     0.00%  test     [kernel.kallsyms]  [k] mod_timer_pending
     0.00%  test     [kernel.kallsyms]  [k] nf_ct_delete_from_lists
     0.00%  test     [kernel.kallsyms]  [k] packet_rcv
     0.00%  test     [kernel.kallsyms]  [k] rcu_process_callbacks
     0.00%  test     [kernel.kallsyms]  [k] rcu_process_gp_end
     0.00%  test     [kernel.kallsyms]  [k] rebalance_domains
     0.00%  test     [kernel.kallsyms]  [k] source_load


#
# (Tip: Save output of perf stat using: perf stat record <target workload>)
#
  • Overhead:指出了该Symbol采样在总采样中所占的百分比。在当前场景下,表示了该Symbol消耗的CPU时间占总CPU时间的百分比
  • Command:进程名
  • Shared Object:模块名, 比如具体哪个共享库,哪个可执行程序
  • Symbol:二进制模块中的符号名,如果是高级语言,比如C语言编写的程序,等价于函数名

perf report命令下移动光标到某一函数上敲击Enter键,还可以更详细地查看分析结果:

Annotate print                   --- 分析print函数中指令或者代码的性能
Zoom into test thread             --- 聚焦到线程 test
Zoom into test DSO              --- 聚焦到动态共享对象test
Browse map details                --- 查看map
Run scripts for samples of thread [test]--- 针对test线程的采样运行脚本
Run scripts for samples of symbol [test] --- 针对函数的采样运行脚本   
Run scripts for all samples       --- 针对所有采样运行脚步
Switch to another data file in PWD --- 切换到当前目录中另一个数据文件
Exit

annotate

Usage: perf annotate [<options>]

    -C, --cpu <cpu>       list of cpus to profile
    -d, --dsos <dso[,dso...]>
                          only consider symbols in these dsos
    -D, --dump-raw-trace  dump raw trace in ASCII
    -f, --force           don't complain, do it
    -i, --input <file>    input file name
    -k, --vmlinux <file>  vmlinux pathname
    -l, --print-line      print matching source lines (may be slow)
    -M, --disassembler-style <disassembler style>
                          Specify disassembler style (e.g. -M intel for intel syntax)
    -m, --modules         load module symbols - WARNING: use only with -k and LIVE kernel
    -n, --show-nr-samples
                          Show a column with the number of samples
    -P, --full-paths      Don't shorten the displayed pathnames
    -q, --quiet           do now show any message
    -s, --symbol <symbol>
                          symbol to annotate
    -v, --verbose         be more verbose (show symbol address, etc)
        --asm-raw         Display raw encoding of assembly instructions (default)
        --group           Show event group information together
        --gtk             Use the GTK interface
        --objdump <path>  objdump binary to use for disassembly and annotations
        --show-total-period
                          Show a column with the sum of periods
        --skip-missing    Skip symbols that cannot be annotated
        --source          Interleave source code with assembly code (default)
        --stdio           Use the stdio interface
        --stdio-color <mode>
                          'always' (default), 'never' or 'auto' only applicable to --stdio mode
        --symfs <directory>
                          Look for files with symbols relative to this directory
        --tui             Use the TUI interface

可以直接使用annotate 来单独分析函数信息:

perf annotate -l -s <func>

perf top

用于实时显示当前系统的性能统计信息。该命令主要用来观察整个系统当前的状态,比如可以通过查看该命令的输出来查看当前系统最耗时的内核函数或某个用户进程,支持参数如下:

Usage: perf top [<options>]

    -a, --all-cpus        system-wide collection from all CPUs
    -b, --branch-any      sample any taken branches
    -c, --count <n>       event period to sample
    -C, --cpu <cpu>       list of cpus to monitor
    -d, --delay <n>       number of seconds to delay between refreshes
    -D, --dump-symtab     dump the symbol table used for profiling
    -E, --entries <n>     display this many functions
    -e, --event <event>   event selector. use 'perf list' to list available events
    -f, --count-filter <n>
                          only display functions with more events than this
    -F, --freq <n>        profile at this frequency
    -g                    enables call-graph recording and display
    -i, --no-inherit      child tasks do not inherit counters
    -j, --branch-filter <branch filter mask>
                          branch stack filter modes
    -K, --hide_kernel_symbols
                          hide kernel symbols
    -k, --vmlinux <file>  vmlinux pathname
    -M, --disassembler-style <disassembler style>
                          Specify disassembler style (e.g. -M intel for intel syntax)
                          hide kernel symbols
    -k, --vmlinux <file>  vmlinux pathname
    -M, --disassembler-style <disassembler style>
                          Specify disassembler style (e.g. -M intel for intel syntax)
    -m, --mmap-pages <pages>
                          number of mmap data pages
    -n, --show-nr-samples
                          Show a column with the number of samples
    -p, --pid <pid>       profile events on existing process id
    -r, --realtime <n>    collect data with this RT SCHED_FIFO priority
    -s, --sort <key[,key2...]>
                          sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... Please refer the man page for the complete list.
    -t, --tid <tid>       profile events on existing thread id
    -U, --hide_user_symbols
                          hide user symbols
    -u, --uid <user>      user to profile
    -v, --verbose         be more verbose (show counter open errors, etc)
    -w, --column-widths <width[,width...]>
                          don't try to adjust column width, use these fixed values
    -z, --zero            zero history across updates
        --asm-raw         Display raw encoding of assembly instructions (default)
        --call-graph <record_mode[,record_size],print_type,threshold[,print_limit],order,sort_key[,branch]>
                          setup and enables call-graph (stack chain/backtrace):

                                record_mode:    call graph recording mode (fp|dwarf|lbr)
                                record_size:    if record_mode is 'dwarf', max size of stack recording (<bytes>)
                                                default: 8192 (bytes)

                                record_mode:    call graph recording mode (fp|dwarf|lbr)
                                record_size:    if record_mode is 'dwarf', max size of stack recording (<bytes>)
                                                default: 8192 (bytes)

                                record_mode:    call graph recording mode (fp|dwarf|lbr)
                                record_size:    if record_mode is 'dwarf', max size of stack recording (<bytes>)
                                                default: 8192 (bytes)

                                record_mode:    call graph recording mode (fp|dwarf|lbr)
                                record_size:    if record_mode is 'dwarf', max size of stack recording (<bytes>)
                                                default: 8192 (bytes)
    -z, --zero            zero history across updates
                                record_mode:    call graph recording mode (fp|dwarf|lbr)
                                record_size:    if record_mode is 'dwarf', max size of stack recording (<bytes>)
                                                default: 8192 (bytes)
                                print_type:     call graph printing style (graph|flat|fractal|folded|none)
                                threshold:      minimum call graph inclusion threshold (<percent>)
                                print_limit:    maximum number of call graph entry (<number>)
                                order:          call graph order (caller|callee)
                                sort_key:       call graph sort key (function|address)
                                branch:         include last branch info to call graph (branch)
                                value:          call graph value (percent|period|count)

                                Default: fp,graph,0.5,caller,function
        --children        Accumulate callchains of children and show total overhead as well
        --percent-limit <percent>
                          Don't show entries under that percent
        --percentage <relative|absolute>
                          How to display percentage of filtered entries
        --proc-map-timeout <n>
                          per thread proc mmap processing timeout in ms
        --raw-trace       Show raw trace event output (do not use print fmt or plugins)
        --show-total-period
                          Show a column with the sum of periods
        --source          Interleave source code with assembly code (default)
        --stdio           Use the stdio interface
        --sym-annotate <symbol name>
                          symbol to annotate
        --symbols <symbol[,symbol...]>
                          only consider these symbols
        --tui             Use the TUI interface

常用选项:

-e <event>:指明要分析的性能事件。
-p <pid>:Profile events on existing Process ID (comma sperated list). 仅分析目标进程及其创建的线程。
-k <path>:Path to vmlinux. Required for annotation functionality. 带符号表的内核映像所在的路径。
-K:不显示属于内核或模块的符号。
-U:不显示属于用户态程序的符号。
-d <n>:界面的刷新周期,默认为2s,因为perf top默认每2s从mmap的内存区域读取一次性能数据。
-g:得到函数的调用关系图

 

Samples: 290K of event 'cpu-clock', Event count (approx.): 21181763519
Overhead  Shared Object                 Symbol
  74.18%  test                          [.] print
   3.76%  [kernel]                      [k] _raw_spin_unlock_irqrestore
   1.47%  [kernel]                      [k] __do_softirq
   0.70%  [kernel]                      [k] run_timer_softirq
   0.55%  [kernel]                      [k] finish_task_switch
   0.52%  [kernel]                      [k] __do_page_fault
   0.34%  [kernel]                      [k] rcu_process_callbacks
   0.31%  [kernel]                      [k] get_vmalloc_info
   0.27%  [kernel]                      [k] find_busiest_group
   0.17%  [kernel]                      [k] rcu_gp_kthread
   0.15%  [kernel]                      [k] free_hot_cold_page
   0.15%  [kernel]                      [k] copy_page_rep
   0.14%  [unknown]                     [.] 0x00007f0f77aaab69
   0.13%  [kernel]                      [k] kmem_cache_free
   0.13%  [kernel]                      [k] clear_page_c_e
   0.13%  [kernel]                      [k] system_call_after_swapgs
   0.12%  [kernel]                      [k] rcu_process_gp_end
   0.12%  [unknown]                     [.] 0x00007f0f77aaab6c
   0.12%  [kernel]                      [k] _raw_qspin_lock
   0.11%  [kernel]                      [k] tick_nohz_idle_enter
   0.11%  [kernel]                      [k] copy_user_enhanced_fast_string
   0.11%  [unknown]                     [.] 0x00007f0f77aaab7a
   0.11%  [kernel]                      [k] unmap_page_range
   0.11%  [kernel]                      [k] __slab_free
   0.10%  [kernel]                      [k] get_page_from_freelist
   0.10%  [kernel]                      [k] load_balance
   0.10%  [kernel]                      [k] rebalance_domains
   0.10%  libpthread-2.22.so            [.] __pthread_rwlock_unlock
   0.10%  [kernel]                      [k] check_for_new_grace_period.isra.26
   0.09%  rg                            [.] __pthread_mutex_unlock
   0.09%  [unknown]                     [.] 0x00007f0f77aaad26
   0.09%  [unknown]                     [.] 0x00007f0f77aaab7d
   0.09%  [unknown]                     [.] 0x00007f0f77aaad29
   0.09%  [kernel]                      [k] file_free_rcu
   0.09%  [kernel]                      [k] free_page_and_swap_cache
   0.09%  [unknown]                     [.] 0x00007f0f77aaab76
   0.09%  [unknown]                     [.] 0x00007f0f77aaab60
   0.09%  [kernel]                      [k] rcu_idle_exit
   0.08%  [unknown]                     [.] 0x00007f0f77aaabb6
   0.08%  [kernel]                      [k] __find_get_page
   0.08%  [unknown]                     [.] 0x00007f0f77aaabcd
  • 第一列:符号引发的性能事件的比例,指占用的cpu周期比例
  • 第二列:符号所在的DSO(Dynamic Shared Object),可以是应用程序、内核、动态链接库、模块
  • 第三列:DSO的类型,[.]表示此符号属于用户态的ELF文件,包括可执行文件与动态链接库;[k]表述此符号属于内核或模块
  • 第四列:符号名,有些符号不能解析为函数名,只能用地址表示

perf top命令和linux下的top命令有点相似,实时打印出系统中被采样事件的状态和统计数据。perf top主要用于实时剖析各个函数在某个性能 事件(event)上的热度,默认的event是cycles(cpu周期数),这样可以检测系统中所有应用层和内核层函数的热度。

perf top支持两种输出界面,tui和tty,默认是tui,因为tui需要更多的环境和库支持,所以经常出现乱码问题,所以本文都是基于tty界面分析(–stdio)。

直接执行perf top监控的是整个系统中所有进程的状态,多数情况我们只关心某个进程,或者想定位某个线程的性能问题,perf top都是支持的(-p / -t)。

需要进入函数内部一探究竟,有时对于像上面的DH_SSM_BLKBUF_ALLOC这样的函数的调用堆栈,以定位到是哪里在频繁调用。这时候可以执行:

perf top -t 31404 --stdio -g -K

-g参数就是显示函数的调用堆栈,-k是为了只输出应用层函数

perf top --call-graph [fractal] 路径概率为相对值,加起来为100%,调用顺序为从下往上

perf top --call-graph graph 路径概率为绝对值,加起来为该函数的热度

关于perf top界面常用命令如下:

h:显示帮助,即可显示详细的帮助信息。
UP/DOWN/PGUP/PGDN/SPACE:上下和翻页。
a:annotate current symbol,注解当前符号。能够给出汇编语言的注解,给出各条指令的采样率。
d:过滤掉所有不属于此DSO的符号。非常方便查看同一类别的符号。
P:将当前信息保存到perf.hist.N中

perf probe

能够动态地在想查看的地方插入动态监测点

perf prob schedule:12 cpu 在内核函数 schedule() 的第 12 行处加入了一个动态 probe 点,和 tracepoint 的功能一样,内核一旦运行到该 probe 点时,便会通知 perf。可以理解为动态增加了一个新的 tracepoint

perf sched

perf sched提供了许多工具来分析内核CPU调度器的行为,可以用它来识别和量化调度器延迟的问题

  • perf sched专门用于跟踪/测量调度器,包括延时等
  • perf sched record <command>:记录测试过程中的调度事件
  • perf sched latency:报告线程调度延时和其他调度相关属性
  • perf sched script:查看执行过程中详细的trace信息
  • perf sched replay:回放perf sched record录制的执行过程
  • perf sched map:用字符表示打印上下文切换

执行sudo perf sched record ls后,通过不同方式查看结果

perf sched record sleep 10

perf sched latency --sort max

-----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  :14:14                |      0.323 ms |       10 | avg:    1.417 ms | max:    5.016 ms | max at: 2133786.303716 s
  :34:34                |      0.224 ms |       10 | avg:    1.043 ms | max:    5.013 ms | max at: 2133789.922667 s
  :9:9                  |     11.056 ms |     1653 | avg:    0.012 ms | max:    4.005 ms | max at: 2133795.230575 s
  :20834:20834          |      1.977 ms |       10 | avg:    0.737 ms | max:    3.854 ms | max at: 2133792.826450 s
  :44:44                |      0.182 ms |        4 | avg:    1.231 ms | max:    3.436 ms | max at: 2133788.327110 s
  :4603:4603            |     23.657 ms |       49 | avg:    0.073 ms | max:    3.362 ms | max at: 2133795.242342 s
  :24:24                |      0.375 ms |       15 | avg:    0.604 ms | max:    3.015 ms | max at: 2133795.072595 s
  :19:19                |      0.330 ms |       13 | avg:    0.517 ms | max:    2.981 ms | max at: 2133792.114617 s
  :28878:28878          |      0.112 ms |       10 | avg:    0.859 ms | max:    2.894 ms | max at: 2133793.591518 s
  :18203:18203          |      0.307 ms |       25 | avg:    0.264 ms | max:    2.846 ms | max at: 2133792.579451 s
  grep:(29)             |      0.000 ms |     1650 | avg:    0.015 ms | max:    2.683 ms | max at: 2133792.059059 s
  :4600:4600            |     24.257 ms |       47 | avg:    0.060 ms | max:    2.642 ms | max at: 2133795.230570 s
  :11084:11084          |     19.280 ms |     1575 | avg:    0.008 ms | max:    2.636 ms | max at: 2133792.059051 s
  :4310:4310            |      1.744 ms |        3 | avg:    1.049 ms | max:    2.562 ms | max at: 2133787.860902 s
  :4607:4607            |     13.711 ms |       26 | avg:    0.282 ms | max:    2.376 ms | max at: 2133795.244725 s
  :11112:11112          |      0.304 ms |        1 | avg:    2.362 ms | max:    2.362 ms | max at: 2133792.058750 s
  :4596:4596            |     12.092 ms |      248 | avg:    0.039 ms | max:    2.171 ms | max at: 2133795.238976 s
  :4267:4267            |      1.886 ms |        3 | avg:    0.726 ms | max:    2.163 ms | max at: 2133786.389122 s
  :20841:20841          |     11.626 ms |       15 | avg:    0.260 ms | max:    2.082 ms | max at: 2133792.815333 s
  :20834:20834          |      1.977 ms |       10 | avg:    0.737 ms | max:    3.854 ms | max at: 2133792.826450 s
  :44:44                |      0.182 ms |        4 | avg:    1.231 ms | max:    3.436 ms | max at: 2133788.327110 s
  :4603:4603            |     23.657 ms |       49 | avg:    0.073 ms | max:    3.362 ms | max at: 2133795.242342 s
  :24:24                |      0.375 ms |       15 | avg:    0.604 ms | max:    3.015 ms | max at: 2133795.072595 s
  :19:19                |      0.330 ms |       13 | avg:    0.517 ms | max:    2.981 ms | max at: 2133792.114617 s
  :28878:28878          |      0.112 ms |       10 | avg:    0.859 ms | max:    2.894 ms | max at: 2133793.591518 s
  :18203:18203          |      0.307 ms |       25 | avg:    0.264 ms | max:    2.846 ms | max at: 2133792.579451 s
  :4235:4235            |      1.522 ms |        3 | avg:    0.553 ms | max:    1.618 ms | max at: 2133786.185985 s
  :4493:4493            |      1.559 ms |        3 | avg:    0.543 ms | max:    1.610 ms | max at: 2133791.087969 s
  :20838:20838          |      0.694 ms |       90 | avg:    0.028 ms | max:    1.554 ms | max at: 2133792.814760 s
  :4606:4606            |     20.843 ms |       38 | avg:    0.131 ms | max:    1.532 ms | max at: 2133795.221587 s
  :21475:21475          |      1.136 ms |       33 | avg:    0.051 ms | max:    1.532 ms | max at: 2133792.814738 s
  :511:511              |     10.726 ms |       40 | avg:    0.262 ms | max:    1.521 ms | max at: 2133795.228573 s
  ipset:(28)            |      0.000 ms |      473 | avg:    0.035 ms | max:    1.482 ms | max at: 2133790.437174 s
  :7351:7351            |     37.613 ms |       43 | avg:    0.089 ms | max:    1.468 ms | max at: 2133790.594637 s
  :7274:7274            |     43.539 ms |       50 | avg:    0.035 ms | max:    1.468 ms | max at: 2133790.437176 s
  :23220:23220          |      0.205 ms |       14 | avg:    0.148 ms | max:    1.448 ms | max at: 2133790.570121 s
  :28112:28112          |     18.778 ms |      289 | avg:    0.015 ms | max:    1.417 ms | max at: 2133792.059010 s
  • Task:进程的名字和 pid
  • Runtime:实际运行时间
  • Switches:进程切换的次数
  • Average delay:平均的调度延迟
  • Maximum delay:最大延迟

perf sched map

perf sched record sleep 10
perf sched map

显示了所有的cpu和上下文事件

*.   .   .   .   C0  .        2133785.243718 secs 
       .  *G0  .   .   C0  .        2133785.244558 secs G0 => :16740:16740
  *H0  .   G0  .   .   C0  .        2133785.244561 secs H0 => :24283:24283
   H0  .   G0  .  *I0  C0  .        2133785.244572 secs I0 => :16047:16047
   H0  .  *.   .   I0  C0  .        2133785.244582 secs 
  *.   .   .   .   I0  C0  .        2133785.244585 secs 
   .   .   .   .   I0  C0 *J0       2133785.244594 secs J0 => :24282:24282
   .   .   .   .  *.   C0  J0       2133785.244594 secs 
   .   .   .   .   .   C0 *.        2133785.244615 secs 
   .   .   .   .   .  *.   .        2133785.244748 secs 
   .  *A0  .   .   .   .   .        2133785.246709 secs 
   .  *.   .   .   .   .   .        2133785.246713 secs 
   .  *A0  .   .   .   .   .        2133785.249709 secs 
   .  *.   .   .   .   .   .        2133785.249715 secs 
   .  *A0  .   .   .   .   .        2133785.252711 secs 
   .  *.   .   .   .   .   .        2133785.252715 secs 
   .   .   .   .  *K0  .   .        2133785.252788 secs K0 => :31257:31257
   .   .   .   .  *.   .   .        2133785.252800 secs 
   .   .   .   .   .  *L0  .        2133785.257075 secs L0 => :28112:28112
  *M0  .   .   .   .   L0  .        2133785.257107 secs M0 => :11084:11084
   M0  .   .   .   .  *.   .        2133785.257121 secs 
   M0 *N0  .   .   .   .   .        2133785.257121 secs N0 => :11137:11137
  *.   N0  .   .   .   .   .        2133785.257122 secs 
   .   N0  .   .   .   .   .  *O0   2133785.257132 secs O0 => :11121:11121
   .   N0  .   .   .   .   .  *.    2133785.257145 secs 
   .   N0 *P0  .   .   .   .   .    2133785.257158 secs P0 => :11081:11081
   .  *.   P0  .   .   .   .   .    2133785.257161 secs 
   .   .  *.   .   .   .   .   .    2133785.257167 secs 
  *M0  .   .   .   .   .   .   .    2133785.257199 secs 
  *.   .   .   .   .   .   .   .    2133785.257203 secs 
   .   .   .   .  *Q0  .   .   .    2133785.270713 secs Q0 => :4953:4953
   .   .   .   .  *.   .   .   .    2133785.270729 secs 
  *R0  .   .   .   .   .   .   .    2133785.278385 secs R0 => :19879:19879
   R0 *S0  .   .   .   .   .   .    2133785.278427 secs S0 => :10538:10538
   R0  S0 *T0  .   .   .   .   .    2133785.278430 secs T0 => :10550:10550
  *.   S0  T0  .   .   .   .   .    2133785.278443 secs 
   .  *.   T0  .   .   .   .   .    2133785.278444 secs 
   .   .  *.   .   .   .   .   .    2133785.278444 secs 
   .   .   .   .   .   .  *U0  .    2133785.278445 secs U0 => :13461:13461
   .   .   .   .   .   .  *.   .    2133785.278460 secs 
   .  *S0  .   .   .   .   .   .    2133785.278526 secs 
   .  *.   .   .   .   .   .   .    2133785.278533 secs 
   .   .   .   .  *V0  .   .   .    2133785.279276 secs V0 => :11903:11903
  *M0  .   .   .   V0  .   .   .    2133785.279299 secs 
  *.   .   .   .   V0  .   .   .    2133785.279308 secs 
   .   .   .   .  *.   .   .   .    2133785.279317 secs

每个CPU一列,星号表示调度事件正在发生的CPU,点号表示CPU正在IDLE

Map 的好处在于提供了一个的总的视图,将成百上千的调度事件进行总结,显示了系统任务在 CPU 之间的分布,假如有不好的调度迁移,比如一个任务没有被及时迁移到 idle 的 CPU 却被迁移到其他忙碌的 CPU,类似这种调度器的问题可以从 map 的报告中一眼看出来

perf sched script

perf sched script能看到更详细的sched信息,包括sched_wakeup/sched_switch等等。每一列的含义依次是:进程名/pid/CPU ID/时间戳

perf sched record sleep 10
perf sched script

显示调度相关的事件

swapper     0 [006] 2133785.242711:       sched:sched_switch: swapper/6:0 [120] R ==> rcu_sched:9 [120]
         swapper     0 [006] 2133785.242716:       sched:sched_switch: rcu_sched:9 [120] S ==> swapper/6:0 [120]
            perf 14836 [001] 2133785.243035:       sched:sched_wakeup: node:461 [120] success=1 CPU:002
         swapper     0 [002] 2133785.243039:       sched:sched_switch: swapper/2:0 [120] R ==> node:461 [120]
            perf 14836 [001] 2133785.243044:       sched:sched_wakeup: perf:4209 [120] success=1 CPU:003
         swapper     0 [003] 2133785.243052:       sched:sched_switch: swapper/3:0 [120] R ==> perf:4209 [120]
            node 16488 [002] 2133785.243056:       sched:sched_wakeup: node:460 [120] success=1 CPU:004
         swapper     0 [004] 2133785.243060:       sched:sched_switch: swapper/4:0 [120] R ==> node:460 [120]
            node 16488 [002] 2133785.243062: sched:sched_stat_runtime: comm=node pid=461 runtime=31449 [ns] vruntime=1973914749726 [ns]
            node 16488 [002] 2133785.243064:       sched:sched_switch: node:461 [120] S ==> swapper/2:0 [120]
            node 16487 [004] 2133785.243064: sched:sched_stat_runtime: comm=node pid=460 runtime=12587 [ns] vruntime=441872725215 [ns]
            node 16487 [004] 2133785.243066:       sched:sched_switch: node:460 [120] S ==> swapper/4:0 [120]
            perf 14836 [001] 2133785.243069:       sched:sched_wakeup: node:461 [120] success=1 CPU:002
         swapper     0 [002] 2133785.243072:       sched:sched_switch: swapper/2:0 [120] R ==> node:461 [120]
            node 16488 [002] 2133785.243087:       sched:sched_wakeup: node:460 [120] success=1 CPU:004
         swapper     0 [004] 2133785.243090:       sched:sched_switch: swapper/4:0 [120] R ==> node:460 [120]
            node 16487 [004] 2133785.243092: sched:sched_stat_runtime: comm=node pid=460 runtime=8170 [ns] vruntime=441872733385 [ns]
            node 16487 [004] 2133785.243093:       sched:sched_switch: node:460 [120] S ==> swapper/4:0 [120]
            node 16488 [002] 2133785.243098:       sched:sched_wakeup: node:460 [120] success=1 CPU:004

可以查看schedule的trace

perf sched timehist

该命令可以获得task的wait time, 特别地, 还能拿到sch delay。timehist统计的sch delay是通过sched_switch和sched_wakeup计算出来的, 而不是stat_wait。

root: /5g_build/5g_Main/$ perf sched record sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.498 MB perf.data (9250 samples) ]
root: /5g_build/5g_Main/$ perf sched timehist
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
 2155009.792729 [0005]  <idle>                              0.000      0.000      0.000 
 2155009.792733 [0005]  swapper                             0.000      0.000      0.003 
 2155009.793148 [0004]  <idle>                              0.000      0.000      0.000 
 2155009.793162 [0005]  <idle>                              0.003      0.000      0.429 
 2155009.793169 [0004]  node[16488/16476]                   0.000      0.000      0.020 
 2155009.793170 [0006]  <idle>                              0.000      0.000      0.000 
 2155009.793184 [0006]  node[16487/16476]                   0.000      0.000      0.013 
 2155009.793191 [0004]  <idle>                              0.020      0.000      0.022 
 2155009.793207 [0004]  node[16488/16476]                   0.022      0.000      0.016 
 2155009.793208 [0006]  <idle>                              0.013      0.000      0.023 
 2155009.793215 [0006]  node[16487/16476]                   0.023      0.000      0.007 
 2155009.793216 [0004]  <idle>                              0.016      0.000      0.008 
 2155009.793232 [0004]  node[16488/16476]                   0.008      0.000      0.015 
 2155009.793232 [0006]  <idle>                              0.007      0.000      0.017 
 2155009.793239 [0006]  node[16487/16476]                   0.017      0.000      0.006 
 2155009.793243 [0004]  <idle>                              0.015      0.000      0.011 
 2155009.793258 [0004]  node[16488/16476]                   0.011      0.000      0.015 
 2155009.793259 [0006]  <idle>                              0.006      0.000      0.020 
 2155009.793265 [0006]  node[16487/16476]                   0.020      0.000      0.006 
 2155009.793267 [0004]  <idle>                              0.015      0.000      0.008 
 2155009.793283 [0004]  node[16488/16476]                   0.008      0.000      0.015 
 2155009.793283 [0006]  <idle>                              0.006      0.000      0.018 
 2155009.793290 [0006]  node[16487/16476]                   0.018      0.000      0.006 
 2155009.793302 [0004]  <idle>                              0.015      0.000      0.019 
 2155009.793315 [0005]  perf[32192]                         0.000      0.000      0.152 
 2155009.793318 [0006]  <idle>                              0.006      0.000      0.028 
 2155009.793325 [0006]  node[16487/16476]                   0.028      0.000      0.006 
 2155009.793327 [0005]  swapper                             0.582      0.000      0.011 
 2155009.793334 [0007]  <idle>                              0.000      0.000      0.000 
 2155009.793336 [0006]  <idle>                              0.006      0.000      0.011 
 2155009.793342 [0006]  node[16487/16476]                   0.011      0.000      0.006 
 2155009.793351 [0006]  <idle>                              0.006      0.000      0.008 
 2155009.793358 [0006]  node[16487/16476]                   0.008      0.000      0.006 
 2155009.793366 [0006]  <idle>                              0.006      0.000      0.008 
 2155009.793373 [0006]  node[16487/16476]                   0.008      0.000      0.006

perf lock·

锁是内核同步的方法,一旦加了锁,其他准备加锁的内核执行路径就必须等待,降低了并行。因此对于锁进行专门分析应该是调优的一项重要工作

  • “Name”: 锁的名字,比如 md->map_lock,即定义在 dm.c 结构 mapped_device 中的读写锁
  • “acquired”: 该锁被直接获得的次数,即没有其他内核路径拥有该锁的情况下得到该锁的次数
  • “contended”冲突的次数,即在准备获得该锁的时候已经被其他人所拥有的情况的出现次数
  • “total wait”:为了获得该锁,总共的等待时间
  • “max wait”:为了获得该锁,最大的等待时间
  • “min wait”:为了获得该锁,最小的等待时间

perf kmem

Perf Kmem 专门收集内核 slab 分配器的相关事件。比如内存分配,释放等。可以用来研究程序在哪里分配了大量内存,或者在什么地方产生碎片之类的和内存管理相关的问题

  • perf kmem record: 抓取命令的内核slab分配器事件
  • perf kmem stat: 生成内核slab分配器统计信息
Usage: perf kmem [<options>] {record|stat}

    -f, --force           don't complain, do it
    -i, --input <file>    input file name
    -l, --line <num>      show n lines
    -s, --sort <key[,key2...]>
                          sort by keys: ptr, callsite, bytes, hit, pingpong, frag, page, order, migtype, gfp
    -v, --verbose         be more verbose (show symbol address, etc)
        --alloc           show per-allocation statistics
        --caller          show per-callsite statistics
        --live            Show live page stat
        --page            Analyze page allocator
        --raw-ip          show raw ip instead of symbol
        --slab            Analyze slab allocator
        --time <str>      Time span of interest (start,stop)

首先用record命令记录kmem events

root: /5g_build/5g_Main/$ perf kmem record --force sleep 1
[ perf record: Woken up 0 times to write data ]
Warning:
186 out of order events recorded.
[ perf record: Captured and wrote 144.950 MB perf.data (1574498 samples) ]

再用perf kmem stat --caller命令查看: (--caller 显示每个调用点统计信息,--alloc 显示每次内存分配事件)

Warning:
186 out of order events recorded.
---------------------------------------------------------------------------------------------------------
 Callsite                           | Total_alloc/Per | Total_req/Per   | Hit      | Ping-pong | Frag
---------------------------------------------------------------------------------------------------------
 cgroup_pidlist_open+8b             |         8/8     |         4/4     |        1 |         0 | 50.000%
 audit_log_d_path+46                |     16384/8192  |      8214/4107  |        2 |         0 | 49.866%
 sk_prot_alloc+c3                   |   5310464/2048  |   2758952/1064  |     2593 |         5 | 48.047%
 pskb_expand_head+6a                |   5230592/1019  |   2948288/574   |     5132 |        15 | 43.634%
 show_uevent+7b                     |     53248/4096  |     30056/2312  |       13 |         0 | 43.555%
 alloc_pipe_info+a7                 |    475136/1024  |    296960/640   |      464 |       221 | 37.500%
 proc_reg_open+32                   |      1984/64    |      1240/40    |       31 |         0 | 37.500%
 alloc_cpumask_var_node+1f          |      4096/1024  |      2560/640   |        4 |         0 | 37.500%
 get_mountpoint+2f                  |       128/64    |        80/40    |        2 |         0 | 37.500%
 key_alloc+12d                      |         8/8     |         5/5     |        1 |         0 | 37.500%
 assoc_array_insert+56              |       512/512   |       328/328   |        1 |         0 | 35.938%
 cgroup_alloc_name.isra.22+1f       |        96/48    |        67/33    |        2 |         0 | 30.208%
 seq_open+106                       |    194880/192   |    138040/136   |     1015 |         0 | 29.167%
 hash_ipport_create+cd              |      1344/192   |       952/136   |        7 |         0 | 29.167%
 sys_name_to_handle_at+df           |       192/192   |       136/136   |        1 |         0 | 29.167%
 __alloc_skb+8d                     |  76957184/2568  |  54961280/1834  |    29962 |       245 | 28.582%
 ip_vs_add_dest+232                 |      7168/512   |      5264/376   |       14 |         0 | 26.562%
 alloc_pipe_info+46                 |     89088/192   |     66816/144   |      464 |       221 | 25.000%
 alloc_fdtable+57                   |     14528/64    |     10896/48    |      227 |       181 | 25.000%
 allocate_cg_links+55               |      5824/64    |      4368/48    |       91 |         0 | 25.000%
 proc_self_follow_link+66           |       384/16    |       288/12    |       24 |         0 | 25.000%
 kernfs_fop_open+1c8                |      1152/64    |       864/48    |       18 |         0 | 25.000%
 bitmap_port_create+a9              |       768/192   |       576/144   |        4 |         0 | 25.000%
 kstrdup_const+23                   |        32/8     |        24/6     |        4 |         0 | 25.000%
 ext4_readdir+716                   |       192/64    |       144/48    |        3 |         0 | 25.000%
 ___sys_sendmsg+21e                 |       384/192   |       288/144   |        2 |         0 | 25.000%
 hash_ipportip_create+cd            |       384/192   |       288/144   |        2 |         0 | 25.000%
 scm_fp_dup+28                      |        64/32    |        48/24    |        2 |         2 | 25.000%
 strndup_user+4b                    |        16/8     |        12/6     |        2 |         0 | 25.000%
 _autofs_dev_ioctl+c0               |        96/48    |        73/36    |        2 |         0 | 23.958%
 load_elf_interp.constprop.10+84    |    120320/512   |     92120/392   |      235 |         0 | 23.438%
 sget_userns+c2                     |      2048/2048  |      1600/1600  |        1 |         0 | 21.875%
 hash_ipportnet_create+d2           |       512/512   |       400/400   |        1 |         0 | 21.875%
 groups_alloc+34                    |       192/192   |       152/152