A survey of scheduler benchmarks [LWN.net]
https://lwn.net/Articles/809545/ Scheduling for the Android display pipeline: different processes use different scheduling policies
https://concordia-h2020.eu/eurosec-2021/PresentationEurosec2021/eurosec21-meltdown.pdf
https://lwn.net/Articles/818388/ Controlling realtime priorities in kernel threads
LWN: Controlling the realtime scheduling priorities kernel threads may use! (Linux News translation, CSDN blog)
Linux process management (9): analysis of the realtime scheduling classes, with a FIFO vs. RR comparison experiment
Using schedstat
https://lkml.org/lkml/2021/9/5/82 (the patch was merged)
Userspace Code Scope Profiler

{
    user_func_abc();    <---- uprobe_scope_begin(): get start schedstats
    ...
    user_func_xyz();    <---- uprobe_scope_end(): get end schedstats
}

Then, with the result of (end - begin), we can get the latency details below for a specific user scope.
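A minimal user-space sketch of the same idea, without uprobes: read /proc/&lt;pid&gt;/schedstat before and after a code scope and diff the values. The field layout (on-CPU time, runqueue wait time, number of timeslices) requires a kernel with schedstats enabled; the helper names here are ours, not from the patch.

```python
def parse_schedstat(line):
    """Parse a /proc/<pid>/schedstat line into (run_ns, wait_ns, nr_slices)."""
    run_ns, wait_ns, nr_slices = (int(x) for x in line.split())
    return run_ns, wait_ns, nr_slices

def read_schedstat(pid="self"):
    """Read the current schedstat counters for a process."""
    with open("/proc/%s/schedstat" % pid) as f:
        return parse_schedstat(f.read())

class ScopeSchedstats:
    """Context manager: schedstat delta (end - begin) over a user code scope."""
    def __enter__(self):
        self.begin = read_schedstat()
        return self
    def __exit__(self, *exc):
        end = read_schedstat()
        self.delta = tuple(e - b for e, b in zip(end, self.begin))
        return False
```

Usage: wrap the scope in `with ScopeSchedstats() as s:`; afterwards s.delta holds the scope's on-CPU time, runqueue latency (both in nanoseconds), and timeslice count.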
CFS bandwidth control
Starting from the observed symptoms, find a reproducer and the corresponding commit.
CPU Throttling - Unthrottled: Fixing CPU Limits in the Cloud. This is about limiting CPU shares; the patch was merged by Alibaba in 2018, but in practice it introduced a new problem.
The problem appeared shortly after 4.18 was released in 2018.
Symptoms:
1) Web applications saw increased tail response times.
2) CPU utilization, i.e. the output of top, looked fine.
Further analysis showed a direct correlation between high response times and heavy CPU throttling.
Heavy throttling should not occur at normal CPU utilization. Almost all container orchestrators rely on cgroups to manage resource limits. When a hard CPU limit is set in a container orchestrator,
the kernel uses CFS bandwidth control to enforce that limit.
CPU allocation is managed with two settings: quota and period.
Under the directory /sys/fs/cgroup/cpu,cpuacct/<container>,
the quota and period settings are in cpu.cfs_quota_us and cpu.cfs_period_us.
nr_periods – the number of enforcement periods in which any thread in the cgroup was runnable. Note this is a counter, not a limit: it does not mean the threads stop running once this many periods have elapsed.
nr_throttled – the number of runnable periods in which the application used its entire quota and was throttled. In other words, the number of times this cgroup was throttled. Throttling happens because the configured quota was exhausted within the period, not because the process has low priority.
throttled_time – the total amount of time, summed over the individual threads in the cgroup, that they were throttled.
During the investigation, a colleague noticed that the slow-responding processes had a high nr_throttled; that is, they were being throttled frequently.
nr_throttled / nr_periods gives a percentage, the throttled percentage, used to measure the problem. The throttled_time field is not used.
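The throttled percentage described above can be computed directly from the cpu.stat counters. A small sketch (the sample counter values are made up for illustration):

```python
def throttled_percentage(nr_periods, nr_throttled):
    """nr_throttled / nr_periods as a percentage; 0 if no periods elapsed."""
    if nr_periods == 0:
        return 0.0
    return 100.0 * nr_throttled / nr_periods

def parse_cpu_stat(text):
    """Parse cgroup cpu.stat contents ('key value' lines) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

# Illustrative cpu.stat contents; read the real file from
# /sys/fs/cgroup/cpu,cpuacct/<container>/cpu.stat in practice.
sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 123456789\n"
s = parse_cpu_stat(sample)
print(throttled_percentage(s["nr_periods"], s["nr_throttled"]))  # 25.0
```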
Now, say we assign a CPU limit of .4 CPU to the application.
This means the application gets 40ms of run time for every 100ms period—even if the CPU has no other work to do
Where is this set, and with what? With the two files above: cpu.cfs_quota_us = 40000 and cpu.cfs_period_us = 100000 give .4 CPU.
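The arithmetic behind the .4 CPU example can be sketched as follows (writing the result to a real cgroup requires root and a cgroup v1 mount; the helper name is ours):

```python
PERIOD_US = 100_000  # cpu.cfs_period_us default: 100ms per period

def quota_for_limit(cpu_limit, period_us=PERIOD_US):
    """Quota in microseconds for a fractional CPU limit, e.g. 0.4 CPU.

    .4 CPU with a 100ms period means 40ms of runtime per period.
    """
    return int(cpu_limit * period_us)

# 40000 -> echo 40000 > /sys/fs/cgroup/cpu,cpuacct/<container>/cpu.cfs_quota_us
print(quota_for_limit(0.4))  # 40000
```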
Reproducing the problem:
Use an asynchronous worker-thread programming model in which each thread does only a small amount of work.
Design: fast threads run a Fibonacci computation for many iterations; slow threads compute 100 iterations and then sleep for 10ms.
To the scheduler, these slow threads act much like asynchronous worker threads, in that they do a small amount of work and then block.
The test case was built to reproduce a high throttling rate together with low CPU utilization.
What throttling configuration does the test case use, i.e. what are quota and period set to?
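A minimal user-space sketch of the reproducer described above. The thread counts, iteration counts, and fib helper are our assumptions, not the original test case (whose quota/period settings the notes do not record); in the real test these threads would run inside the throttled cgroup.

```python
import threading
import time

def fib(n):
    """Iterative Fibonacci, used here purely as CPU work."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def slow_worker(stop):
    # Do a little work, then block: the async-worker pattern the scheduler sees.
    while not stop.is_set():
        for _ in range(100):   # "100 iterations" from the design above
            fib(20)
        time.sleep(0.010)      # then sleep 10ms

def fast_worker(stop):
    # Pure CPU burn for comparison.
    while not stop.is_set():
        fib(20)

stop = threading.Event()
threads = [threading.Thread(target=slow_worker, args=(stop,)),
           threading.Thread(target=fast_worker, args=(stop,))]
for t in threads:
    t.start()
time.sleep(0.1)  # run briefly; the real test runs much longer inside the cgroup
stop.set()
for t in threads:
    t.join()
```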
Fixing the throttling bug
CPU Throttling - Unthrottled: How a Valid Fix Becomes a Regression
If you are unfamiliar with the scheduling process, reading the kernel documentation might lead you to believe that the kernel tracks the amount of time used. Instead, it tracks the amount of time still available. Here is how it works:
for the time allocated to a process, the kernel tracks unused rather than used time. Why? The slice mechanism below explains it.
-----USED--------|---------UNUSED--------------
The kernel scheduler uses a global quota bucket located in cfs_bandwidth->quota. It allocates slices of this quota to each core (cfs_rq->runtime_remaining) on an as-needed basis. This slice amount defaults to 5ms, but you can tune it via the kernel.sched_cfs_bandwidth_slice_us sysctl tunable.
If all threads in a cgroup stop being runnable on a particular CPU, such as blocking on IO, the kernel returns all but 1ms of this slack quota to the global bucket. The kernel leaves 1ms behind, because this decreases global bucket lock contention for many high performance computing applications. At the end of the period, the scheduler expires any remaining core-local time slice and refills the global quota bucket.
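The slice allocation and 1ms slack return can be modeled with a toy simulation. The numbers follow the text (5ms slices, 1ms left behind); the class and its methods are our own illustration, not kernel code.

```python
class ToyBandwidth:
    """Toy model of cfs_bandwidth->quota and per-core runtime_remaining."""
    SLICE_MS = 5       # kernel.sched_cfs_bandwidth_slice_us default (5ms)
    SLACK_KEEP_MS = 1  # left on the core to reduce global-bucket lock contention

    def __init__(self, quota_ms):
        self.global_quota = quota_ms
        self.local = {}  # per-core runtime_remaining

    def allocate(self, core):
        """Hand one slice from the global bucket to a core, as needed."""
        take = min(self.SLICE_MS, self.global_quota)
        self.global_quota -= take
        self.local[core] = self.local.get(core, 0) + take

    def block(self, core):
        """All threads stopped being runnable: return all but 1ms of slack."""
        returned = max(self.local.get(core, 0) - self.SLACK_KEEP_MS, 0)
        self.global_quota += returned
        self.local[core] -= returned

bw = ToyBandwidth(quota_ms=20)  # .2 CPU with a 100ms period, as in the example
bw.allocate(core=0)             # global 15, core0 5
bw.local[0] -= 2                # worker runs for 2ms of its slice
bw.block(core=0)                # global 17, core0 1 (slack left behind)
print(bw.global_quota, bw.local[0])  # 17 1
```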
To clarify, here’s an example of a multi-threaded daemon with two worker threads, each pinned to their own core. The top graph shows the cgroup’s global quota over time. This starts with 20ms of quota, which correlates to .2 CPU. The middle graph shows the quota assigned to per-CPU queues, and the bottom graph shows when the workers were actually running on their CPU.
[Timeline table: Time / Action rows at 10ms and 17ms; the Action column did not survive extraction.]
At 10ms, worker 1 starts running: a slice of quota is transferred from the global bucket (the cgroup's total quota) to the per-CPU queue, and worker 1 uses this 5ms slice to handle the user request. Why 5ms? That is the default value of kernel.sched_cfs_bandwidth_slice_us, as noted above.
The chance that worker 1 takes precisely 5ms to respond to a request is incredibly unrealistic. What happens if the request requires some other amount of processing time?
[Timeline table: Time / Action rows at 30ms, 38ms, 41ms, and 49ms; the Action column did not survive extraction.]
While 1ms might not have much impact on a two-core machine, those milliseconds add up on high-core count machines. If we hit this behavior on an 88 core (n) machine, we could potentially strand 87 (n-1) milliseconds per period. That’s 87ms or 870 millicores or .87 CPU that could potentially be unusable. That’s how we hit low-quota usage with excessive throttling. Aha!
As the core count grows, so does the idle, unreturned quota (1ms per core): on an 88-core machine, up to 87ms (.87 CPU) per period can be stranded. That is how low CPU utilization can coincide with heavy throttling.
note: If an application only has 100ms of quota (1 CPU), and the kernel uses 5ms slices, the application can only use 20 cores before running out of quota (100 ms / 5 ms slice = 20 slices). Any threads scheduled on the other 68 cores in an 88-core behemoth are then throttled and must wait for slack time to be returned to the global bucket before running.
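The arithmetic in the note above can be checked directly, using the numbers from the text (100ms of quota, 5ms slices, 88 cores, 1ms of slack stranded per core):

```python
QUOTA_MS = 100  # 100ms of quota per period = 1 CPU
SLICE_MS = 5    # default slice size
CORES = 88

# Each core needs at least one slice, so quota runs out after this many cores.
usable_cores = QUOTA_MS // SLICE_MS        # 100 / 5 = 20 slices -> 20 cores
# Each of the other (n - 1) cores can strand 1ms of slack per period.
stranded_ms = (CORES - 1) * 1              # 87ms per period
stranded_cpu = stranded_ms / 100           # .87 CPU potentially unusable

print(usable_cores, stranded_ms, stranded_cpu)  # 20 87 0.87
```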