[Repost] The Cost of a Context Switch


Main text

https://plantegg.github.io/2022/06/05/%E4%B8%8A%E4%B8%8B%E6%96%87%E5%88%87%E6%8D%A2%E5%BC%80%E9%94%80/

 

Concepts

Process switches, soft interrupts (softirq), kernel/user mode switches, and CPU hyper-threading switches.

Kernel/user mode switch: execution stays within the same thread; entering the kernel from user mode simply requires extra instructions for safety and bookkeeping. For what a system call actually does on entry, see: https://github.com/torvalds/linux/blob/v5.2/arch/x86/entry/entry_64.S#L145

Soft interrupt (softirq): for example, when a network packet arrives it triggers the ksoftirqd thread (one per core) to process it; this is a form of process switch.

Process switch: the heaviest of the four. Besides the context switch itself, the cost also includes blocking, wakeup and scheduling. A process switch can be voluntary (the process yields the CPU) or involuntary (its time slice runs out).
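
Linux actually tracks this voluntary/involuntary split per task (visible in /proc/<pid>/status as voluntary_ctxt_switches and nonvoluntary_ctxt_switches). A minimal sketch, not part of the original post, that reads the counters for the current process via getrusage:

/* csw_count.c - print the calling process's voluntary / involuntary
 * context switch counters (same data as /proc/self/status).
 * Build: gcc -O2 csw_count.c -o csw_count
 */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }
    printf("voluntary ctxsw:   %ld\n", ru.ru_nvcsw);   /* blocked / yielded the CPU */
    printf("involuntary ctxsw: %ld\n", ru.ru_nivcsw);  /* preempted, e.g. time slice used up */
    return 0;
}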

CPU hyper-threading switch: the lightest; it happens inside the CPU and neither the OS nor the application can perceive it.

Flame graph of hotspots under multi-threaded scheduling (image not included in this repost).

On top of the context switch itself, scheduling delays can stall the thread for even longer.

The Linux kernel's scheduling time slice is roughly the inverse of HZ. HZ is typically set to 1000 at build time, so the inverse is 1 ms, i.e. each process gets a time slice of about 1 ms (in earlier years it was 10 ms, when HZ was 100). Suppose process 1 blocks, gives up the CPU and re-enters the run queue while two other processes 2 and 3 are already queued ahead of it; in the worst case process 1 will not run again until 2 ms later. Load determines how long the run queue is. If the process is ready when its turn comes, nothing is wasted; if it gets scheduled but is not ready (say the network reply has not arrived yet), that scheduling slot is essentially wasted.

sched_min_granularity_ns is the most prominent setting. In the original sched-design-CFS.txt this was described as the only “tunable” setting, “to tune the scheduler from ‘desktop’ (low latencies) to ‘server’ (good batching) workloads.”

In other words, we can change this setting to reduce overheads from context-switching, and therefore improve throughput at the cost of responsiveness (“latency”).

The CFS setting mimics the previous build-time setting, CONFIG_HZ. In the first version of the CFS code, the default value was 1 ms, equivalent to 1000 Hz for "desktop" usage. Other supported values of CONFIG_HZ were 250 Hz (the default) and 100 Hz for the "server" end. 100 Hz was also useful when running Linux on very slow CPUs; this was one of the reasons given when CONFIG_HZ was first added as a build setting on x86.
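
As a rough cross-check of the effective tick (1/CONFIG_HZ) on a running kernel, and purely as an aside that is not in the original post: the coarse POSIX clocks are only updated once per tick, so their reported resolution approximates the tick length.

#define _GNU_SOURCE
/* tick_check.c - estimate the kernel tick via the resolution of a coarse clock.
 * Build: gcc -O2 tick_check.c -o tick_check
 */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC_COARSE, &res) != 0 || res.tv_nsec == 0) {
        perror("clock_getres");
        return 1;
    }
    /* e.g. 4000000 ns -> CONFIG_HZ=250, 1000000 ns -> CONFIG_HZ=1000 */
    printf("coarse clock resolution: %ld ns (~HZ %ld)\n",
           res.tv_nsec, 1000000000L / res.tv_nsec);
    return 0;
}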

Or adjust the scheduler parameters:

#sysctl -a |grep -i sched_ |grep -v cpu
kernel.sched_autogroup_enabled = 0
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_cfs_bw_burst_enabled = 1
kernel.sched_cfs_bw_burst_onset_percent = 0
kernel.sched_child_runs_first = 0
kernel.sched_latency_ns = 24000000
kernel.sched_migration_cost_ns = 500000
kernel.sched_min_granularity_ns = 3000000
kernel.sched_nr_migrate = 32
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 1
kernel.sched_tunable_scaling = 1
kernel.sched_wakeup_granularity_ns = 4000000

Tests

How long does it take to make a context switch?

model name : Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
2 physical CPUs, 26 cores/CPU, 2 hardware threads/core = 104 hw threads total
-- No CPU affinity --
10000000 system calls in 1144720626ns (114.5ns/syscall)
2000000 process context switches in 6280519812ns (3140.3ns/ctxsw)
2000000 thread context switches in 6417846724ns (3208.9ns/ctxsw)
2000000 thread context switches in 147035970ns (73.5ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 1109675081ns (111.0ns/syscall)
2000000 process context switches in 4204573541ns (2102.3ns/ctxsw)
2000000 thread context switches in 2740739815ns (1370.4ns/ctxsw)
2000000 thread context switches in 474815006ns (237.4ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 1039827099ns (104.0ns/syscall)
2000000 process context switches in 5622932975ns (2811.5ns/ctxsw)
2000000 thread context switches in 5697704164ns (2848.9ns/ctxsw)
2000000 thread context switches in 143474146ns (71.7ns/ctxsw)
----------
model name : Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
2 physical CPUs, 16 cores/CPU, 2 hardware threads/core = 64 hw threads total
-- No CPU affinity --
10000000 system calls in 772827735ns (77.3ns/syscall)
2000000 process context switches in 4009838007ns (2004.9ns/ctxsw)
2000000 thread context switches in 5234823470ns (2617.4ns/ctxsw)
2000000 thread context switches in 193276269ns (96.6ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 746578449ns (74.7ns/syscall)
2000000 process context switches in 3598569493ns (1799.3ns/ctxsw)
2000000 thread context switches in 2475733882ns (1237.9ns/ctxsw)
2000000 thread context switches in 381484302ns (190.7ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 746674401ns (74.7ns/syscall)
2000000 process context switches in 4129856807ns (2064.9ns/ctxsw)
2000000 thread context switches in 4226458450ns (2113.2ns/ctxsw)
2000000 thread context switches in 193047255ns (96.5ns/ctxsw)
---------
model name : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
2 physical CPUs, 24 cores/CPU, 2 hardware threads/core = 96 hw threads total
-- No CPU affinity --
10000000 system calls in 765013680ns (76.5ns/syscall)
2000000 process context switches in 5906908170ns (2953.5ns/ctxsw)
2000000 thread context switches in 6741875538ns (3370.9ns/ctxsw)
2000000 thread context switches in 173271254ns (86.6ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 764139687ns (76.4ns/syscall)
2000000 process context switches in 4040915457ns (2020.5ns/ctxsw)
2000000 thread context switches in 2327904634ns (1164.0ns/ctxsw)
2000000 thread context switches in 378847082ns (189.4ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 762375921ns (76.2ns/syscall)
2000000 process context switches in 5827318932ns (2913.7ns/ctxsw)
2000000 thread context switches in 6360562477ns (3180.3ns/ctxsw)
2000000 thread context switches in 173019064ns (86.5ns/ctxsw)
--------ECS
model name : Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
1 physical CPUs, 2 cores/CPU, 2 hardware threads/core = 4 hw threads total
-- No CPU affinity --
10000000 system calls in 561242906ns (56.1ns/syscall)
2000000 process context switches in 3025706345ns (1512.9ns/ctxsw)
2000000 thread context switches in 3333843503ns (1666.9ns/ctxsw)
2000000 thread context switches in 145410372ns (72.7ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 586742944ns (58.7ns/syscall)
2000000 process context switches in 2369203084ns (1184.6ns/ctxsw)
2000000 thread context switches in 1929627973ns (964.8ns/ctxsw)
2000000 thread context switches in 335827569ns (167.9ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 630259940ns (63.0ns/syscall)
2000000 process context switches in 3027444795ns (1513.7ns/ctxsw)
2000000 thread context switches in 3172677638ns (1586.3ns/ctxsw)
2000000 thread context switches in 144168251ns (72.1ns/ctxsw)
---------Kunpeng 920
2 physical CPUs, 96 cores/CPU, 1 hardware threads/core = 192 hw threads total
-- No CPU affinity --
10000000 system calls in 1216730780ns (121.7ns/syscall)
2000000 process context switches in 4653366132ns (2326.7ns/ctxsw)
2000000 thread context switches in 4689966324ns (2345.0ns/ctxsw)
2000000 thread context switches in 167871167ns (83.9ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 1220106854ns (122.0ns/syscall)
2000000 process context switches in 3420506934ns (1710.3ns/ctxsw)
2000000 thread context switches in 2962106029ns (1481.1ns/ctxsw)
2000000 thread context switches in 543325133ns (271.7ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 1216466158ns (121.6ns/syscall)
2000000 process context switches in 2797948549ns (1399.0ns/ctxsw)
2000000 thread context switches in 3119316050ns (1559.7ns/ctxsw)
2000000 thread context switches in 167728516ns (83.9ns/ctxsw)

Benchmark code repository: https://github.com/tsuna/contextswitch

Source code: timectxsw.c Results:

  • Intel 5150: ~4300ns/context switch
  • Intel E5440: ~3600ns/context switch
  • Intel E5520: ~4500ns/context switch
  • Intel X5550: ~3000ns/context switch
  • Intel L5630: ~3000ns/context switch
  • Intel E5-2620: ~3000ns/context switch

With CPU pinning (taskset), context switching is roughly 45% to 66% faster.
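
The pinning that taskset does can also be requested from inside the benchmark process itself; a minimal sketch (an illustration, not part of the original test code) using sched_setaffinity:

#define _GNU_SOURCE
/* pin_cpu.c - pin the calling process to CPU 0 before benchmarking,
 * roughly what `taskset -c 0 ./bench` does from the shell.
 * Build: gcc -O2 pin_cpu.c -o pin_cpu
 */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                       /* restrict to CPU 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU 0 (pid %d)\n", (int)getpid());
    /* ... run the context switch ping-pong here ... */
    return 0;
}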

System call cost

Source code: timesyscall.c Results:

  • Intel 5150: 105ns/syscall
  • Intel E5440: 87ns/syscall
  • Intel E5520: 58ns/syscall
  • Intel X5550: 52ns/syscall
  • Intel L5630: 58ns/syscall
  • Intel E5-2620: 67ns/syscall
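
For context, a minimal sketch of this kind of measurement, assuming we time a cheap system call such as getpid() in a tight loop (the actual timesyscall.c in the repository may differ):

#define _GNU_SOURCE
/* syscall_cost.c - rough per-syscall cost by timing a cheap syscall in a loop.
 * Build: gcc -O2 syscall_cost.c -o syscall_cost
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define N 10000000L

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        syscall(SYS_getpid);                /* force a real syscall (glibc may cache getpid()) */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
    printf("%ld system calls in %ldns (%.1fns/syscall)\n", N, ns, (double)ns / N);
    return 0;
}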

https://mp.weixin.qq.com/s/uq5s5vwk5vtPOZ30sfNsOg How much overhead does a process/thread switch really incur?

/*
 * Create two processes and pass a token back and forth between them through
 * two pipes. One process blocks while reading the token; the other, after
 * sending it, blocks while waiting for it to come back. Repeat a fixed number
 * of times and derive the average cost of a single switch from the elapsed time.
 * Code adapted from: https://www.jianshu.com/p/be3250786a91
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>                 /* pipe(), fork(), read(), write() */

int main(void)
{
    int x, i, fd[2], p[2];
    char send = 's';
    char receive;
    pipe(fd);
    pipe(p);
    struct timeval tv;
    struct sched_param param;
    /* SCHED_FIFO needs a priority >= 1 and root privileges; the original
     * code passed 0, which makes sched_setscheduler() fail silently */
    param.sched_priority = sched_get_priority_min(SCHED_FIFO);

    while ((x = fork()) == -1)
        ;
    if (x == 0) {                   /* child: wait for the token, send it back */
        sched_setscheduler(getpid(), SCHED_FIFO, &param);
        gettimeofday(&tv, NULL);
        printf("Before Context Switch Time %ld s, %ld us\n", (long)tv.tv_sec, (long)tv.tv_usec);
        for (i = 0; i < 10000; i++) {
            read(fd[0], &receive, 1);
            write(p[1], &send, 1);
        }
        exit(0);
    } else {                        /* parent: send the token, wait for it to return */
        sched_setscheduler(getpid(), SCHED_FIFO, &param);
        for (i = 0; i < 10000; i++) {
            write(fd[1], &send, 1);
            read(p[0], &receive, 1);
        }
        gettimeofday(&tv, NULL);
        printf("After Context Switch Time %ld s, %ld us\n", (long)tv.tv_sec, (long)tv.tv_usec);
    }
    return 0;
}

The average cost comes out to roughly 3.5 µs per context switch (note that each of the 10000 round trips involves two switches).

Estimating softirq overhead

The method below is fairly crude and only for reference. The heavier the load, the more packets a single softirq has to process and the longer it takes; if there are too few packets, measurement noise dominates and the numbers are inaccurate.

On the test machine the NIC RX/TX queues were set to 1 so that all softirqs are handled by a single core.

With no load the interrupt rate is around 4000/s. Load is then deliberately applied until the CPU reaches 80%, observed with vmstat and top:

$vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
19 0 0 174980 151840 3882800 0 0 0 11 1 1 1 0 99 0 0
11 0 0 174820 151844 3883668 0 0 0 0 30640 113918 59 22 20 0 0
9 0 0 175952 151852 3884576 0 0 0 224 29611 108549 57 22 21 0 0
11 0 0 171752 151852 3885636 0 0 0 3452 30682 113874 57 22 21 0 0

top shows si% at roughly 20%, i.e. 25000 interrupts on one core consume 20% of that CPU, meaning these softirqs take about 200 ms of CPU time per second.

200 × 1000 µs / 25000 = 200 / 25 = 8 µs, i.e. 8000 ns per softirq (on the high side)

Lowering the load so the CPU runs at 55%, si drops to 12%:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 0 0 174180 152076 3884360 0 0 0 0 25314 119681 40 17 43 0 0
1 0 0 172600 152080 3884308 0 0 0 252 24971 116407 40 17 43 0 0
4 0 0 174664 152080 3884540 0 0 0 3536 25164 118175 39 18 42 0 0

120 × 1000 µs / 21000 ≈ 5.7 µs, i.e. 5700 ns (still on the high side)

Lowering the load further (the 4-core CPU is only at about 15%):

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 183228 151788 3876288 0 0 0 0 15603 42460 6 3 91 0 0
0 0 0 181312 151788 3876032 0 0 0 0 15943 43129 7 2 91 0 0
1 0 0 181728 151788 3876544 0 0 0 3232 15790 42409 7 3 90 0 0
0 0 0 181584 151788 3875956 0 0 0 0 15728 42641 7 3 90 0 0
1 0 0 179276 151792 3876848 0 0 0 192 15862 42875 6 3 91 0 0
0 0 0 179508 151796 3876424 0 0 0 0 15404 41899 7 2 91 0 0

Roughly 11000 interrupts on a single core, corresponding to si at 2.2% of that CPU.

22 × 1000 µs / 11000 = 2 µs, i.e. 2000 ns (somewhat more plausible)
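
The same back-of-the-envelope arithmetic written out as a tiny helper (illustration only; si_percent and irq_per_sec are the values read from top/vmstat above):

/* softirq_cost.c - rough cost per softirq: one core spends (si% of one second)
 * handling irq_per_sec interrupts.
 * Build: gcc -O2 softirq_cost.c -o softirq_cost
 */
#include <stdio.h>

static double ns_per_softirq(double si_percent, double irq_per_sec)
{
    double busy_ns = si_percent / 100.0 * 1e9;  /* softirq CPU time per second, in ns */
    return busy_ns / irq_per_sec;
}

int main(void)
{
    /* the three data points measured above */
    printf("%.0f ns\n", ns_per_softirq(20.0, 25000));  /* ~8000 ns */
    printf("%.0f ns\n", ns_per_softirq(12.0, 21000));  /* ~5714 ns */
    printf("%.0f ns\n", ns_per_softirq(2.2, 11000));   /* ~2000 ns */
    return 0;
}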

Hyper-threading switch overhead

The smallest of all, basically negligible: under 1 ns.

The lmbench test tool

lmbench's lat_ctx (and related tools) report results in microseconds; under light load a single process context switch comes out at about 1540 ns (the 1.54 below).

[root@plantegg 13:19 /root/lmbench3]
#taskset -c 4 ./bin/lat_ctx -P 2 -W warmup -s 64 2 //CPU fully loaded
"size=64k ovr=3.47
2 7.88
 
#taskset -c 4 ./bin/lat_ctx -P 1 -W warmup -s 64 2
"size=64k ovr=3.46
2 1.54
 
#taskset -c 4-5 ./bin/lat_ctx -W warmup -s 64 2
"size=64k ovr=3.44
2 3.11
 
#taskset -c 4-7 ./bin/lat_ctx -P 2 -W warmup -s 64 2 //CPU at about 50%
"size=64k ovr=3.48
2 3.14
 
#taskset -c 4-15 ./bin/lat_ctx -P 3 -W warmup -s 64 2
"size=64k ovr=3.46
2 3.18

Impact of coroutines on performance

After switching the web service to coroutine-based scheduling, TPS improved by 50% (from 30000 to 45000), while the context switch count dropped from about 110000/s to 8000/s (even under no load, cs is around 4500).

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 0 0 3831480 153136 3819244 0 0 0 0 23599 6065 79 19 2 0 0
4 0 0 3829208 153136 3818824 0 0 0 160 23324 7349 80 18 2 0 0
4 0 0 3833320 153140 3818672 0 0 0 0 24567 8213 80 19 2 0 0
4 0 0 3831880 153140 3818532 0 0 0 0 24339 8350 78 20 2 0 0
 
[ 99s] threads: 60, tps: 0.00, reads/s: 44609.77, writes/s: 0.00, response time: 2.05ms (95%)
[ 100s] threads: 60, tps: 0.00, reads/s: 46538.27, writes/s: 0.00, response time: 1.99ms (95%)
[ 101s] threads: 60, tps: 0.00, reads/s: 46061.84, writes/s: 0.00, response time: 2.01ms (95%)
[ 102s] threads: 60, tps: 0.00, reads/s: 46961.05, writes/s: 0.00, response time: 1.94ms (95%)
[ 103s] threads: 60, tps: 0.00, reads/s: 46224.15, writes/s: 0.00, response time: 2.00ms (95%)
[ 104s] threads: 60, tps: 0.00, reads/s: 46556.93, writes/s: 0.00, response time: 1.98ms (95%)
[ 105s] threads: 60, tps: 0.00, reads/s: 45965.12, writes/s: 0.00, response time: 1.97ms (95%)
[ 106s] threads: 60, tps: 0.00, reads/s: 46369.96, writes/s: 0.00, response time: 2.01ms (95%)
 
//on a 4-core machine
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11588 admin 20 0 12.9g 6.9g 22976 R 95.7 45.6 0:33.07 Root-Worke //four coroutine workers drive the CPU close to full
11586 admin 20 0 12.9g 6.9g 22976 R 93.7 45.6 0:34.29 Root-Worke
11587 admin 20 0 12.9g 6.9g 22976 R 93.7 45.6 0:32.58 Root-Worke
11585 admin 20 0 12.9g 6.9g 22976 R 92.0 45.6 0:33.25 Root-Worke

Without coroutines, about 20% of the CPU stays idle and utilization cannot be pushed higher; with coroutines, CPU utilization reaches 95%.

Conclusions

  • A process context switch takes a few thousand nanoseconds (it varies by CPU model)
  • With taskset (CPU pinning), context switch time drops by roughly 50% (L1/L2 cache misses etc. are avoided)
  • Thread context switches are about 10% faster than process context switches
  • The measured numbers depend heavily on the actual running scenario and are hard to control: if CPU contention is too fierce, waiting for the scheduler gets counted in; if the CPU is mostly idle, the extra latency caused by cache misses does not show up
  • A system call is much lighter than a process context switch, roughly within 100 ns
  • A function call is lighter still, a few nanoseconds: push the stack and jump
  • A CPU hyper-threading switch is about the same as a function call, also done in a few nanoseconds

With these numbers in mind, thinking about what coroutines actually do, it becomes obvious why they are so efficient.
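
To make that concrete, here is a minimal sketch (an illustration, not from the original post) of a pure user-space context switch using the POSIX ucontext API. This is essentially what a coroutine switch is: save and restore a handful of registers and a stack pointer, with no system call and no kernel scheduler involved.

/* coro_switch.c - minimal user-space context switch with ucontext.
 * Build: gcc -O2 coro_switch.c -o coro_switch
 */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];

static void coro_fn(void)
{
    printf("in coroutine\n");
    swapcontext(&coro_ctx, &main_ctx);   /* yield back to main */
    printf("coroutine resumed\n");
    /* returning here jumps to uc_link, i.e. back to main */
}

int main(void)
{
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof(coro_stack);
    coro_ctx.uc_link = &main_ctx;        /* where to continue when coro_fn returns */
    makecontext(&coro_ctx, coro_fn, 0);

    swapcontext(&main_ctx, &coro_ctx);   /* switch into the coroutine */
    printf("back in main\n");
    swapcontext(&main_ctx, &coro_ctx);   /* resume it once more */
    printf("done\n");
    return 0;
}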

https://www.jianshu.com/p/0ae0c1153c34 关注:CodingTechWork,一起学习进步。 引言 并发编程 并发编程的目的是为了改善串行程序执行慢问题,但是,并不是启动更多线程就能够让程序执行更快。因为在并发时,容易受到软硬件资源等限制,从而导致上下文切换慢,频