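IPC stands for "Instructions Per Clock" (also written "instructions per cycle"): the number of instructions a CPU executes in each clock cycle. IPC reflects a CPU's design architecture; once a design is finalized, its IPC capability is fixed. Here IPC plays the decisive role, and clock frequency no longer outranks everything else.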

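The usual rule of thumb is CPU performance = IPC (instructions executed per clock cycle) × frequency (clock speed in MHz), a formula first proposed by Intel and widely accepted in the industry. A 15% IPC gain means 15% more performance at the same frequency. For example, at the same 4.0 GHz clock, an R7 3700 would be 15% faster than an R7 2700. (Source: http://www.lotpc.com/yjzs/8463.html)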
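As a rough illustration of the formula above, the following sketch expresses "performance = IPC × frequency" as plain C functions. The function names are hypothetical and do not come from the original source, and the model ignores everything else that affects real performance (memory, core count, and so on).

```c
/* Sketch of the article's model: performance = IPC x frequency.
 * Function names are illustrative, not from the original source. */

/* Instructions executed per second for a given IPC and clock (Hz). */
double instr_per_second(double ipc, double freq_hz)
{
    return ipc * freq_hz;
}

/* Relative performance of CPU B over CPU A under the same model. */
double relative_perf(double ipc_a, double freq_a_hz,
                     double ipc_b, double freq_b_hz)
{
    return instr_per_second(ipc_b, freq_b_hz) /
           instr_per_second(ipc_a, freq_a_hz);
}

/* Estimated run time in seconds for a fixed instruction count. */
double run_time_seconds(double instructions, double ipc, double freq_hz)
{
    return instructions / instr_per_second(ipc, freq_hz);
}
```

Under this model, `relative_perf(1.0, 4.0e9, 1.15, 4.0e9)` evaluates to about 1.15, which is the "15% more IPC, 15% more performance at the same clock" statement in numeric form.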

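Generally speaking, higher IPC is better: more instructions are executed per unit of time, so watching IPC gives some insight into how efficiently software runs. But how high is high enough? There is no standard answer; you need a baseline to compare against. Some code simply cannot reach a high IPC because of its logic, for example code with many jumps or random memory accesses, and that may be exactly where optimization is needed.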

Code directory layout

Source code: https://github.com/Rtoax/test/tree/master/cpu/ipc/test-how

    test-how/
    ├── alu8.c
    ├── alu.c
    ├── compile.sh
    ├── compile-sx.sh
    ├── nop.c
    ├── readme.txt
    ├── s1.c
    ├── s2.c
    ├── s3.c
    └── s4.c

compile-sx.sh

    # cat compile-sx.sh
    #!/bin/bash
    gcc -O0 $*
    perf stat ./a.out

compile.sh

    # cat compile.sh
    #!/bin/bash
    gcc $*
    perf stat ./a.out

s1.c

    int main(void)
    {
        unsigned long sum = 0, i = 0;

        /* Single accumulator: every iteration adds i to sum. */
        for (i = 0; i < 0x10000000; i += 1) {
            sum += i;
        }
    }

s2.c

    int main(void)
    {
        unsigned long sum = 0, a = 0, b = 0, c = 0, d = 0, i = 0;

        /* Four independent accumulators, combined after the loop. */
        for (i = 0; i < 0x10000000; i += 4) {
            a += i;
            b += i + 1;
            c += i + 2;
            d += i + 3;
        }
        sum = a + b + c + d;
    }

s3.c

    int main(void)
    {
        unsigned long sum = 0, a = 0, b = 0, c = 0, d = 0;
        register unsigned long i = 0;   /* only the loop counter is a register */

        for (i = 0; i < 0x10000000; i += 4) {
            a += i;
            b += i + 1;
            c += i + 2;
            d += i + 3;
        }
        sum = a + b + c + d;
    }

s4.c

    int main(void)
    {
        /* Every variable is requested as a register variable. */
        register unsigned long sum = 0, a = 0, b = 0, c = 0, d = 0;
        register unsigned long i = 0;

        for (i = 0; i < 0x10000000; i += 4) {
            a += i;
            b += i + 1;
            c += i + 2;
            d += i + 3;
        }
        sum = a + b + c + d;
    }

alu.c

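    int main(void)
    {
        /* x86-64 only. Basic asm declares no clobbers; tolerable in this
           benchmark because the loop never exits. */
        while (1) {
            __asm__ (
                "movq $0x0,%rax\n\t"
                "movq $0xa,%rbx\n\t"
                "andq $0x12345678,%rbx\n\t"
                "orq  $0x12345678,%rbx\n\t"
                "shlq $0x2,%rbx\n\t"
                "addq %rbx,%rax\n\t"
                "subq $0x14,%rax\n\t"
                "movq %rax,%rcx");
        }
    }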

nop.c

    int main(void)
    {
        while (1) {
            __asm__ (
                "nop\n\t"
                "nop\n\t"
                /* pad to 128 nops here; see the GitHub source for the full file */
                "nop");
        }
    }
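Since nop.c needs its nop list padded out by hand, and the later experiments vary the number of instructions per loop, a small generator can save typing. The sketch below is a hypothetical helper that is not part of the original repository: it writes the C source of a nop.c-style loop with a chosen number of nops into a buffer, which a caller could then save to a file and pass to compile.sh.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not in the original repo): writes the C source of a
 * nop.c-style loop with n_nops "nop" instructions into buf.
 * Returns the number of bytes written, or -1 if buf is too small. */
int gen_nop_source(char *buf, size_t cap, int n_nops)
{
    size_t used = 0;
    int i, n;

    if (n_nops < 1)
        return -1;

    n = snprintf(buf, cap,
                 "int main(void)\n{\n    while (1) {\n        __asm__ (\n");
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (i = 0; i < n_nops; i++) {
        /* Every line but the last ends with \n\t inside the asm string. */
        const char *line = (i + 1 < n_nops) ? "            \"nop\\n\\t\"\n"
                                            : "            \"nop\");\n";
        n = snprintf(buf + used, cap - used, "%s", line);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, cap - used, "    }\n}\n");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    used += (size_t)n;
    return (int)used;
}
```

Calling `gen_nop_source(buf, sizeof buf, 128)` with a buffer of a few kilobytes fills `buf` with a loop containing 128 nops; other counts can be generated the same way to try different loop lengths.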

alu8.c

    int main(void)
    {
        /* Same as alu.c with the orq removed: 7 instructions plus the jump. */
        while (1) {
            __asm__ (
                "movq $0x0,%rax\n\t"
                "movq $0xa,%rbx\n\t"
                "andq $0x12345678,%rbx\n\t"
                "shlq $0x2,%rbx\n\t"
                "addq %rbx,%rax\n\t"
                "subq $0x14,%rax\n\t"
                "movq %rax,%rcx");
        }
    }

Performance tests

s1.c

    # ./compile-sx.sh s1.c

     Performance counter stats for './a.out':

           696.193309      task-clock (msec)         #    0.998 CPUs utilized
                   13      context-switches          #    0.019 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  114      page-faults               #    0.164 K/sec
        2,325,600,151      cycles                    #    3.340 GHz
        1,345,561,852      instructions              #    0.58  insn per cycle
          269,056,226      branches                  #  386.468 M/sec
               29,293      branch-misses             #    0.01% of all branches

          0.697566623 seconds time elapsed
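The "insn per cycle" and "GHz" columns in the s1.c output are simple ratios of the raw counters. As a minimal sketch (the function names are illustrative, not perf's), they can be recomputed like this:

```c
#include <stdint.h>

/* IPC as perf stat reports it: instructions divided by cycles.
 * Returns 0.0 when cycles is zero to avoid dividing by zero. */
double ipc_from_counters(uint64_t instructions, uint64_t cycles)
{
    if (cycles == 0)
        return 0.0;
    return (double)instructions / (double)cycles;
}

/* Average clock in GHz: cycles divided by the task-clock time.
 * task_clock_msec is the "task-clock (msec)" value from perf stat. */
double ghz_from_counters(uint64_t cycles, double task_clock_msec)
{
    if (task_clock_msec <= 0.0)
        return 0.0;
    return (double)cycles / (task_clock_msec * 1.0e6);
}
```

With the s1.c counters, `ipc_from_counters(1345561852ULL, 2325600151ULL)` is about 0.58 and `ghz_from_counters(2325600151ULL, 696.193309)` is about 3.34, matching the right-hand column of the output.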

s2.c

    # ./compile-sx.sh s2.c

     Performance counter stats for './a.out':

           197.554653      task-clock (msec)         #    0.997 CPUs utilized
                    1      context-switches          #    0.005 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  115      page-faults               #    0.582 K/sec
          662,040,633      cycles                    #    3.351 GHz
        1,343,510,723      instructions              #    2.03  insn per cycle
           67,353,710      branches                  #  340.937 M/sec
               11,076      branch-misses             #    0.02% of all branches

          0.198063870 seconds time elapsed
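s2.c differs from s1.c only in how the sum is accumulated: one running total becomes four independent ones that are combined at the end. A plausible reading of the result is that the loop does four elements' worth of work per trip and the four additions do not depend on each other, so the CPU can overlap them. The sketch below restates both loops as functions (the names are illustrative) to show that the rewrite does not change the result:

```c
/* s1.c-style loop: one accumulator, every add depends on the previous one. */
unsigned long sum_single(unsigned long n)
{
    unsigned long sum = 0, i;

    for (i = 0; i < n; i += 1)
        sum += i;
    return sum;
}

/* s2.c-style loop: four independent accumulators, combined at the end.
 * Assumes n is a multiple of 4, as 0x10000000 is in the article. */
unsigned long sum_four(unsigned long n)
{
    unsigned long a = 0, b = 0, c = 0, d = 0, i;

    for (i = 0; i < n; i += 4) {
        a += i;
        b += i + 1;
        c += i + 2;
        d += i + 3;
    }
    return a + b + c + d;
}
```

For any `n` that is a multiple of 4, `sum_single(n)` and `sum_four(n)` return the same value; only the shape of the dependency chain differs.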

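Compared with s1.c, IPC rises from 0.58 to 2.03 and elapsed time drops from about 0.70 s to about 0.20 s. The instruction count, however, is essentially unchanged. The assembly shows that the -O0 build still performs many memory accesses, so let's modify the code slightly and use register to hold the variable i: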

s3.c

    # ./compile-sx.sh s3.c

     Performance counter stats for './a.out':

           130.640582      task-clock (msec)         #    0.996 CPUs utilized
                    0      context-switches          #    0.000 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  115      page-faults               #    0.880 K/sec
          433,319,129      cycles                    #    3.317 GHz
        1,074,802,878      instructions              #    2.48  insn per cycle
           67,303,874      branches                  #  515.184 M/sec
                9,671      branch-misses             #    0.01% of all branches

          0.131150296 seconds time elapsed

Going one step further, put every variable in a register:

s4.c

    # ./compile-sx.sh s4.c

     Performance counter stats for './a.out':

            64.823574      task-clock (msec)         #    0.992 CPUs utilized
                    1      context-switches          #    0.015 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  115      page-faults               #    0.002 M/sec
          215,723,116      cycles                    #    3.328 GHz
          604,800,935      instructions              #    2.80  insn per cycle
           67,259,662      branches                  # 1037.580 M/sec
                8,069      branch-misses             #    0.01% of all branches

          0.065371469 seconds time elapsed

At this point we have a reasonably satisfying result; whether there is still room to optimize is left for the reader to think about.
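One way to compare the four runs on an equal footing is cycles per summed element, since every variant sums the same 0x10000000 values. The sketch below uses hypothetical helper names and perf's total cycle counts; those totals include process startup, so the results are approximate.

```c
#include <stdint.h>

/* Number of values summed by every s1.c..s4.c variant (loop bound 0x10000000). */
#define ELEMENTS 0x10000000ULL

/* Average cycles spent per summed element, from perf's total cycle count.
 * The total includes process startup, so treat the result as approximate. */
double cycles_per_element(uint64_t cycles)
{
    return (double)cycles / (double)ELEMENTS;
}

/* Speedup of a run over a baseline, both measured in cycles. */
double speedup(uint64_t baseline_cycles, uint64_t cycles)
{
    return cycles ? (double)baseline_cycles / (double)cycles : 0.0;
}
```

Fed with the cycle counts above, this gives roughly 8.7 cycles per element for s1.c, 2.5 for s2.c, 1.6 for s3.c, and 0.8 for s4.c, an overall speedup of a little under 11x from s1.c to s4.c.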

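So what does IPC actually tell us? It describes CPU execution efficiency from one angle, but not the whole picture. To make an application more efficient (note: the application, not the CPU), there are really only two things to do:

Don't do work that isn't necessary
Do the necessary work more efficiently; this is where IPC can help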
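The first point can be illustrated with the same summation used in s1.c. This example is not from the original article: the sum 0 + 1 + ... + (n - 1) has the closed form n(n - 1)/2, so the loop, however high its IPC, can be replaced by a handful of instructions.

```c
/* Loop version, as in s1.c: n additions. */
unsigned long sum_loop(unsigned long n)
{
    unsigned long sum = 0, i;

    for (i = 0; i < n; i += 1)
        sum += i;
    return sum;
}

/* Closed form for 0 + 1 + ... + (n - 1) = n * (n - 1) / 2: no loop at all. */
unsigned long sum_closed_form(unsigned long n)
{
    /* Halve the even factor first so the division is exact. */
    if (n % 2 == 0)
        return (n / 2) * (n - 1);
    return n * ((n - 1) / 2);
}
```

Both functions return the same value for every `n`, but the second does a constant amount of work. A tuned loop with an IPC near 3 still loses to not running the loop.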

Since IPC can approach 3, can it go higher still? Consider two tests, alu.c and nop.c, run on an Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz.

alu.c

    # ./compile.sh alu.c
    ^C./a.out: Interrupt

     Performance counter stats for './a.out':

          2338.321843      task-clock (msec)         #    0.999 CPUs utilized
                   14      context-switches          #    0.006 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  113      page-faults               #    0.048 K/sec
        7,807,833,575      cycles                    #    3.339 GHz
       28,947,455,798      instructions              #    3.71  insn per cycle
        3,217,085,943      branches                  # 1375.810 M/sec
               60,083      branch-misses             #    0.00% of all branches

          2.340188767 seconds time elapsed

nop.c

    # ./compile.sh nop.c
    ^C./a.out: Interrupt

     Performance counter stats for './a.out':

          1556.110089      task-clock (msec)         #    0.999 CPUs utilized
                    5      context-switches          #    0.003 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  113      page-faults               #    0.073 K/sec
        5,189,621,889      cycles                    #    3.335 GHz
       18,438,724,412      instructions              #    3.55  insn per cycle
          144,097,748      branches                  #   92.601 M/sec
               42,524      branch-misses             #    0.03% of all branches

          1.557307312 seconds time elapsed

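These two tests show that IPC can even approach 4, and they raise a few questions:

3.84 should not be the limit; shouldn't the limit at least be an integer? (The run above measured 3.71; 3.84 is presumably the figure from the referenced article.)
alu scores higher than nop, which seems counterintuitive?
Many of the instructions in alu depend on each other, so how do they reach such a high degree of parallelism?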

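Take the first question: why 3.84 rather than 4 or 5? The first thing to look at is while(1). Unlike the mov/and/or/shl/sub instructions, it compiles to a branch. CPU support for branches is necessarily more complicated: when the CPU meets a branch, does it keep prefetching the instructions that follow? If the branch is taken, isn't that prefetch wasted? The other factor is that each Broadwell core has only 4 ALUs, only 2 of which can execute branches, and at most 4 micro-ops can be dispatched per cycle. (The Xeon Gold 6150 named above is a Skylake-SP part rather than Broadwell.) Each iteration of alu.c contains 8 instructions, 9 counting the jump itself, which does not look like the best case. So what happens if we remove one instruction from the loop: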

alu8.c

    # ./compile.sh alu8.c
    ^C./a.out: Interrupt

     Performance counter stats for './a.out':

          2131.810701      task-clock (msec)         #    1.000 CPUs utilized
                    5      context-switches          #    0.002 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  113      page-faults               #    0.053 K/sec
        7,134,964,787      cycles                    #    3.347 GHz
       27,537,819,945      instructions              #    3.86  insn per cycle
        3,442,735,496      branches                  # 1614.935 M/sec
               48,328      branch-misses             #    0.00% of all branches

          2.132667345 seconds time elapsed

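IPC now reaches 3.99, very close to 4 (the run shown above reports 3.86, up from 3.71 for alu.c). Changing the loop to 12 (including the jump), 16, or 20 instructions likewise gives an IPC of about 3.99, whereas 13 or 14 falls a little short. The one exception is 7, which also reaches 3.99 (why?); dropping to 6 falls short again.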
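The loop lengths discussed here can be cross-checked against the perf output. Each pass through while(1) executes exactly one jump, so dividing the instruction count by the branch count should recover the number of instructions per iteration. A minimal sketch, with an illustrative function name:

```c
#include <stdint.h>

/* Each pass through while(1) executes exactly one branch (the jump back),
 * so instructions / branches approximates instructions per iteration.
 * Rounds to the nearest integer; returns 0 if branches is zero. */
uint64_t instr_per_iteration(uint64_t instructions, uint64_t branches)
{
    if (branches == 0)
        return 0;
    return (instructions + branches / 2) / branches;
}
```

With the counters above, alu.c gives `instr_per_iteration(28947455798ULL, 3217085943ULL) == 9` and alu8.c gives `instr_per_iteration(27537819945ULL, 3442735496ULL) == 8`, consistent with 8 and 7 ALU instructions plus the jump. The small deviation before rounding comes from startup code outside the loop.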

References

《IPC到底能有多高》 (How high can IPC really go?) https://zhuanlan.zhihu.com/p/138887210

《Instruction Level Parallelism》http://web.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/instrLevParal.html

《Instruction Level Parallelism PDF》https://eecs.ceas.uc.edu/~wilseypa/classes/eecs7095/lectureNotes/ilp/ilp.pdf

《Instruction Level Parallelism PDF》https://www.nvidia.com/content/cudazone/cudau/courses/ucdavis/lectures/ilp5.pdf

</article>