[转帖]简单理解Linux的Memory Overcommit

简单,理解,linux,memory,overcommit · 浏览次数 : 0

小编点评

内容生成时需要带简单的排版，以下内容需带简单的排版： 1. **vmstat -a procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free inact active si so bi bo in cs us sy id wa st 1** 2. **如果inactive list很大，表明在必要时可以回收的页面很多；而如果inactive list很小，说明可以回收的页面不多** 3. **Active/inactive memory是针对用户进程所占用的内存而言的，内核占用的内存（包括slab）不在其中**

正文

https://zhuanlan.zhihu.com/p/551677956
复制

Memory Overcommit的意思是操作系统承诺给进程的内存大小超过了实际可用的内存。一个保守的操作系统不会允许memory overcommit，有多少就分配多少，再申请就没有了，这其实有些浪费内存，因为进程实际使用到的内存往往比申请的内存要少，比如某个进程malloc()了200MB内存，但实际上只用到了100MB，按照UNIX/Linux的算法，物理内存页的分配发生在使用的瞬间，而不是在申请的瞬间，也就是说未用到的100MB内存根本就没有分配，这100MB内存就闲置了。下面这个概念很重要，是理解memory overcommit的关键：commit(或overcommit)针对的是内存申请，内存申请不等于内存分配，内存只在实际用到的时候才分配。

Linux是允许memory overcommit的，只要你来申请内存我就给你，寄希望于进程实际上用不到那么多内存，但万一用到那么多了呢？那就会发生类似“银行挤兑”的危机，现金(内存)不足了。Linux设计了一个OOM killer机制(OOM = out-of-memory)来处理这种危机：挑选一个进程出来杀死，以腾出部分内存，如果还不够就继续杀…也可通过设置内核参数 vm.panic_on_oom 使得发生OOM时自动重启系统。这都是有风险的机制，重启有可能造成业务中断，杀死进程也有可能导致业务中断，我自己的这个小网站就碰到过这种问题。所以Linux 2.6之后允许通过内核参数 vm.overcommit_memory 禁止memory overcommit。

内核参数 vm.overcommit_memory 接受三种取值：

0 – Heuristic overcommit handling. 这是缺省值，它允许overcommit，但过于明目张胆的overcommit会被拒绝，比如malloc一次性申请的内存大小就超过了系统总内存。Heuristic的意思是“试探式的”，内核利用某种算法（对该算法的详细解释请看文末）猜测你的内存申请是否合理，它认为不合理就会拒绝overcommit。
1 – Always overcommit. 允许overcommit，对内存申请来者不拒。
2 – Don’t overcommit. 禁止overcommit。

关于禁止overcommit (vm.overcommit_memory=2) ，需要知道的是，怎样才算是overcommit呢？kernel设有一个阈值，申请的内存总数超过这个阈值就算overcommit，在/proc/meminfo中可以看到这个阈值的大小：

# grep -i commit /proc/meminfo
CommitLimit:     5967744 kB
Committed_AS:    5363236 kB
复制

CommitLimit 就是overcommit的阈值，申请的内存总数超过CommitLimit的话就算是overcommit。
这个阈值是如何计算出来的呢？它既不是物理内存的大小，也不是free memory的大小，它是通过内核参数vm.overcommit_ratio或vm.overcommit_kbytes间接设置的，公式如下：
【CommitLimit = (Physical RAM * vm.overcommit_ratio / 100) + Swap】

注：
vm.overcommit_ratio 是内核参数，缺省值是50，表示物理内存的50%。如果你不想使用比率，也可以直接指定内存的字节数大小，通过另一个内核参数 vm.overcommit_kbytes 即可；
如果使用了huge pages，那么需要从物理内存中减去，公式变成：
CommitLimit = ([total RAM] – [total huge TLB RAM]) * vm.overcommit_ratio / 100 + swap
可参考：https://access.redhat.com/solutions/665023

/proc/meminfo中的 Committed_AS 表示所有进程已经申请的内存总大小，（注意是已经申请的，不是已经分配的），如果 Committed_AS 超过 CommitLimit 就表示发生了 overcommit，超出越多表示 overcommit 越严重。Committed_AS 的含义换一种说法就是，如果要绝对保证不发生OOM (out of memory) 需要多少物理内存。

“sar -r”是查看内存使用状况的常用工具，它的输出结果中有两个与overcommit有关，kbcommit 和 %commit：
kbcommit对应/proc/meminfo中的 Committed_AS；
%commit的计算公式并没有采用 CommitLimit作分母，而是Committed_AS/(MemTotal+SwapTotal)，意思是_内存申请_占_物理内存与交换区之和_的百分比。

$ sar -r 
 
05:00:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
05:10:01 PM    160576   3648460     95.78         0   1846212   4939368     62.74   1390292   1854880    
复制

附：对Heuristic overcommit算法的解释

内核参数 vm.overcommit_memory 的值0，1，2对应的源代码如下，其中heuristic overcommit对应的是OVERCOMMIT_GUESS：

源文件：source/include/linux/mman.h
 
#define OVERCOMMIT_GUESS                0
#define OVERCOMMIT_ALWAYS               1
#define OVERCOMMIT_NEVER                2
复制

Heuristic overcommit算法在以下函数中实现，基本上可以这么理解：
单次申请的内存大小不能超过【free memory + free swap + pagecache的大小 + SLAB中可回收的部分】，否则本次申请就会失败。

源文件：source/mm/mmap.c 以RHEL内核2.6.32-642为例
 
0120 /*
0121  * Check that a process has enough memory to allocate a new virtual
0122  * mapping. 0 means there is enough memory for the allocation to
0123  * succeed and -ENOMEM implies there is not.
0124  *
0125  * We currently support three overcommit policies, which are set via the
0126  * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
0127  *
0128  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
0129  * Additional code 2002 Jul 20 by Robert Love.
0130  *
0131  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
0132  *
0133  * Note this is a helper function intended to be used by LSMs which
0134  * wish to use this logic.
0135  */
0136 int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
0137 {
0138         unsigned long free, allowed;
0139 
0140         vm_acct_memory(pages);
0141 
0142         /*
0143          * Sometimes we want to use more memory than we have
0144          */
0145         if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
0146                 return 0;
0147 
0148         if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) { //Heuristic overcommit算法开始
0149                 unsigned long n;
0150 
0151                 free = global_page_state(NR_FILE_PAGES); //pagecache汇总的页面数量
0152                 free += get_nr_swap_pages(); //free swap的页面数
0153 
0154                 /*
0155                  * Any slabs which are created with the
0156                  * SLAB_RECLAIM_ACCOUNT flag claim to have contents
0157                  * which are reclaimable, under pressure.  The dentry
0158                  * cache and most inode caches should fall into this
0159                  */
0160                 free += global_page_state(NR_SLAB_RECLAIMABLE); //SLAB可回收的页面数
0161 
0162                 /*
0163                  * Reserve some for root
0164                  */
0165                 if (!cap_sys_admin)
0166                         free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10); //给root用户保留的页面数
0167 
0168                 if (free > pages)
0169                         return 0;
0170 
0171                 /*
0172                  * nr_free_pages() is very expensive on large systems,
0173                  * only call if we're about to fail.
0174                  */
0175                 n = nr_free_pages(); //当前free memory页面数
0176 
0177                 /*
0178                  * Leave reserved pages. The pages are not for anonymous pages.
0179                  */
0180                 if (n <= totalreserve_pages)
0181                         goto error;
0182                 else
0183                         n -= totalreserve_pages;
0184 
0185                 /*
0186                  * Leave the last 3% for root
0187                  */
0188                 if (!cap_sys_admin)
0189                         n -= n / 32;
0190                 free += n;
0191 
0192                 if (free > pages)
0193                         return 0;
0194 
0195                 goto error;
0196         }
0197 
0198         allowed = vm_commit_limit();
0199         /*
0200          * Reserve some for root
0201          */
0202         if (!cap_sys_admin)
0203                 allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
0204 
0205         /* Don't let a single process grow too big:
0206            leave 3% of the size of this process for other processes */
0207         if (mm)
0208                 allowed -= mm->total_vm / 32;
0209 
0210         if (percpu_counter_read_positive(&vm_committed_as) < allowed)
0211                 return 0;
0212 error:
0213         vm_unacct_memory(pages);
0214 
0215         return -ENOMEM;
0216 }
复制

接下来解读一下vmstat中的ACTIVE/INACTIVE MEMORY

vmstat -a 命令能看到active memory 和 inactive memory：

$ vmstat -a 
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 1  0 138096 319560 1372408 1757848    0    0     2     3    2    3  1  0 99  0  0
复制

但它们的含义在manpage中只给了简单的说明，并未详细解释：

inact: the amount of inactive memory. (-a option)
active: the amount of active memory. (-a option)

在此我们试图准确理解它的含义。通过阅读vmstat的源代码(vmstat.c和proc/sysinfo.c)得知，vmstat命令是直接从/proc/meminfo中获取的数据：

$ grep -i act /proc/meminfo
Active:          1767928 kB
Inactive:        1373760 kB
复制

而/proc/meminfo的数据是在以下内核函数中生成的：

fs/proc/meminfo.c:
==================
0023 static int meminfo_proc_show(struct seq_file *m, void *v)
0024 {
...
 
0032         unsigned long pages[NR_LRU_LISTS];
...
 
0051         for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
0052                 pages[lru] = global_page_state(NR_LRU_BASE + lru);
...
 
0095                 "Active:         %8lu kB\n"
0096                 "Inactive:       %8lu kB\n"
0097                 "Active(anon):   %8lu kB\n"
0098                 "Inactive(anon): %8lu kB\n"
0099                 "Active(file):   %8lu kB\n"
0100                 "Inactive(file): %8lu kB\n"
...
 
0148                 K(pages[LRU_ACTIVE_ANON]   + pages[LRU_ACTIVE_FILE]),
0149                 K(pages[LRU_INACTIVE_ANON] + pages[LRU_INACTIVE_FILE]),
0150                 K(pages[LRU_ACTIVE_ANON]),
0151                 K(pages[LRU_INACTIVE_ANON]),
0152                 K(pages[LRU_ACTIVE_FILE]),
0153                 K(pages[LRU_INACTIVE_FILE]),
...
复制

这段代码的意思是统计所有的LRU list，其中Active Memory等于ACTIVE_ANON与ACTIVE_FILE之和，Inactive Memory等于INACTIVE_ANON与INACTIVE_FILE之和。

LRU list是Linux kernel的内存页面回收算法(Page Frame Reclaiming Algorithm)所使用的数据结构，LRU是Least Recently Used的缩写词。这个算法的核心思想是：回收的页面应该是最近使用得最少的，为了实现这个目标，最理想的情况是每个页面都有一个年龄项，用于记录最近一次访问页面的时间，可惜x86 CPU硬件并不支持这个特性，x86 CPU只能做到在访问页面时设置一个标志位Access Bit，无法记录时间，所以Linux Kernel使用了一个折衷的方法：它采用了LRU list列表，把刚访问过的页面放在列首，越接近列尾的就是越长时间未访问过的页面，这样，虽然不能记录访问时间，但利用页面在LRU list中的相对位置也可以轻松找到年龄最长的页面。Linux kernel设计了两种LRU list: active list 和 inactive list, 刚访问过的页面放进active list，长时间未访问过的页面放进inactive list，这样从inactive list回收页面就变得简单了。内核线程kswapd会周期性地把active list中符合条件的页面移到inactive list中，这项转移工作是由refill_inactive_zone()完成的。

LRU list 示意图

vmstat看到的active/inactive memory就分别是active list和inactive list中的内存大小。如果inactive list很大，表明在必要时可以回收的页面很多；而如果inactive list很小，说明可以回收的页面不多。

Active/inactive memory是针对用户进程所占用的内存而言的，内核占用的内存（包括slab）不在其中。

至于在源代码中看到的ACTIVE_ANON和ACTIVE_FILE，分别表示anonymous pages和file-backed pages。用户进程的内存页分为两种：与文件关联的内存（比如程序文件、数据文件所对应的内存页）和与文件无关的内存（比如进程的堆栈，用malloc申请的内存），前者称为file-backed pages，后者称为anonymous pages。File-backed pages在发生换页(page-in或page-out)时，是从它对应的文件读入或写出；anonymous pages在发生换页时，是对交换区进行读/写操作。