https://blog.mygraphql.com/zh/notes/low-tec/network/tcp-mem/
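I recently had to containerize a foundational service (Cassandra) whose single Pod holds more than 10k TCP connections. That means estimating the resources it uses (especially TCP buffer memory) and its impact on neighboring Pods running on the same worker node (that is, how well containers are isolated from each other). I wrote this post as a memo. I hope it is also of some use to readers: technical work demands rigor, so either you have the time and ability to read the kernel source yourself, or, if you rely on documentation and articles, you have to compare several sources before trusting any of them.
Collecting material is not only a technical skill but also a language skill: which keywords to search for, which sources look more reliable, and so on.
Since this is a collection of material and my translation skills are limited, I translate as little as possible and keep the original text. Thank you for your understanding. I also add some personal notes for reference.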
The TCP parameters recommended by the official Cassandra documentation:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
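Cassandra predates containerization, so the older guidance on its website naturally assumes that the worker node / VM runs nothing but Cassandra. It does not consider a k8s environment where applications are co-located with Cassandra, let alone one that uses Ceph PVCs as the storage layer (although the website does say that SAN is not recommended, cassandra.apache.org).
So the question is: how will these parameters affect other Pods? To answer that, we first need to know:
This post tries to collect and summarize material on the first two points.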
Linux supports RFC 1323 TCP high performance extensions. These include Protection Against Wrapped Sequence Numbers (PAWS), Window Scaling and Timestamps. Window scaling allows the use of large (> 64 kB) TCP windows in order to support links with high latency or bandwidth. To make use of them, the send and receive buffer sizes must be increased. They can be set via the /proc/sys/net/ipv4/tcp_wmem and /proc/sys/net/ipv4/tcp_rmem files.
The maximum sizes for socket buffers declared via the SO_SNDBUF and SO_RCVBUF mechanisms are limited by the values in the /proc/sys/net/core/rmem_max and /proc/sys/net/core/wmem_max files. Note that TCP actually allocates twice the size of the buffer requested in the setsockopt(2) call, and so a succeeding getsockopt(2) call will not return the same size of buffer as requested in the setsockopt(2) call. TCP uses the extra space for administrative purposes and internal kernel structures, and the /proc file values reflect the larger sizes compared to the actual TCP windows. On individual connections, the socket buffer size must be set prior to the listen(2) or connect(2) calls in order to have it take effect. See socket(7) for more information.
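The doubling described above can be observed from user space. The following is a minimal sketch, not taken from the quoted sources: the helper name `request_rcvbuf` and the 8192-byte request are arbitrary choices for illustration. It sets SO_RCVBUF on a fresh socket and reads back what the kernel reports. On Linux the reported value is normally twice the request, capped by net.core.rmem_max; other operating systems may behave differently.

```python
import socket


def request_rcvbuf(requested: int) -> int:
    """Request a receive buffer of `requested` bytes; return what the kernel reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # The option is set before any connect()/listen() call, as the man page requires.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    requested = 8192  # arbitrary illustrative value
    effective = request_rcvbuf(requested)
    print(f"requested={requested} reported={effective}")
```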
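The kernel provides two ways to adjust the TCP receive window size:
Manual adjustment of the window and buffer sizes by the application: call setsockopt(2) with SO_SNDBUF and SO_RCVBUF.
Automatic tuning by the kernel: controlled by the two parameters /proc/sys/net/ipv4/tcp_wmem and /proc/sys/net/ipv4/tcp_rmem.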
Linux autotuning is logic in the Linux kernel that adjusts the buffer size limits and the receive window based on actual packet processing. It takes into consideration a number of things including
Autotuning can sometimes seem mysterious, but it is actually fairly straightforward.
The central idea is that Linux can track the rate at which the local application is reading data off of the receive queue. It also knows the session RTT. Because Linux knows these things, it can automatically increase the buffers and receive window until it reaches the point at which the application layer or network bottleneck links are the constraint on throughput (and not host buffer settings). At the same time, autotuning prevents slow local readers from having excessively large receive queues. The way autotuning does that is by limiting the receive window and its corresponding receive buffer to an appropriate size for each socket.
The values set by autotuning can be seen via the Linux “ss” command from the iproute package (e.g. “ss -tmi”). The relevant output fields from that command are:
💡 Use ss to inspect the buffer usage of TCP sockets; see my blog post "Probably the most complete guide to ss, the TCP connection health metrics tool". An example is also provided here.
Recv-Q is the number of user payload bytes not yet read by the local application.
rcv_ssthresh is the window clamp, a.k.a. the maximum receive window size. This value is not known to the sender. The sender receives only the current window size, via the TCP header field. A closely-related field in the kernel, tp->window_clamp, is the maximum window size allowable based on the amount of available memory. rcv_ssthresh is the receiver-side slow-start threshold value.
skmem_r is the actual amount of memory that is allocated, which includes not only user payload (Recv-Q) but also additional memory needed by Linux to process the packet (packet metadata). This is known within the kernel as sk_rmem_alloc.
Note that there are other buffers associated with a socket, so skmem_r does not represent the total memory that a socket might have allocated.
skmem_rb is the maximum amount of memory that could be allocated by the socket for the receive buffer. This is higher than rcv_ssthresh to account for memory needed for packet processing that is not packet data. Autotuning can increase this value (up to tcp_rmem max) based on how fast the L7 application is able to read data from the socket and the RTT of the session. This is known within the kernel as sk_rcvbuf.
rcv_space is the high water mark of the rate of the local application reading from the receive buffer during any RTT. This is used internally within the kernel to adjust sk_rcvbuf.
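To make the fields above easier to track across many connections, the ss output can be parsed. The sketch below assumes the `skmem:(r…,rb…,t…,tb…,…)` layout that `ss -m` prints. The sample line is made up for illustration and `parse_ss_memory` is a hypothetical helper name. The keys `skmem_r` and `skmem_rb` follow the naming used in the text above.

```python
import re

SKMEM_RE = re.compile(r"skmem:\(([^)]*)\)")
ITEM_RE = re.compile(r"([a-z]+)(\d+)")
EXTRA_RE = re.compile(r"\b(rcv_space|rcv_ssthresh):(\d+)")


def parse_ss_memory(line):
    """Extract skmem_* counters plus rcv_space / rcv_ssthresh from one ss info line."""
    fields = {}
    skmem = SKMEM_RE.search(line)
    if skmem:
        for item in skmem.group(1).split(","):
            match = ITEM_RE.fullmatch(item.strip())
            if match:  # silently skip anything that is not <letters><digits>
                fields["skmem_" + match.group(1)] = int(match.group(2))
    for name, value in EXTRA_RE.findall(line):
        fields[name] = int(value)
    return fields


# Illustrative sample only, not captured from a real host.
SAMPLE = "skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d0) cubic rcv_space:14480 rcv_ssthresh:64088"

if __name__ == "__main__":
    info = parse_ss_memory(SAMPLE)
    print(info["skmem_r"], info["skmem_rb"], info["rcv_space"], info["rcv_ssthresh"])
```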
Earlier we mentioned a setting called tcp_rmem. net.ipv4.tcp_rmem consists of three values, but in this document we are always referring to the third value (except where noted). It is a global setting that specifies the maximum amount of memory that any TCP receive buffer can allocate, i.e. the maximum permissible value that autotuning can use for sk_rcvbuf. This is essentially just a failsafe for autotuning, and under normal circumstances should play only a minor role in TCP memory management.
It’s worth mentioning that receive buffer memory is not preallocated. Memory is allocated based on actual packets arriving and sitting in the receive queue. It’s also important to realize that filling up a receive queue is not one of the criteria that autotuning uses to increase sk_rcvbuf. Indeed, preventing this type of excessive buffering (bufferbloat) is one of the benefits of autotuning.
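Since containers and Linux namespaces became mainstream, Linux has been steadily making these parameters namespace-aware:
Since Linux kernel v4.15, net.ipv4.tcp_rmem and net.ipv4.tcp_wmem are namespaced, so different containers / Linux network namespaces can configure them independently.
The benefit of configuring these parameters per Pod is that different applications in different Pods can use TCP in very different ways:
Some applications can choose to configure their buffers through setsockopt(2) with SO_SNDBUF and SO_RCVBUF. But not every application is suited to a static setting, or exposes one. That is where the namespaced configuration becomes useful.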
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html?highlight=optmem_max#rmem-default
The default setting of the socket receive buffer in bytes.
Deprecated for TCP socket. For TCP connections, only tcp_rmem applies.
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html?highlight=optmem_max#rmem-max
The maximum receive socket buffer size in bytes. Only for setsockopt().
https://cromwell-intl.com/open-source/performance-tuning/tcp.html
maximum receive buffer sizes that can be set using setsockopt(), in bytes
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html?highlight=optmem_max#wmem-default
The default setting (in bytes) of the socket send buffer.
Deprecated for TCP socket. For TCP connections, only tcp_wmem applies.
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html?highlight=optmem_max#wmem-max
The maximum send socket buffer size in bytes. Only for setsockopt().
https://cromwell-intl.com/open-source/performance-tuning/tcp.html
maximum send buffer sizes that can be set using setsockopt(), in bytes
tcp_moderate_rcvbuf (Boolean; default: enabled; since Linux 2.4.17/2.6.7)
If enabled, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer (no greater than tcp_rmem[2]) to match the size required by the path for full throughput.
[TCP.IP Illustrated Vol1]
Large Buffers and Linux TCP Auto-Tuning
https://www.ibm.com/docs/en/linux-on-systems?topic=tuning-tcpip-ipv4-settings
Contains three values that represent the minimum, default and maximum size of the TCP socket receive buffer.
The minimum represents the smallest receive buffer size guaranteed, even under memory pressure. The minimum value defaults to 1 page or 4096 bytes.
The default value represents the initial size of a TCP socket’s receive buffer. This value supersedes net.core.rmem_default used by other protocols. The default value for this setting is 87,380 bytes. It also sets the tcp_adv_win_scale and initializes the TCP window size to 65535 bytes.
The maximum represents the largest receive buffer size automatically selected for TCP sockets. This value does NOT override net.core.rmem_max. The default value for this setting is somewhere between 87380 bytes and 6M bytes based on the amount of memory in the system.
The recommendation is to use the maximum value of 16M bytes or higher (kernel level dependent) especially for 10 Gigabit adapters.
min: Minimal size of receive buffer used by TCP sockets. It is guaranteed to each TCP socket, even under moderate memory pressure. Default: 4K
default: initial size of receive buffer used by TCP sockets. This value overrides net.core.rmem_default used by other protocols. Default: 131072 bytes. This value results in initial window of 65535.
max: maximal size of receive buffer allowed for automatically selected receiver buffers for TCP socket. This value does not override net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables automatic tuning of that socket’s receive buffer size, in which case this value is ignored. Default: between 131072 and 6MB, depending on RAM size.
According to some sources, since Linux kernel v4.15, net.ipv4.tcp_rmem and net.ipv4.tcp_wmem are namespaced, so different containers / Linux network namespaces can configure them independently.
[TCP.IP Illustrated Vol1]
Large Buffers and Linux TCP Auto-Tuning
min: Amount of memory reserved for send buffers for TCP sockets. Each TCP socket has rights to use it due to fact of its birth. Default: 4K
default: initial size of send buffer used by TCP sockets. This value overrides net.core.wmem_default used by other protocols. It is usually lower than net.core.wmem_default. Default: 16K
max: Maximal amount of memory allowed for automatically tuned send buffers for TCP sockets. This value does not override net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables automatic tuning of that socket’s send buffer size, in which case this value is ignored. Default: between 64K and 4MB, depending on RAM size.
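This sets a limit/threshold on TCP memory usage at the level of the whole machine (node). The accounting for this limit naturally includes the memory used at runtime by buffers sized under tcp_rmem / tcp_wmem and the like.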
https://cromwell-intl.com/open-source/performance-tuning/tcp.html
Parameter tcp_mem is the amount of memory in 4096-byte pages totaled across all TCP applications. It contains three numbers: the minimum, pressure, and maximum. The pressure is the threshold at which TCP will start to reclaim buffer memory to move memory use down toward the minimum. You want to avoid hitting that threshold.
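To get a feel for the scale involved, here is a rough upper-bound calculation for the scenario in this post, assuming 10,000 connections and the 16 MiB tcp_rmem / tcp_wmem maximums listed earlier. It is only a ceiling: as quoted above, buffer memory is not preallocated, so real usage is normally far lower. The function name is a hypothetical helper for illustration.

```python
PAGE_SIZE = 4096  # bytes per page, matching the 4096-byte pages in the quote above


def worst_case_tcp_pages(connections: int, rmem_max: int, wmem_max: int) -> int:
    """Pages needed if every socket filled both its receive and send buffer to the max."""
    total_bytes = connections * (rmem_max + wmem_max)
    return -(-total_bytes // PAGE_SIZE)  # ceiling division


if __name__ == "__main__":
    pages = worst_case_tcp_pages(10_000, 16_777_216, 16_777_216)
    print(f"{pages} pages, about {pages * PAGE_SIZE / 2**30:.1f} GiB")
```

Comparing this ceiling with the third value of tcp_mem on the node shows how far the per-socket maximums can theoretically exceed the node-wide limit.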
https://hechao.li/2022/09/30/a-tcp-timeout-investigation/
Kernel error message when OOM:
kernel: TCP: out of memory -- consider tuning tcp_mem
Who sets tcp_mem in the first place?
One observation is that the tcp_mem value is different on different instance types. An instance with a larger memory also has a larger tcp_mem value. Digging deeper, I found that by default this value is set by the Linux kernel using this formula. Also pasting the code here:
    static void __init tcp_init_mem(void)
    {
        unsigned long limit = nr_free_buffer_pages() / 16;

        limit = max(limit, 128UL);
        sysctl_tcp_mem[0] = limit / 4 * 3;          /* 4.68 % */
        sysctl_tcp_mem[1] = limit;                  /* 6.25 % */
        sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;  /* 9.37 % */
    }
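The same arithmetic can be replayed outside the kernel to predict tcp_mem for a given machine size. This is a sketch that mirrors the integer math of the quoted function. The input is the number of free buffer pages, which is not the same as total RAM, and the 4,194,304-page figure used below (16 GiB worth of 4 KiB pages) is a hypothetical input chosen for illustration.

```python
def tcp_init_mem(nr_free_buffer_pages: int) -> tuple:
    """Return (low, pressure, high) in pages, mirroring the kernel's integer arithmetic."""
    limit = max(nr_free_buffer_pages // 16, 128)
    low = limit // 4 * 3  # C evaluates limit / 4 * 3 left to right with integer division
    pressure = limit
    high = low * 2
    return low, pressure, high


if __name__ == "__main__":
    low, pressure, high = tcp_init_mem(4_194_304)  # hypothetical: 16 GiB of 4 KiB pages
    print(low, pressure, high)
```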
tcp_mem (since Linux 2.4)
This is a vector of 3 integers: [low, pressure, high]. These bounds, measured in units of the system page size, are used by TCP to track its memory usage. The defaults are calculated at boot time from the amount of available memory. (TCP can only use low memory for this, which is limited to around 900 megabytes on 32-bit systems. 64-bit systems do not suffer this limitation.)
low: TCP doesn't regulate its memory allocation when the number of pages it has allocated globally is below this number.
pressure: When the amount of memory allocated by TCP exceeds this number of pages, TCP moderates its memory consumption. This memory pressure state is exited once the number of pages allocated falls below the low mark.
high: The maximum number of pages, globally, that TCP will allocate. This value overrides any other limits imposed by the kernel.
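A quick way to see where a node stands relative to these three marks is to compare the `mem` counter on the `TCP:` line of /proc/net/sockstat (reported in pages) with the tcp_mem vector. The sketch below works on text passed in, so it can be tried without a live system. The sample content and both helper names are assumptions for illustration. The classification is simplified, because the real pressure state has hysteresis between the low and pressure marks.

```python
def parse_sockstat_tcp_mem(sockstat_text: str) -> int:
    """Return the TCP 'mem' counter (in pages) from /proc/net/sockstat content."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]
            stats = dict(zip(fields[0::2], map(int, fields[1::2])))
            return stats["mem"]
    raise ValueError("no TCP line found")


def classify(mem_pages: int, tcp_mem: tuple) -> str:
    """Place the current usage relative to the (low, pressure, high) marks."""
    low, pressure, high = tcp_mem
    if mem_pages >= high:
        return "at or above high"
    if mem_pages > pressure:
        return "above pressure"
    if mem_pages < low:
        return "below low"
    return "between low and pressure (state depends on history)"


# Illustrative content; on a live system read /proc/net/sockstat instead.
SAMPLE_SOCKSTAT = """sockets: used 291
TCP: inuse 5 orphan 0 tw 0 alloc 7 mem 1
UDP: inuse 3 mem 2
"""

if __name__ == "__main__":
    used = parse_sockstat_tcp_mem(SAMPLE_SOCKSTAT)
    print(used, classify(used, (196608, 262144, 393216)))
```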
tcp_adv_win_scale (integer; default: 2; since Linux 2.4)
Count buffering overhead as bytes/2^tcp_adv_win_scale, if tcp_adv_win_scale is greater than 0; or bytes-bytes/2^(-tcp_adv_win_scale), if tcp_adv_win_scale is less than or equal to zero.
The socket receive buffer space is shared between the application and kernel.
The tcp_adv_win_scale default value of 2 implies that the space used for the application buffer is one fourth that of the total.
net.ipv4.tcp_adv_win_scale is a (non-intuitive) number used to account for the overhead needed by Linux to process packets. The receive window is specified in terms of user payload bytes. Linux needs additional memory beyond that to track other data associated with packets it is processing.
The value of the receive window changes during the lifetime of a TCP session, depending on a number of factors. The maximum value that the receive window can be is limited by the amount of free memory available in the receive buffer, according to this table:
tcp_adv_win_scale | TCP window size |
---|---|
4 | 15/16 * available memory in receive buffer |
3 | ⅞ * available memory in receive buffer |
2 | ¾ * available memory in receive buffer |
1 | ½ * available memory in receive buffer |
0 | available memory in receive buffer |
-1 | ½ * available memory in receive buffer |
-2 | ¼ * available memory in receive buffer |
-3 | ⅛ * available memory in receive buffer |
We can intuitively (and correctly) understand that the amount of available memory in the receive buffer is the difference between the used memory and the maximum limit. But what is the maximum size a receive buffer can be? The answer is sk_rcvbuf.
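The table above maps to a small piece of shift arithmetic. The following sketch reproduces it for a given amount of available buffer space. The function name is borrowed from the kernel helper with a similar purpose, but this is only an illustration of the table, and kernels may compute the window differently across versions.

```python
def tcp_win_from_space(space: int, tcp_adv_win_scale: int) -> int:
    """Window size in bytes for `space` bytes of available receive buffer memory."""
    if tcp_adv_win_scale <= 0:
        # Zero or negative scale: the window is space / 2^(-scale).
        return space >> (-tcp_adv_win_scale)
    # Positive scale: the overhead is space / 2^scale, and the rest is window.
    return space - (space >> tcp_adv_win_scale)


if __name__ == "__main__":
    for scale in (4, 3, 2, 1, 0, -1, -2, -3):
        print(scale, tcp_win_from_space(131072, scale))
```

With 131072 bytes of space and a scale of 1, the result is 65536, which is in line with the quoted statement that a 131072-byte default buffer yields an initial window of about 65535.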
conf | Description |
---|---|
net.ipv4.tcp_keepalive_intvl | interval in seconds between subsequent keepalive probes. |
net.ipv4.tcp_keepalive_time | interval in seconds before the first keepalive probe. |
net.ipv4.tcp_keepalive_probes | Maximum number of unacknowledged probes before the connection is considered dead. |
Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to 4096. (Was 128 before linux-5.4) See also tcp_max_syn_backlog for additional tuning for TCP sockets.
An in-depth analysis of somaxconn in containers: http://arthurchiao.art/blog/the-mysterious-container-somaxconn/
Maximal number of remembered connection requests (SYN_RECV), which have not received an acknowledgment from connecting client.
This is a per-listener limit.
The minimal value is 128 for low memory machines, and it will increase in proportion to the memory of machine.
If server suffers from overload, try increasing this number.
Remember to also check /proc/sys/net/core/somaxconn
A SYN_RECV request socket consumes about 304 bytes of memory.
sk_rcvbuf is a per-socket field that specifies the maximum amount of memory that a receive buffer can allocate. This can be set programmatically with the socket option SO_RCVBUF. This can sometimes be useful to do, for localhost TCP sessions, for example, but in general the use of SO_RCVBUF is not recommended.
So how is sk_rcvbuf set? The most appropriate value for that depends on the latency of the TCP session and other factors. This makes it difficult for L7 applications to know how to set these values correctly, as they will be different for every TCP session. The solution to this problem is Linux autotuning.
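CGroup, the resource isolation layer used by containers, can be used to limit a container's TCP-related kernel memory usage. However, Kubernetes does not seem to support setting this limit yet.
Kernel documentation on cgroup memory TCP control: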
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
Brief summary of control files.
    memory.kmem.limit_in_bytes          # set/show hard limit for kernel memory
    memory.kmem.usage_in_bytes          # show current kernel memory allocation
    memory.kmem.failcnt                 # show the number of kernel memory usage hits limits
    memory.kmem.max_usage_in_bytes      # show max kernel memory usage recorded
    memory.kmem.tcp.limit_in_bytes      # set/show hard limit for tcp buf memory
    memory.kmem.tcp.usage_in_bytes      # show current tcp buf memory allocation
    memory.kmem.tcp.failcnt             # show the number of tcp buf memory usage hits limits
    memory.kmem.tcp.max_usage_in_bytes  # show max tcp buf memory usage recorded
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
With the Kernel memory extension, the Memory Controller is able to limit the amount of kernel memory used by the system. Kernel memory is fundamentally different than user memory, since it can’t be swapped out, which makes it possible to DoS the system by consuming too much of this precious resource.
Kernel memory accounting is enabled for all memory cgroups by default. But it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel at boot time. In this case, kernel memory will not be accounted at all.
Kernel memory limits are not imposed for the root cgroup. Usage for the root cgroup may or may not be accounted. The memory used is accumulated into memory.kmem.usage_in_bytes, or in a separate counter when it makes sense (currently only for tcp). The main “kmem” counter is fed into the main counter, so kmem charges will also be visible from the user counter.
Currently no soft limit is implemented for kernel memory. It is future work to trigger slab reclaim when those limits are reached.
2.7.1 Current Kernel Memory resources accounted
stack pages: every process consumes some stack pages. By accounting into kernel memory, we prevent new processes from being created when the kernel memory usage is too high.
slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy of each kmem_cache is created every time the cache is touched for the first time from inside the memcg. The creation is done lazily, so some objects can still be skipped while the cache is being created. All objects in a slab page should belong to the same memcg. This only fails to hold when a task is migrated to a different memcg during the page allocation by the cache.
sockets memory pressure: some sockets protocols have memory pressure thresholds. The Memory Controller allows them to be controlled individually per cgroup, instead of globally.
tcp memory pressure: sockets memory pressure for the tcp protocol.
2.7.2 Common use cases
Because the kmem counter is fed to the main user counter, kernel memory can never be limited completely independently of user memory. Say “U” is the user limit, and “K” the kernel limit. There are three possible ways limits can be set:
U != 0, K = unlimited: This is the standard memcg limitation mechanism already present before kmem accounting. Kernel memory is completely ignored.
U != 0, K < U: Kernel memory is a subset of the user memory. This setup is useful in deployments where the total amount of memory per-cgroup is overcommitted. Overcommitting kernel memory limits is definitely not recommended, since the box can still run out of non-reclaimable memory. In this case, the admin could set up K so that the sum of all groups is never greater than the total memory, and freely set U at the cost of his QoS. WARNING: In the current implementation, memory reclaim will NOT be triggered for a cgroup when it hits K while staying below U, which makes this setup impractical.
U != 0, K >= U: Since kmem charges will also be fed to the user counter and reclaim will be triggered for the cgroup for both kinds of memory. This setup gives the admin a unified view of memory, and it is also useful for people who just want to track kernel memory usage.
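On a node that still uses cgroup v1, the TCP control files listed earlier can be read per cgroup to see how much TCP buffer memory each container is charged for. This is a minimal sketch: the file names come from the kernel document quoted above, while `read_kmem_tcp` is a hypothetical helper and the directory you pass in (for example, a container's directory under the memory controller mount) depends on how the node lays out its cgroups.

```python
from pathlib import Path

TCP_FILES = ("limit_in_bytes", "usage_in_bytes", "failcnt", "max_usage_in_bytes")


def read_kmem_tcp(cgroup_dir: str) -> dict:
    """Read the memory.kmem.tcp.* counters that exist in one cgroup v1 directory."""
    result = {}
    for suffix in TCP_FILES:
        path = Path(cgroup_dir) / f"memory.kmem.tcp.{suffix}"
        if path.exists():  # files are absent on cgroup v2 or without kmem accounting
            result[suffix] = int(path.read_text().strip())
    return result
```

Calling `read_kmem_tcp` with a directory that lacks these files returns an empty dict, which is also what to expect on cgroup v2 hosts.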
lwn.net : per-cgroup tcp buffer pressure settings
This patch introduces per-cgroup tcp buffers limitation. This allows sysadmins to specify a maximum amount of kernel memory that tcp connections can use at any point in time. TCP is the main interest in this work, but extending it to other protocols would be easy.
For this to work, I am hooking it into memcg, after the introduction of an extension for tracking and controlling objects in kernel memory. Since they are usually not found in page granularity, and are fundamentally different from userspace memory (not swappable, can’t overcommit), they need their special place inside the Memory Controller.
Right now, the kmem extension is quite basic, and just lays down the basic infrastructure for the ongoing work.
Although it does not account kernel memory allocated - I preferred to keep this series simple and leave accounting to the slab allocations when they arrive.
What it does is to piggyback in the memory control mechanism already present in /proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit, that will suppress allocation when reached. For each cgroup, however, the file kmem.tcp_maxmem will be used to cap those values.
The usage I have in mind here is containers. Each container will define its own values for soft and hard limits, but none of them will be possibly bigger than the value the box’ sysadmin specified from the outside.
lwn.net: Per-cgroup TCP buffer limits
Kubernetes does not seem to support this limit yet:
Shall we add resource kernel-memory limit ? #45476
Below is material from Google Cloud on kernel TCP memory isolation in multi-tenant environments:
Google Cloud: TCP Memory Isolation on Multi-tenant Servers - Sep 13, 2022
Accounting TCP memory
/proc/net/sockstat[6] and /proc/net/protocols
Limiting TCP memory
/proc/sys/net/ipv4/tcp_mem (Array of 3 long integers)
Reduce (or prevent increasing) the send or receive buffers for the sockets
On RX
On TX: May throttle the current thread of the sender
Problem 1: Shared unregulated tcp_mem limit
When the TCP memory usage hit the TCP limit:
Low priority jobs can hog TCP memory and adversely impact higher priority jobs
memcg-v1: TCP memory is accounted separately from the memcg memory usage
memcg-v2: TCP memory is accounted as regular memory
https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
A number of sysctls are namespaced in today’s Linux kernels. This means that they can be set independently for each pod on a node. Only namespaced sysctls are configurable via the pod securityContext within Kubernetes.
The following sysctls are known to be namespaced. This list could change in future versions of the Linux kernel.
net.* that can be set in container networking namespace. However, there are exceptions (e.g., net.netfilter.nf_conntrack_max and net.netfilter.nf_conntrack_expect_max can be set in container networking namespace but they are unnamespaced).
Sysctls with no namespace are called node-level sysctls. If you need to set them, you must manually configure them on each node’s operating system, or by using a DaemonSet with privileged containers.
Use the pod securityContext to configure namespaced sysctls. The securityContext applies to all containers in the same pod.
This example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two unsafe sysctls net.core.somaxconn and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.
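Applied to the topic of this post, the same securityContext mechanism could carry per-Pod tcp_rmem / tcp_wmem values. The sketch below builds such a Pod manifest as a Python dict and prints it as JSON, which kubectl also accepts. The Pod name, the image reference and the helper function are placeholders invented for illustration, and the values are the Cassandra recommendations listed earlier. Depending on the Kubernetes version, these sysctls may count as unsafe and then have to be allowed on the kubelet first.

```python
import json


def pod_with_sysctls(name: str, image: str, sysctls: dict) -> dict:
    """Build a minimal Pod manifest whose securityContext sets the given sysctls."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {
                # Kubernetes expects sysctl values as strings.
                "sysctls": [{"name": k, "value": v} for k, v in sysctls.items()],
            },
            "containers": [{"name": name, "image": image}],
        },
    }


manifest = pod_with_sysctls(
    "sysctl-example",                          # placeholder Pod name
    "registry.invalid/cassandra:placeholder",  # placeholder image reference
    {
        "net.ipv4.tcp_rmem": "4096 87380 16777216",
        "net.ipv4.tcp_wmem": "4096 65536 16777216",
    },
)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```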
https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/