https://wenfh2020.com/2021/10/12/thundering-herd-tcp-reuseport/
SO_REUSEPORT (reuseport) 是网络的一个选项设置,它能开启内核功能:网络链接分配 内核负载均衡
。
该功能允许多个进程/线程 bind/listen 相同的 IP/PORT,提升了新链接的分配性能。
nginx 开启 reuseport 功能后,性能有立竿见影的提升,我们结合 tcp 协议分析 nginx 的 reuseport 功能。
reuseport 也是内核解决 惊群问题
的优秀方案。
每个进程可以 bind/listen 相同的 IP/PORT,相当于每个进程拥有独立的 listen socket 的完全队列,避免了共享 listen socket 的资源争抢,提升了并发的吞吐。
内核通过哈希算法,将新链接相对均衡地分配到各个开启了 reuseport 属性的进程,所以资源的负载均衡得到解决。
从下面这段英文提取一些关键信息:
SO_REUSEPORT 是网络的一个选项设置,它允许多个进程/线程 bind/listen 相同的 IP/PORT,在 TCP 的应用中,它是一个新链接分发的(内核)负载均衡功能,它提升了新链接的分配性能(针对 accept )。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
Socket options
The socket options listed below can be set by using setsockopt(2)
and read with getsockopt(2) with the socket level set to
SOL_SOCKET for all sockets. Unless otherwise noted, optval is a
pointer to an int.
...
SO_REUSEPORT (since Linux 3.9)
Permits multiple AF_INET or AF_INET6 sockets to be bound
to an identical socket address. This option must be set
on each socket (including the first socket) prior to
calling bind(2) on the socket. To prevent port hijacking,
all of the processes binding to the same address must have
the same effective UID. This option can be employed with
both TCP and UDP sockets.
For TCP sockets, this option allows accept(2) load
distribution in a multi-threaded server to be improved by
using a distinct listener socket for each thread. This
provides improved load distribution as compared to
traditional techniques such using a single accept(2)ing
thread that distributes connections, or having multiple
threads that compete to accept(2) from the same socket.
For UDP sockets, the use of this option can provide better
distribution of incoming datagrams to multiple processes
(or threads) as compared to the traditional technique of
having multiple processes compete to receive datagrams on
the same socket.
|
注释文字来源 (链接需要FQ):socket(7) — Linux manual page
SO_REUSEPORT 功能解决了什么问题?我们先看看 2013 年 3.9+ 版本内核提交的这个 Linux 内核功能 补丁 的注释。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
soreuseport: TCP/IPv4 implementation
Allow multiple listener sockets to bind to the same port.
Motivation for soresuseport would be something like a web server
binding to port 80 running with multiple threads, where each thread
might have it's own listener socket. This could be done as an
alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads. In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets. We have seen the disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest. With so_reusport the distribution is
uniform.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
master
v5.13
…
v3.9-rc1
@davem330
Tom Herbert authored and davem330 committed on 24 Jan 2013
1 parent 055dc21 commit da5e36308d9f7151845018369148201a5d28b46d
|
reuseport 选项主要解决了两个问题:
其实它还解决了一个很重要的问题:
在 tcp 多线程场景中,(B 图)服务端如果所有新链接只保存在一个 listen socket 的 全链接队列
中,那么多个线程去这个队列里获取(accept)新的链接,势必会出现多个线程对一个公共资源的争抢,争抢过程中,大量资源的损耗。
而(C 图)有多个 listener 共同 bind/listen 相同的 IP/PORT,也就是说每个进程/线程有一个独立的 listener,相当于每个进程/线程独享一个 listener 的全链接队列,不需要多个进程/线程竞争某个公共资源,能充分利用多核,减少竞争的资源消耗,效率自然提高了。
SO_REUSEPORT 功能使用,可以通过网络选项进行设置,在 bind 前面设置即可,使用比较简单。
1 2 3 |
int fd, reuse = 1;
fd = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, (const void *)&reuse, sizeof(int));
|
TCP 客户端链接服务端,第一次握手,服务端被动收到第一次握手 SYN 包,内核就通过哈希算法,将客户端的链接分派到内核半链接队列,三次握手成功后,再将这个链接从半链接队列移动到某个 listener 的全链接队列中,提供 accept 获取。
图片来源:TCP 三次握手(内核)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 |
/* include/net/inet_hashtables.h */
static inline struct sock *__inet_lookup(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
const __be32 saddr, const __be16 sport,
const __be32 daddr, const __be16 dport,
const int dif, const int sdif,
bool *refcounted)
{
u16 hnum = ntohs(dport);
struct sock *sk;
/* skb 包,从 established 哈希表中查找是否已有 established 的包。*/
sk = __inet_lookup_established(net, hashinfo, saddr, sport,
daddr, hnum, dif, sdif);
*refcounted = true;
if (sk)
return sk;
*refcounted = false;
/* 上面没找到,那么就找一个合适的 listener,去走三次握手流程。 */
return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
sport, daddr, hnum, dif, sdif);
}
/* net/ipv4/inet_hashtables.c */
struct sock *__inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
const __be32 saddr, __be16 sport,
const __be32 daddr, const unsigned short hnum,
const int dif, const int sdif) {
struct inet_listen_hashbucket *ilb2;
struct sock *result = NULL;
unsigned int hash2;
/* 通过目标 ip/port 哈希值从 hashinfo.lhash2 查找对应的 slot。*/
hash2 = ipv4_portaddr_hash(net, daddr, hnum);
ilb2 = inet_lhash2_bucket(hashinfo, hash2);
/* 再从对应的 slot 上,搜索哈希链上的数据。 */
result = inet_lhash2_lookup(net, ilb2, skb, doff,
saddr, sport, daddr, hnum,
dif, sdif);
...
return result |