Last week, a very small number of our users who are using IP tunnels (primarily tunneling IPv6 over IPv4) were unable to access our services because a networking change broke "path MTU discovery" on our servers. In this article, I'll explain what path MTU discovery is, how we broke it, how we fixed it, and the open source code we used.
When a host on the Internet wants to send some data, it must know how to divide the data into packets. In particular, it needs to know the maximum packet size. The maximum size of a packet a host can send is called the Maximum Transmission Unit: MTU.
The larger the MTU the better for performance, but the worse for reliability. This is because a lost packet means more data to be retransmitted, and because many routers on the Internet can't deliver very large packets.
The fathers of the Internet assumed that this problem would be solved at the IP layer with IP fragmentation. Unfortunately IP fragmentation has serious disadvantages, and it's avoided in practice.
To work around fragmentation problems, every IP packet carries a "Don't Fragment" bit in its header. It forbids any router on the path from performing fragmentation.
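For illustration, here's a minimal Python sketch (Linux-only; the constant values are copied from <linux/in.h>, since Python's socket module doesn't always export them) that sets the Don't Fragment bit on everything a UDP socket sends:

import socket

# Option values from <linux/in.h>; not always exported by Python's socket module.
IP_MTU_DISCOVER = 10
IP_PMTUDISC_DO = 2   # always set DF; never fragment locally

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
# Every datagram sent on s now carries "Don't Fragment"; oversized sends
# fail with EMSGSIZE instead of being fragmented by the kernel.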
So what does the router do if it can't deliver a packet and can't fragment it either?
That's when the fun begins.
According to RFC1191:
When a router is unable to forward a datagram because it exceeds the MTU of the next-hop network and its Don't Fragment bit is set, the router is required to return an ICMP Destination Unreachable message to the source of the datagram, with the Code indicating "fragmentation needed and DF set".
So a router must send an ICMP type 3, code 4 message. If you want to see one, type:
tcpdump -s0 -p -ni eth0 'icmp and icmp[0] == 3 and icmp[1] == 4'
This ICMP message is supposed to be delivered to the originating host, which in turn should adjust the MTU setting for that particular connection. This mechanism is called Path MTU Discovery.
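On Linux you can also ask the kernel what path MTU it has learned for a destination, via the IP_MTU socket option. A small sketch (the destination is just a placeholder, and the constant again comes from <linux/in.h>):

import socket

IP_MTU = 14  # from <linux/in.h>

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("192.0.2.1", 9))  # placeholder destination; connect() picks the route
print("current path MTU estimate: %d" % s.getsockopt(socket.IPPROTO_IP, IP_MTU))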
In theory, it's great but unfortunately many things can go wrong when delivering ICMP packets. The most common problems are caused by misconfigured firewalls that drop ICMP packets.
A situation in which a router drops packets but for some reason can't deliver the relevant ICMP messages is called an ICMP black hole.
When that happens the whole connection gets stuck. The sending side constantly tries to resend lost packets, while the receiving side acknowledges only the small packets that get delivered.
There are generally three solutions to this problem.
1) Reduce the MTU on the client side.
The network stack hints at its MTU in SYN packets when it opens a TCP/IP connection. This is called the MSS TCP option. You can see it in tcpdump, for example on my host (notice mss 1460):
$ sudo tcpdump -s0 -p -ni eth0 '(ip and ip[20+13] & tcp-syn != 0)'
10:24:13.920656 IP 192.168.97.138.55991 > 212.77.100.83.23: Flags [S], seq 711738977, win 29200, options [mss 1460,sackOK,TS val 6279040 ecr 0,nop,wscale 7], length 0
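The arithmetic behind that number: the MSS is the MTU minus the IP and TCP headers, 20 bytes each for IPv4 without options. A quick sketch:

# MSS = MTU - IP header - TCP header (IPv4, no options assumed)
def mss_for(mtu, ip_hdr=20, tcp_hdr=20):
    return mtu - ip_hdr - tcp_hdr

print(mss_for(1500))  # 1460, the value in the tcpdump output above
print(mss_for(1440))  # 1400, matching the advmss example below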
If you know you are behind a tunnel you might consider advising your operating system to reduce the advertised MTUs. In Linux it's as simple as (notice the advmss 1400):
$ ip route change default via <> advmss 1400
$ sudo tcpdump -s0 -p -ni eth0 '(ip and ip[20+13] & tcp-syn != 0)'
10:25:48.791956 IP 192.168.97.138.55992 > 212.77.100.83.23: Flags [S], seq 389206299, win 28000, options [mss 1400,sackOK,TS val 6302758 ecr 0,nop,wscale 7], length 0
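If you'd rather not touch the routing table, a per-socket alternative (a sketch; TCP_MAXSEG is a standard TCP socket option on Linux) is to clamp the MSS that a single connection advertises:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Clamp the MSS before connect() so the SYN advertises at most 1400.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1400)
s.connect(("example.com", 80))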
2) Reduce the MTU of all connections to the minimal value on the server side.
It's a much harder problem to solve on the server side. A server can encounter clients with many different MTUs, and if you can't rely on Path MTU Discovery, you have a real problem. One commonly used workaround is to reduce the MTU for all of the outgoing packets. The minimal required MTU for all IPv6 hosts is 1,280, which is fair. Unfortunately for IPv4 the value is 576 bytes. On the other hand RFC4821 suggests that it's "probably safe enough" to assume minimal MTU of 1,024. To change the MTU setting you can type:
ip -f inet6 route replace <> mtu 1280
(As a side note you could use advmss instead. The mtu setting is stronger, applying to all packets, not only TCP, while advmss affects only the path MTU hints given at the TCP layer.)
Forcing a reduced packet size is not an optimal solution though.
3) Enable smart MTU black hole detection.
RFC4821 proposes a mechanism to detect ICMP black holes and adjust the path MTU in a smart way. To enable this on Linux, type:
echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing
echo 1024 > /proc/sys/net/ipv4/tcp_base_mss
The second setting bumps the starting MSS used in discovery from a miserable default of 512 bytes to an RFC4821 suggested 1,024.
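A trivial way to confirm both settings took effect is to read the same /proc entries back:

# Read back the two sysctls set above (Linux only).
for name in ("tcp_mtu_probing", "tcp_base_mss"):
    with open("/proc/sys/net/ipv4/" + name) as f:
        print("%s = %s" % (name, f.read().strip()))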
To understand what happened last week, first we need to make a detour and discuss our architecture. At CloudFlare we use BGP in two ways. We use it externally, the BGP that organizes the Internet, and internally between servers.
The internal use of BGP allows us to get high availability across servers. If a machine goes down the BGP router will notice and will push the traffic to another server automatically.
We've used this for a long time, but last week we increased its use in our network as we expanded internal BGP load balancing. The load balancing mechanism is known as ECMP: Equal Cost Multi Path routing. From the configuration point of view the change was minimal: it only required adjusting the BGP metrics to equal values; the ECMP-enabled router does the rest of the work.
But with ECMP the router must be somewhat smart. It can't do true load balancing, as packets from one TCP connection could end up on the wrong server. Instead, in ECMP the router computes a hash from the usual tuple extracted from every packet: (src ip, src port, dst ip, dst port).
It then uses this hash to choose the destination server. This guarantees that packets from one "flow" will always hit the same server.
ECMP will indeed forward TCP packets in a session to the appropriate server. Unfortunately, it has no special knowledge of ICMP and it hashes only (src ip, dst ip).
Here the source IP is most likely an IP of a router on the Internet. In practice, this may sometimes end up delivering the ICMP packets to a different server than the one handling the TCP connection.
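A toy model of the mismatch (this is an illustration, not the router's real hash function):

import zlib

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical backends

def pick_server(*fields):
    key = "|".join(str(f) for f in fields).encode()
    return SERVERS[zlib.crc32(key) % len(SERVERS)]

# TCP packets are spread over the full tuple...
tcp_target = pick_server("198.51.100.7", 40123, "203.0.113.1", 443)
# ...but an ICMP error sent by a router on the path carries no ports,
# so it is hashed on (src ip, dst ip) only:
icmp_target = pick_server("192.0.2.254", "203.0.113.1")
print(tcp_target, icmp_target)  # the two may differ: an ICMP black hole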
This is exactly what happened in our case.
A couple of users who were accessing CloudFlare via IPv6 tunnels reported this problem. They were affected because almost every IPv6 tunnel has a reduced MTU.
As a temporary fix we reduced the MTU for all IPv6 paths to 1,280 (solution mentioned as #2). Many other providers have the same problem and use this trick on IPv6, and never send IPv6 packets greater than 1,280 bytes.
For better or worse this problem is not so prevalent on IPv4. Fewer people use IPv4 tunneling, and IPv4 path MTU problems have been well understood for much longer. As a temporary fix for IPv4, we deployed RFC4821 path MTU discovery (the solution mentioned in #3 above).
In the meantime we were working on a comprehensive and reliable solution that restores the real MTUs. The idea is to broadcast ICMP MTU messages to all servers. That way we can guarantee that ICMP messages will hit the relevant server that handles a particular flow, no matter where the ECMP router decides to forward them.
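A rough sketch of that idea (this is not the actual daemon we open sourced; the sibling list is hypothetical, and resending over plain IP is a simplification of the real rebroadcast):

#!/usr/bin/python
from scapy.all import ICMP, IP, send, sniff

SIBLINGS = ["10.0.0.2", "10.0.0.3"]  # hypothetical sibling servers

def rebroadcast(pkt):
    # Replay the ICMP "fragmentation needed" message to every sibling,
    # so the server owning the TCP flow sees it whatever ECMP decided.
    for peer in SIBLINGS:
        send(IP(dst=peer) / pkt[ICMP], verbose=0)

sniff(prn=rebroadcast, store=0,
      filter="icmp and icmp[0] == 3 and icmp[1] == 4")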
We're happy to open source our small "Path MTU Daemon" that does the work.
Delivering Path MTU ICMP messages correctly in a network that uses ECMP for internal routing is a surprisingly hard problem. While there is still plenty to improve, we're happy to report:
Like many other companies we've reduced the MTU on IPv6 to a safe value of 1,280. We'll consider bumping it to improve performance once we get more confident that our other fixes are working as intended.
We're rolling out PMTUD to ensure "proper" routing of ICMP's path MTU messages within our datacenters.
We're enabling the Linux RFC4821 Path MTU Discovery for IPv4, which should help with true ICMP black holes.
Found this article interesting? Join our team, including our elite office in London. We're constantly looking for talented developers and network engineers!
To simulate a black hole we used this iptables rule:
iptables -I INPUT -s <src_ip> -m length --length 1400:1500 -j DROP
And this Scapy snippet to simulate MTU ICMP messages:
#!/usr/bin/python
from scapy.all import *

def callback(pkt):
    # Forge an ICMP "fragmentation needed and DF set" (type 3, code 4)
    # advertising a next-hop MTU of 1280, and quote the first 28 bytes
    # of the offending packet (IP header + 8 bytes) as RFC 792 requires.
    # Very old Scapy versions spelled the MTU field "unused=1280".
    pkt2 = IP(dst=pkt[IP].src) / \
           ICMP(type=3, code=4, nexthopmtu=1280) / \
           bytes(pkt[IP])[:28]
    pkt2.show()
    send(pkt2)

if __name__ == '__main__':
    sniff(prn=callback,
          store=0,
          filter="ip and src <src_ip> and greater 1400")
Translated from: https://blog.cloudflare.com/path-mtu-discovery-in-practice/