[Repost] A Network Outage Caused by 0.03 Seconds


Editor's Comments

## Fault Analysis Summary

The issue appears to be a transient network outage on the OpenStack network, specifically on the `br-int` network interface. It shows up as Pod connectivity dropping at random, with the affected Pods unable to ping each other.

**Key observations:**

* The problem only occurs occasionally, suggesting a transient nature.
* It seems to affect both incoming and outgoing traffic.
* The problem is not tied to specific nodes, indicating it might be related to network communication.
* The problem can be resolved by simply recreating the Pods involved, indicating that the issue might be related to the Pod communication itself.

**Possible causes:**

* **Network outage on the `br-int` network interface:** This seems the most likely cause, as `br-int` is directly involved in the Pod communication and the outage affects both incoming and outgoing traffic.
* **Transient network connectivity issues:** Temporary glitches or other problems with network connectivity.
* **Bug in the OpenStack agent or OVS itself:** A bug in either component could be causing the outage.
* **Misconfiguration of the network monitoring or alerting system:** Monitoring could be missing, or failing to detect the outage properly.

**Further troubleshooting steps:**

* Review the logs from the OpenStack agent and OVS for relevant error messages or warnings.
* Check the network infrastructure for issues or outages.
* Investigate the network configuration and ensure it is functioning properly.
* Verify the health and performance of the OpenStack agent and OVS.
* Review the logs from the affected Pods for error messages.
* If the issue persists, consider reaching out to the OpenStack community forums or community support channels for further assistance.

Main Text

https://www.jianshu.com/p/45085331b9f0

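Background

The user runs Pike-release Openstack with Openvswitch as the firewall driver. A K8S cluster is deployed across several virtual machines on one tenant network inside Openstack. The Openstack tenant network uses VxLAN, and the K8S cluster uses the Calico IPinIP networking scheme.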

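Fault Occurrence

The user reported that Pods in the K8S cluster intermittently lost network connectivity. The failures looked random: they were not tied to particular compute nodes or particular virtual machines, and they occurred at unpredictable times. Without changing any virtual machine or network resource in Openstack, the user could restore connectivity simply by recreating the Pod. The Pods that lost connectivity had been communicating normally beforehand. Because the user resolved the first two incidents by recreating the Pods, nothing was left at the scene to examine and troubleshooting made no real progress. We therefore asked the user to preserve the scene the next time the fault appeared, so that the root cause could be investigated before recovery.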
PS：Random failure is quite a challenge.

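Fault Scene

By chance we caught the fault scene again. As the figure below shows, Pods on different compute nodes pinged each other pairwise, and only one pair of Pods could not ping each other.

Capturing packets on the nodes hosting these two Pods (points T1, T2, P1 and P2 in the figure below) let us tentatively place the drop point at br-int on Com1: the ICMP Response sent by Pod1 vanished on br-int.

While pinging continuously from the Pod, we watched the flow table on br-int of node Com1 and found the flow below, whose counters grew in step with the ping. Checking the connection tracking records on the same OVS at the same time, we could clearly see that the flow from VM1 to VM2 with protocol number 4 (that is, IPinIP packets) was marked "mark=1". This essentially confirms that the IPinIP packets from VM1 to VM2 hit this flow, which is why Pod1 and Pod2 could not reach each other.
The flow that keeps dropping packets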

cookie=0xc731e03bbfa62732, duration=8725.531s, table=72, n_packets=16642, n_bytes=1668970, idle_age=0, priority=50,ct_mark=0x1,reg5=0x362 actions=drop
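The check described above, watching for a drop flow whose counter grows while the ping runs, can be sketched by diffing n_packets between two flow dumps. This is a minimal sketch that assumes dump lines in the format printed above. The names `parse_flow` and `growing_drop_flows` are hypothetical, and the second sample (`SAMPLE_LATER`) is invented purely for illustration; it is not data from the incident.

```python
import re

# Matches the "table=72, n_packets=16642" part of a dump line.
COUNTER_RE = re.compile(r"table=(?P<table>\d+),\s*n_packets=(?P<n_packets>\d+)")


def parse_flow(line):
    """Return ((table, match_and_actions), n_packets) for one dump line.

    The key is everything from 'priority=' onward, which stays the same
    between samples while duration and counters change.
    """
    m = COUNTER_RE.search(line)
    start = line.find("priority=")
    if m is None or start < 0:
        raise ValueError(f"unrecognised flow line: {line!r}")
    key = (int(m.group("table")), line[start:].strip())
    return key, int(m.group("n_packets"))


def growing_drop_flows(before_lines, after_lines):
    """Return {key: packet delta} for drop flows whose counter increased."""
    before = dict(parse_flow(line) for line in before_lines)
    growing = {}
    for line in after_lines:
        key, count = parse_flow(line)
        delta = count - before.get(key, 0)
        if delta > 0 and key[1].endswith("actions=drop"):
            growing[key] = delta
    return growing


# First sample: the line shown in the article.
SAMPLE = ("cookie=0xc731e03bbfa62732, duration=8725.531s, table=72, "
          "n_packets=16642, n_bytes=1668970, idle_age=0, "
          "priority=50,ct_mark=0x1,reg5=0x362 actions=drop")
# Second sample: HYPOTHETICAL values, made up for this illustration only.
SAMPLE_LATER = ("cookie=0xc731e03bbfa62732, duration=8735.531s, table=72, "
                "n_packets=16652, n_bytes=1669970, idle_age=0, "
                "priority=50,ct_mark=0x1,reg5=0x362 actions=drop")

if __name__ == "__main__":
    for (table, flow), delta in growing_drop_flows([SAMPLE], [SAMPLE_LATER]).items():
        print(f"table {table}: +{delta} packets on {flow}")
```

Keying on the match and actions rather than on the whole line is what lets the two samples be paired even though duration and counters differ.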

Connection tracking entry in the abnormal state

4,orig=(src=10.19.0.33,dst=10.19.0.35,sport=0,dport=0),reply=(src=10.19.0.35,dst=10.19.0.33,sport=0,dport=0),zone=49,mark=1
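To scan a larger conntrack dump for entries like this one, a small parser helps. The following is a minimal sketch that assumes every entry follows the exact layout printed above (the kind of output `ovs-appctl dpctl/dump-conntrack` produces; the article does not say which command was used). The function names `parse_ct_entry` and `marked_entries` are hypothetical.

```python
import re

# Layout assumed from the entry above: proto,orig=(...),reply=(...),key=value,...
CT_RE = re.compile(
    r"^(?P<proto>\w+),"
    r"orig=\(src=(?P<src>[^,]+),dst=(?P<dst>[^,]+),[^)]*\),"
    r"reply=\([^)]*\)"
    r"(?P<rest>.*)$"
)


def parse_ct_entry(line):
    """Parse one conntrack line into a dict with proto, src, dst, zone, mark."""
    m = CT_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised conntrack line: {line!r}")
    attrs = {}
    for item in m.group("rest").split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return {
        "proto": m.group("proto"),
        "src": m.group("src"),
        "dst": m.group("dst"),
        "zone": int(attrs.get("zone", "0")),
        # base 0 accepts both "1" and "0x1"; a missing mark is treated as 0
        "mark": int(attrs.get("mark", "0"), 0),
    }


def marked_entries(lines, mark=1):
    """Return the parsed entries whose conntrack mark equals `mark`."""
    return [e for e in map(parse_ct_entry, lines) if e["mark"] == mark]


# The entry shown in the article.
SAMPLE_CT = ("4,orig=(src=10.19.0.33,dst=10.19.0.35,sport=0,dport=0),"
             "reply=(src=10.19.0.35,dst=10.19.0.33,sport=0,dport=0),"
             "zone=49,mark=1")

if __name__ == "__main__":
    for entry in marked_entries([SAMPLE_CT]):
        print(entry)
```

Filtering on proto "4" in addition to the mark would narrow the result to IPinIP connections only.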

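Fault Analysis

On br-int in Pike-release Openstack, the flows in table 72 relate to the egress Security Group of the Port attached to the virtual machine (in the Neutron code, the constant for table 72 is named RULES_EGRESS_TABLE). Next we analyze why this situation arises.

Normal Case

Under normal conditions, an ICMP Response carried over IPinIP hits the following flow in table 72 and is forwarded as usual.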

cookie=0xc731e03bbfa62732, duration=8725.528s, table=72, n_packets=68693, n_bytes=12470448, idle_age=0, priority=77,ct_state=+est-rel-rpl,ip,reg5=0x362,nw_proto=4 actions=resubmit(,73)

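Abnormal Case

In the abnormal case the flow above still exists but is not hit. The traffic instead hits the lower-priority flow ct_mark=0x1,reg5=0x362 actions=drop. To hit that flow, the traffic must first have hit ct_state=+est,ip,reg5=0x362 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[])), which sets the mark in the connection tracking table to 1 (invalid). Neither of these should happen under normal conditions.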

cookie=0xc731e03bbfa62732, duration=8725.528s, table=72, n_packets=68693, n_bytes=12470448, idle_age=0, priority=77,ct_state=+est-rel-rpl,ip,reg5=0x362,nw_proto=4 actions=resubmit(,73)
cookie=0xc731e03bbfa62732, duration=8725.531s, table=72, n_packets=16642, n_bytes=1668970, idle_age=0, priority=50,ct_mark=0x1,reg5=0x362 actions=drop
cookie=0xc731e03bbfa62732, duration=8725.531s, table=72, n_packets=2, n_bytes=156, idle_age=8725, priority=40,ct_state=+est,ip,reg5=0x362 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[]))

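Looking more closely at the flows (the duration values) reveals a slight difference in the lifetimes of the three flows, which we put at 0.03 s: the flow that should have matched (duration=8725.528s in the dump above) was installed later than the flows actually hit in the abnormal case (duration=8725.531s). This gives a preliminary conclusion about the cause: a 0.03-second gap in flow installation interrupted the traffic. The detailed analysis is shown in the figure below.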
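The comparison of duration values can be reproduced mechanically. The following is a minimal sketch with a hypothetical helper `install_gaps` that reads the duration of each flow and reports how much later it was installed than the oldest flow in the set. It uses `Decimal` to avoid floating-point noise at the millisecond level. Note that with the durations exactly as printed in the dump above, the computed gap is 0.003 s rather than the 0.03 s quoted in the text; the sketch only reports what the printed values give and does not try to reconcile the two.

```python
import re
from decimal import Decimal

DURATION_RE = re.compile(r"duration=(?P<secs>\d+(?:\.\d+)?)s")


def install_gaps(lines):
    """Return {match_and_actions: seconds installed after the oldest flow}.

    A larger duration means the flow has existed longer, so the flow with
    the largest duration was installed first and gets a gap of 0.
    """
    durations = {}
    for line in lines:
        m = DURATION_RE.search(line)
        start = line.find("priority=")
        if m is None or start < 0:
            raise ValueError(f"unrecognised flow line: {line!r}")
        durations[line[start:].strip()] = Decimal(m.group("secs"))
    oldest = max(durations.values())
    return {flow: oldest - d for flow, d in durations.items()}


# The three table-72 flows exactly as printed in the article.
FLOWS = [
    "cookie=0xc731e03bbfa62732, duration=8725.528s, table=72, n_packets=68693, "
    "n_bytes=12470448, idle_age=0, "
    "priority=77,ct_state=+est-rel-rpl,ip,reg5=0x362,nw_proto=4 actions=resubmit(,73)",
    "cookie=0xc731e03bbfa62732, duration=8725.531s, table=72, n_packets=16642, "
    "n_bytes=1668970, idle_age=0, "
    "priority=50,ct_mark=0x1,reg5=0x362 actions=drop",
    "cookie=0xc731e03bbfa62732, duration=8725.531s, table=72, n_packets=2, "
    "n_bytes=156, idle_age=8725, "
    "priority=40,ct_state=+est,ip,reg5=0x362 "
    "actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[]))",
]

if __name__ == "__main__":
    for flow, gap in install_gaps(FLOWS).items():
        print(f"+{gap}s  {flow}")
```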

A close look at the Neutron code also shows that, in the flow installation sequence, _initialize_tracked_egress runs before create_flows_from_rule_and_port.

def _initialize_tracked_egress(self, port):
    # Drop invalid packets
    self._add_flow(
        table=ovs_consts.RULES_EGRESS_TABLE,
        priority=50,
        ct_state=ovsfw_consts.OF_STATE_INVALID,
        actions='drop'
    )
    # Drop traffic for removed sg rules
    self._add_flow(
        table=ovs_consts.RULES_EGRESS_TABLE,
        priority=50,
        reg_port=port.ofport,
        ct_mark=ovsfw_consts.CT_MARK_INVALID,
        actions='drop'
    )
    ......

def add_flows_from_rules(self, port):
    self._initialize_tracked_ingress(port)
    self._initialize_tracked_egress(port)
    LOG.debug('Creating flow rules for port %s that is port %d in OVS',
              port.id, port.ofport)
    for rule in self._create_rules_generator_for_port(port):
        # NOTE(toshii): A better version of merge_common_rules and
        # its friend should be applied here in order to avoid
        # overlapping flows.
        flows = rules.create_flows_from_rule_and_port(rule, port)
        LOG.debug("RULGEN: Rules generated for flow %s are %s",
                  rule, flows)
        for flow in flows:
            self._accept_flow(**flow)

    self._add_non_ip_conj_flows(port)

    self.conj_ip_manager.update_flows_for_vlan(port.vlan_tag)
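The effect of this ordering can be illustrated with a toy model of a priority-based table lookup. This is a simplified sketch, not Neutron or OVS code: the `Flow` class, the `lookup` function, the dictionary-based packet and the relative clock are all assumptions made for illustration, and `GAP` simply reuses the 0.03 s figure quoted in the article. It models only the three table-72 flows shown earlier and only the first packet that arrives inside the window.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Packet = Dict[str, object]


@dataclass(frozen=True)
class Flow:
    name: str
    priority: int
    installed_at: float  # seconds on a relative clock (assumed)
    matches: Callable[[Packet], bool]


def lookup(table: List[Flow], packet: Packet, now: float) -> Optional[Flow]:
    """Return the highest-priority flow that is installed at `now` and matches."""
    candidates = [f for f in table if f.installed_at <= now and f.matches(packet)]
    return max(candidates, key=lambda f: f.priority, default=None)


GAP = 0.03  # install gap in seconds, taken from the figure quoted in the article

TABLE_72 = [
    # Installed first, by _initialize_tracked_egress.
    Flow("drop_marked", 50, 0.0, lambda p: p["ct_mark"] == 1),
    Flow("mark_established", 40, 0.0, lambda p: bool(p["est"] and p["ip"])),
    # Installed later, generated from the security group rule.
    Flow("accept_ipip", 77, GAP,
         lambda p: bool(p["est"] and not p["rel"] and not p["rpl"]
                        and p["ip"] and p["nw_proto"] == 4)),
]

# An established IPinIP packet whose connection has not been marked yet.
PACKET = {"est": True, "rel": False, "rpl": False,
          "ip": True, "nw_proto": 4, "ct_mark": 0}

if __name__ == "__main__":
    print(lookup(TABLE_72, PACKET, now=GAP / 2).name)  # inside the window
    print(lookup(TABLE_72, PACKET, now=GAP).name)      # accept flow now present
```

Inside the window the only matching flow is the priority-40 one, whose real action commits ct_mark=1 to the connection. Why later packets then keep hitting the ct_mark=0x1 drop flow instead of the priority-77 flow depends on their conntrack state bits, which the dumps in this article do not show, so the sketch does not attempt to model that step.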

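Security group flows are normally updated right after a security group changes. Working backward from the flow lifetimes above and the time of troubleshooting gives a security group update time that matches the time Openstack shows for the user's security group update. According to the user's feedback, the time the fault was noticed also closely matches the security group update time.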
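The back-calculation is plain arithmetic: the install time of a flow is the time the dump was taken minus the flow's duration. The following is a minimal sketch. The helper names are hypothetical, and the dump timestamp and tolerance in the usage example are invented placeholders, since the article gives neither the time of troubleshooting nor the recorded update time.

```python
from datetime import datetime, timedelta


def install_time(dump_time: datetime, duration_seconds: float) -> datetime:
    """Estimate when a flow was installed from the dump time and its duration."""
    return dump_time - timedelta(seconds=duration_seconds)


def matches_update(installed: datetime, sg_update: datetime,
                   tolerance: timedelta = timedelta(seconds=60)) -> bool:
    """True if the estimated install time is within `tolerance` of the update."""
    return abs(installed - sg_update) <= tolerance


if __name__ == "__main__":
    # PLACEHOLDER dump time, not from the article.
    dump_time = datetime(2020, 1, 1, 12, 0, 0)
    installed = install_time(dump_time, 8725.531)  # duration from the flow dump
    print(installed.isoformat())
```

The tolerance accounts for the delay between the API call that changes the security group and the agent actually reprogramming the flows; the 60-second default is an arbitrary assumption.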

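Final Conclusion

The user updated the virtual machine's security group, and there is a time gap while the OVS agent updates (deletes and re-adds) the security group flows. By bad luck, containers in the virtual machine were communicating during exactly that brief gap, so the traffic was wrongly marked invalid by the flows installed first and the mark was recorded in conntrack. Subsequent traffic was then dropped continuously and normal communication could not resume.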
