[Repost] VXLAN & Linux


  • https://vincent.bernat.ch/en/blog/2017-vxlan-linux


VXLAN is an overlay network to carry Ethernet traffic over an existing (highly available and scalable) IP network while accommodating a very large number of tenants. It is defined in RFC 7348.

Starting from Linux 3.12, the VXLAN implementation is quite complete as both multicast and unicast are supported as well as IPv6 and IPv4. Let’s explore the various methods to configure it.

[Figure: VXLAN deployment example]

To illustrate our examples, we use the following setup:

  • an underlay IP network (highly available and scalable, possibly the Internet);
  • three Linux bridges acting as VXLAN tunnel endpoints (VTEP); and
  • four servers believing they share a common Ethernet segment.

The VXLAN tunnels extend the individual Ethernet segments across the three bridges, providing a unique (virtual) Ethernet segment. From one host (e.g. H1), we can directly reach all the other hosts in the virtual segment:

$ ping -c10 -w1 -t1 ff02::1%eth0
PING ff02::1%eth0(ff02::1%eth0) 56 data bytes
64 bytes from fe80::5254:33ff:fe00:8%eth0: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from fe80::5254:33ff:fe00:b%eth0: icmp_seq=1 ttl=64 time=4.98 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:9%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:a%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)

--- ff02::1%eth0 ping statistics ---
1 packets transmitted, 1 received, +3 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.016/3.745/4.991/2.152 ms

Basic usage

The reference deployment for VXLAN is to use an IP multicast group to join the other VTEPs:

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   group ff05::100 \
>   dev eth0 \
>   ttl 5
# brctl addbr br100
# brctl addif br100 vxlan100
# brctl addif br100 vnet22
# brctl addif br100 vnet25
# brctl stp br100 off
# ip link set up dev br100
# ip link set up dev vxlan100
[1] This is one possible implementation. The bridge is only needed if you require some form of source-address learning for local interfaces. Another strategy is to use MACVLAN interfaces.

The above commands create a new interface acting as a VXLAN tunnel endpoint, named vxlan100, and put it in a bridge with some regular interfaces.[1] Each VXLAN segment is associated with a 24-bit segment ID, the VXLAN Network Identifier (VNI). In our example, the default VNI is specified with id 100.

When VXLAN was first implemented in Linux 3.7, the UDP port to use was not defined. Several vendors were using 8472 and Linux took the same value. To avoid breaking existing deployments, this is still the default value. Therefore, if you want to use the IANA-assigned port, you need to explicitly set it with dstport 4789.

As we want to use multicast, we have to specify a multicast group to join (group ff05::100), as well as a physical device (dev eth0). With multicast, the default TTL is 1. If your multicast network leverages some routing, you’ll have to increase the value a bit, like here with ttl 5.
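To double-check the device was created with the expected parameters, we can dump its details with ip -d. This is just a sanity check: the output below is abridged, and its exact format varies with the kernel and iproute2 versions.

# ip -d link show dev vxlan100
[…] vxlan id 100 group ff05::100 dev eth0 srcport 0 0 dstport 4789 ttl 5 […]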

The vxlan100 device acts as a bridge device with remote VTEPs as virtual ports:

  • it sends broadcast, unknown unicast, and multicast (BUM) frames to all VTEPs using the multicast group; and
  • it discovers the association from Ethernet MAC addresses to VTEP IP addresses using source-address learning.

The following figure summarizes the configuration, with the FDB of the Linux bridge (learning local MAC addresses) and the FDB of the VXLAN device (learning distant MAC addresses):

[Figure: VXLAN device attached to a Linux bridge and communicating with two remote VTEPs]

The FDB of the VXLAN device can be observed with the bridge command. If the destination MAC is present, the frame is sent to the associated VTEP (unicast). The all-zero address is only used when a lookup for the destination MAC fails.

# bridge fdb show dev vxlan100 | grep dst
00:00:00:00:00:00 dst ff05::100 via eth0 self permanent
50:54:33:00:00:0b dst 2001:db8:3::1 self
50:54:33:00:00:08 dst 2001:db8:1::1 self

If you are interested in more details on how to set up a multicast network and build VXLAN segments on top of it, see my “Network virtualization with VXLAN” article.

Without multicast

Using VXLAN over a multicast IP network has several benefits:

  • automatic discovery of other VTEPs sharing the same multicast group;
  • good bandwidth usage (packets are replicated as late as possible); and
  • decentralized and controller-less design.[2]

[2] The underlay multicast network may still need some central components, like rendez-vous points for the PIM-SM protocol. Fortunately, it’s possible to make them highly available and scalable (e.g. with Anycast-RP, RFC 4610).

However, multicast is not available everywhere and managing it at scale can be difficult. In Linux 3.8, the DOVE extensions have been added to the VXLAN implementation, removing the dependency on multicast.

Unicast with static flooding

[3] For this example and the following ones, a patch is needed for the ip command (included in 4.11) to use IPv6 for transport. In the meantime, here is a quick workaround:

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   remote 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 \
>   dev vxlan100 dst 2001:db8:3::1

We can replace multicast with head-end replication of BUM frames to a statically configured list of remote VTEPs:[3]

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:3::1

The VXLAN device is defined without a multicast group. Instead, all the remote VTEPs are associated with the all-zero address: a BUM frame will be duplicated to all these destinations. The VXLAN device will still learn remote addresses automatically using source-address learning.

It is a very simple solution. With a bit of automation, you can keep the default FDB entries up to date easily (see the sketch below). However, the host will have to duplicate each BUM frame (head-end replication) as many times as there are remote VTEPs: this is quite reasonable if you have a dozen of them, but it may get out of hand if you have thousands.
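As a minimal sketch of such automation, the following fragment rebuilds the all-zero entries from a flat file listing the remote VTEPs, one address per line. The file name and the way it is maintained are assumptions; a real deployment would plug in its own inventory source:

# cat /etc/vxlan/vteps.txt
2001:db8:2::1
2001:db8:3::1
# for vtep in $(bridge fdb show dev vxlan100 | awk '/^00:00:00:00:00:00 dst/ {print $3}'); do
>   bridge fdb del 00:00:00:00:00:00 dev vxlan100 dst "$vtep"
> done
# while read -r vtep; do
>   bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst "$vtep"
> done < /etc/vxlan/vteps.txt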

The Cumulus vxfld daemon is an example of this strategy (in its head-end replication mode). “Keepalived and unicast over multiple interfaces” shows another usage.

Unicast with static L2 entries

When the associations of MAC addresses and VTEPs are known, it is possible to pre-populate the FDB and disable learning:

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   nolearning
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:3::1
# bridge fdb append 50:54:33:00:00:09 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0a dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0b dev vxlan100 dst 2001:db8:3::1

Thanks to the nolearning flag, source-address learning is disabled. Therefore, if a MAC is missing, the frame will always be sent using the all-zero entries.

The all-zero entries are still needed for broadcast and multicast traffic (e.g. ARP and IPv6 neighbor discovery). This kind of setup works well to provide virtual L2 networks to virtual machines (when no L3 information is available). You need some glue to keep the FDB entries up to date, as sketched below.
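Here is a minimal sketch of such glue: it reads MAC-to-VTEP associations from a file and installs them, using replace to stay idempotent. The /etc/vxlan/l2.txt file is a hypothetical stand-in for your actual source of truth (CMDB, orchestrator inventory, …):

# cat /etc/vxlan/l2.txt
50:54:33:00:00:09 2001:db8:2::1
50:54:33:00:00:0a 2001:db8:2::1
50:54:33:00:00:0b 2001:db8:3::1
# while read -r mac vtep; do
>   bridge fdb replace "$mac" dev vxlan100 dst "$vtep"
> done < /etc/vxlan/l2.txt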

BGP EVPN with FRR is an example of this strategy (see “VXLAN: BGP EVPN with FRR” for additional information).

Unicast with static L3 entries

[4] You may have to apply an IPv6-related patch to the kernel (included in 4.12). Also, to make ICMPv6 work, you need an additional patch (present in 4.15, 4.14.2, and 4.13.16).

In the previous example, we had to keep the all-zero entries for ARP and IPv6 neighbor discovery to work correctly. However, Linux can answer neighbor requests on behalf of the remote nodes.[4] When this feature is enabled, the default entries are not needed anymore (but you could keep them):

# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   nolearning \
>   proxy
# ip -6 neigh add 2001:db8:ff::11 lladdr 50:54:33:00:00:09 dev vxlan100
# ip -6 neigh add 2001:db8:ff::12 lladdr 50:54:33:00:00:0a dev vxlan100
# ip -6 neigh add 2001:db8:ff::13 lladdr 50:54:33:00:00:0b dev vxlan100
# bridge fdb append 50:54:33:00:00:09 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0a dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0b dev vxlan100 dst 2001:db8:3::1

This setup eliminates head-end replication entirely; as a consequence, protocols relying on broadcast or multicast won’t work. With some automation, this is a setup that should work well with containers: if there is a registry keeping a list of all IP and MAC addresses in use, a program could listen to it and adjust the FDB and the neighbor tables, as sketched below.
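A minimal sketch of such a program, assuming the registry can be dumped as “IP MAC VTEP” triples into a hypothetical /etc/vxlan/l3.txt file. Using replace rather than add keeps the loop idempotent, so it can be re-run whenever the registry changes:

# while read -r addr mac vtep; do
>   ip -6 neigh replace "$addr" lladdr "$mac" dev vxlan100
>   bridge fdb replace "$mac" dev vxlan100 dst "$vtep"
> done < /etc/vxlan/l3.txt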

The VXLAN backend of Docker’s libnetwork is an example of this strategy (but it also uses the next method).

Unicast with dynamic L3 entries

Linux can also notify a program when an (L2 or L3) entry is missing. The program can then query some central registry and dynamically add the requested entry. However, for L2 entries, notifications are issued only if:

  • the destination MAC address is not known;
  • there is no all-zero entry in the FDB; and
  • the destination MAC address is not a multicast or broadcast one.

These limitations prevent us from implementing a “unicast with dynamic L2 entries” scenario.

[5] You have to apply an IPv6-related patch to the kernel (included in 4.12) to get appropriate notifications for missing IPv6 addresses.

First, let’s create the VXLAN device with the l2miss and l3miss options:[5]

ip -6 link add vxlan100 type vxlan \
   id 100 \
   dstport 4789 \
   local 2001:db8:1::1 \
   nolearning \
   l2miss \
   l3miss \
   proxy

Notifications are sent to programs listening to an AF_NETLINK socket using the NETLINK_ROUTE protocol. The socket needs to be bound to the RTNLGRP_NEIGH group. The following command does exactly that and decodes the received notifications:

# ip monitor neigh dev vxlan100
miss 2001:db8:ff::12 STALE
miss lladdr 50:54:33:00:00:0a STALE

The first notification is about a missing neighbor entry for the requested IP address. We can add it with the following command:

ip -6 neigh replace 2001:db8:ff::12 \
    lladdr 50:54:33:00:00:0a \
    dev vxlan100 \
    nud reachable

The entry is not permanent so that we don’t need to delete it when it expires. If the address becomes stale, we will get another notification to refresh it.

[6] Directly adding the entry after the first notification would have been smarter, to avoid unnecessary retransmissions.

Once the host receives our proxy answer for the neighbor discovery request, it can send a frame with the MAC we gave as a destination. The second notification is about the missing FDB entry for this MAC address. We add the appropriate entry with the following command:[6]

bridge fdb replace 50:54:33:00:00:0a \
    dst 2001:db8:2::1 \
    dev vxlan100 dynamic

The entry is not permanent either, as that would prevent the MAC from migrating to the local VTEP (a dynamic entry cannot override a permanent entry).
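Putting both notifications together, a toy resolver can be written as a loop around ip monitor. This is only a sketch: registry-lookup-mac and registry-lookup-vtep are hypothetical helpers querying your central registry, and a production version would rather speak netlink directly:

#!/bin/sh
# Toy resolver: install missing neighbor and FDB entries on demand.
ip monitor neigh dev vxlan100 | while read -r line; do
    case "$line" in
        "miss lladdr"*)  # l2miss: FDB entry wanted for this MAC
            mac=${line#miss lladdr }; mac=${mac%% *}
            vtep=$(registry-lookup-vtep "$mac")   # hypothetical helper
            bridge fdb replace "$mac" dst "$vtep" dev vxlan100 dynamic
            ;;
        "miss "*)        # l3miss: neighbor entry wanted for this IP
            addr=${line#miss }; addr=${addr%% *}
            mac=$(registry-lookup-mac "$addr")    # hypothetical helper
            ip -6 neigh replace "$addr" lladdr "$mac" dev vxlan100 nud reachable
            ;;
    esac
done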

This setup works well with containers and a global registry. However, there is a small latency penalty for the first connections. Moreover, multicast and broadcast won’t be available in the overlay network. The VXLAN backend for flannel, a network fabric for Kubernetes, is an example of this strategy.

Decision

There is no one-size-fits-all solution.

You should consider the multicast solution if:

  • you are in an environment where multicast is available;
  • you are ready to operate (and scale) a multicast network;
  • you need multicast and broadcast inside the virtual segments; and
  • you don’t have L2/L3 addresses available beforehand.

The scalability of such a solution is pretty good if you take care of not putting all VXLAN interfaces into the same multicast group (e.g. use the last byte of the VNI as the last byte of the multicast group).
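For instance, under this convention, a second segment with VNI 200 would use ff05::c8 as its group, 0xc8 being 200 in hexadecimal. This is only a sketch of the convention, not a requirement of the protocol:

# ip -6 link add vxlan200 type vxlan \
>   id 200 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   group ff05::c8 \
>   dev eth0 \
>   ttl 5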

When multicast is not available, another generic solution is BGP EVPN: BGP is used as a controller to ensure the distribution of the list of VTEPs and their respective FDBs. As mentioned earlier, an implementation of this solution is FRR. I explore this option in a separate post: “VXLAN: BGP EVPN with FRR”.

[7] flannel and Docker’s libnetwork were already mentioned, as they both feature a VXLAN backend. There are also some interesting experiments like BaGPipe BGP for Kubernetes, which leverages BGP EVPN and is therefore interoperable with other vendors.

If you operate in a container-like environment where L2/L3 addresses are known beforehand, a solution using static and/or dynamic L2 and L3 entries based on a central registry and no source-address learning would also fit the bill. This provides a tighter security model (bounded resources, MiTM attacks dampened down, no way to amplify bandwidth usage through excessive broadcast). Various environment-specific solutions are available,[7] or you can build your own.

Other considerations

Independently of the chosen strategy, here are a few important points to keep in mind when implementing a VXLAN overlay.

Isolation

While you may expect VXLAN interfaces to only carry L2 traffic, Linux doesn’t disable IP processing. If the destination MAC is a local one, Linux will route or deliver the encapsulated IP packet. Check my post about the proper isolation of a Linux bridge.
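As a first step, assuming the bridged setup from the beginning of this article, you can at least refrain from assigning any IP address to br100 and disable IPv6 processing on it. This is a minimal sketch, not a complete isolation recipe; see the post mentioned above for a thorough treatment:

# ip -6 addr flush dev br100
# sysctl -qw net.ipv6.conf.br100.disable_ipv6=1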

Encryption

VXLAN enforces isolation between tenants, but the traffic is unencrypted. The most direct solution to provide encryption is to use IPsec. Some container-based solutions may come with IPsec support out of the box (notably Docker’s libnetwork, but flannel has a plan for it too). This is quite important for deployments over a public cloud.

Overhead

The format of a VXLAN-encapsulated frame is the following:

[Figure: VXLAN encapsulation inside a UDP datagram (additional overhead: 50 bytes)]

VXLAN adds a fixed overhead of 50 bytes. If you also use IPsec, the overhead depends on many factors. In transport mode, with AES and SHA256, the overhead is 56 bytes. With NAT traversal, this is 64 bytes (additional UDP header). In tunnel mode, this is 72 bytes.

[8] There is no such thing as MTU discovery on an Ethernet segment.

Some users will expect to be able to use an Ethernet MTU of 1500 for the overlay network. Therefore, the underlay MTU should be increased. If that is not possible, ensure the inner MTU (inside the containers or the virtual machines) is correctly decreased.[8]
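For example, with a classic 1500-byte MTU in the underlay, an IPv4 underlay leaves 1450 bytes for the overlay (1500 − 20 − 8 − 8 − 14), while an IPv6 underlay, as in the examples above, leaves 1430. A sketch, to be adapted to your own encapsulation stack (IPsec would reduce the value further):

# ip link set mtu 1430 dev vxlan100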

IPv6

While all the examples above are using IPv6, the ecosystem is not quite ready yet. The multicast L2-only strategy works fine with IPv6, but every other scenario currently needs the patches mentioned in footnotes [3], [4], and [5].

Update (2020-11)

Starting from Linux 4.13, all the mentioned issues are solved. IPv6 is also better tested, as Cumulus promotes IPv6 as the transport and signaling protocol for its own solutions.

On top of that, IPv6 may not be implemented yet in some VXLAN-related tools.

Multicast

The Linux VXLAN implementation doesn’t support IGMP snooping: multicast traffic is broadcast to all VTEPs unless multicast MAC addresses are inserted into the FDB.
