DPDK-22.11.2 [4] Virtio_user as Exception Path

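Because DPDK moves all NIC handling into user space and is no longer compatible with the native kernel drivers, a NIC taken over by DPDK is invisible at the system level (ip a / ifconfig), and its traffic no longer passes through the kernel.

To hand packets back to the system, you can use virtio-user. The route that carries packets from DPDK back into the kernel is called the exception path.

virtio-user builds on a series of related concepts, which are introduced systematically here.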

hypervisor

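A hypervisor is software that creates and runs virtual machines (VMs); it is also called a virtual machine monitor (VMM). The machine that runs the hypervisor is the host machine, and each virtual machine running on the hypervisor is a guest machine.

There are two types of hypervisor. A Type 1 (native or bare-metal) hypervisor runs directly on the hardware and plays the role of the operating system; a Type 2 (hosted) hypervisor runs on top of an operating system.

Common hypervisors

A hypervisor is one approach to making fuller use of hardware resources. Without virtualization, a computer serves a single user, who cannot be online 24 hours a day, so the system's resources are wasted while it sits idle. With virtualization, one computer can host several operating systems for several users, which raises utilization. The share and priority of hardware resources can also be controlled according to each user's importance (for example, what they pay). Today's cloud is a further extension of virtualization.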

VMware hypervisors

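VMware's hypervisor products fall into two groups. The first is Type 1, running directly on the hardware:

ESXi hypervisor / VMware ESXi (Elastic Sky X Integrated)
vSphere Hypervisor

The second is Type 2, running on an operating system:

VMware Fusion
VMware Workstation
VirtualBox (an Oracle product rather than a VMware one, listed here as another common Type 2 hypervisor)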

Hyper-V hypervisor

Hyper-V is Microsoft's hypervisor for Windows. It is a Type 1 hypervisor and runs directly on the hardware.

Citrix hypervisors

XenServer is the best-known Citrix Hypervisor product. It is Type 1 and is built on the open source Xen project.

Open source hypervisors

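The main ones are KVM and Xen.

Hypervisor KVM

The kernel-based virtual machine (KVM) is built directly into the Linux kernel and is usually paired with QEMU, which supplies the device emulation.

Red Hat hypervisor

The Red Hat hypervisor is built on KVM. KVM is likewise available on many other Linux distributions, such as Ubuntu.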

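Types of virtualization

Full virtualization

All virtualized instructions are provided by the virtualization software, as in desktop products such as VirtualBox and VMware Workstation. The benefit is complete isolation from the hardware and easy migration; the cost is performance.

Hardware virtualization

Because full virtualization costs performance, hardware virtualization was introduced: the hardware provides the virtualization and the virtual machine accesses it directly. Performance improves, but there are drawbacks: migration is harder, specific hardware is required, and the functions the hardware offers are incomplete, so many operations cannot be performed.

Paravirtualization

Paravirtualization was proposed to address both problems: performance-critical operations are handed to hardware (for example a dedicated decoder) or to the host operating system, while everything else is still done inside the virtual machine. The most widely used paravirtualization standard is VirtIO.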

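VirtIO is an abstraction layer for paravirtualization. It has a front end and a back end and defines a set of interfaces for communication between them. The back end sits at the hardware or host operating system level; its implementation can vary as long as it provides the defined interface operations. The front end calls these interfaces to use system resources.

The front end can therefore live inside the virtual machine. When an operation needs higher performance, the front end asks the back end for it, and the back end returns the resulting data to the front end.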

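VirtIO offload means using the VirtIO protocol to offload work to the hardware or the host operating system: performance-heavy operations (network traffic processing, for example) are taken out of the virtual machine, performed by the hardware or the operating system, and the result is returned to the virtual machine.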

Deep dive into Virtio-networking

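Background

Networking

NIC (Network Interface Card): the device that connects the machine to the network. Modern NICs can also take over some network processing from the CPU (offload).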

tun/tap - virtual point-to-point network devices that the userspace applications can use to exchange packets. The device is called a tap device when the data exchanged is layer 2 (ethernet frames), and a tun device if the data exchanged is layer 3 (IP packets).
When the tun kernel module is loaded it creates a special device /dev/net/tun. A process can create a tap device opening it and sending special ioctl commands to it. The new tap device has a name in the /dev filesystem and another process can open it, send and receive Ethernet frames.
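
As a rough sketch of the "open it and send special ioctl commands" step, the code below opens /dev/net/tun and issues TUNSETIFF with the IFF_TAP flag. The function names tap_prepare_ifreq and tap_open are made up for this example; the headers, flags, and ioctl are the standard Linux ones. Creating the device normally needs root or CAP_NET_ADMIN, so tap_open simply returns -1 when that is not available.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Fill an ifreq that asks for a layer-2 TAP device without the extra
 * packet-information header. Pass "" as name to let the kernel pick tapN. */
void tap_prepare_ifreq(struct ifreq *ifr, const char *name)
{
    assert(ifr != NULL && name != NULL);
    memset(ifr, 0, sizeof(*ifr));
    ifr->ifr_flags = IFF_TAP | IFF_NO_PI;
    /* The struct was zeroed, so the name stays NUL-terminated. */
    strncpy(ifr->ifr_name, name, IFNAMSIZ - 1);
}

/* Open the clone device and attach a TAP interface to the returned fd.
 * Returns the fd, or -1 on error (no permission, tun module not loaded...). */
int tap_open(const char *name)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);

    if (fd < 0)
        return -1;
    tap_prepare_ifreq(&ifr, name);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    /* read()/write() on fd now receive/send whole Ethernet frames. */
    return fd;
}
```

Swapping IFF_TAP for IFF_TUN would give a layer-3 device that exchanges IP packets instead of Ethernet frames.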

IPC Inter-Process Communication

Sockets, eventfd, and shared memory are all IPC mechanisms.
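
Of the three, eventfd is probably the least familiar. In vhost-style designs it typically serves as a doorbell: one side writes to say "there is new work in shared memory", the other side reads to pick up the notification. The sketch below shows only the counter semantics of a default (non-semaphore) eventfd; the helper names notify and wait_events are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Add n to the eventfd counter, i.e. ring the doorbell n times.
 * Returns 0 on success, -1 on error. */
int notify(int efd, uint64_t n)
{
    assert(efd >= 0);
    return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Wait until the counter is non-zero, return its value and reset it to 0.
 * Returns 0 on error. */
uint64_t wait_events(int efd)
{
    uint64_t count = 0;

    assert(efd >= 0);
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

Several notifications that arrive before the reader wakes up are folded into one read, which is what makes eventfd cheap as a "something changed" signal: the actual data stays in shared memory.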

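Implementations

virtio-net / Networking with virtio: QEMU implementation

In this design, the virtio-net driver in the guest kernel talks to the virtio-net device in QEMU, which in turn talks to the tap device in the host kernel. The path crosses between user space and kernel space several times, uses the system's default drivers, and involves a lot of interrupt handling, so performance is limited.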

Vhost protocol

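Vhost addresses the limits of the scheme above by offloading the performance-critical part to another module. In other words, work that the virtual machine is poorly suited to do is handed to a different component, and data is exchanged through some communication mechanism.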

Vhost-net

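vhost-net is one implementation of the vhost protocol and is already part of the Linux kernel. Once the kernel module is loaded, the character device /dev/vhost-net appears on the system.

Previously the path ran from virtio-net in the guest kernel, to virtio-net in QEMU, to the tap device in the host kernel. With vhost-net there is one step fewer: the data path goes, via IPC (inter-process communication), straight to vhost-net in the host kernel, which improves performance.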

vhost-user

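The scheme above shares memory with the kernel, but it still involves context switches. vhost-user moves the back end entirely into user space: control messages travel over a Unix domain socket between QEMU and the user-space back end rather than to the kernel, which removes those context switches from the data path and also makes development easier.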
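
What makes a Unix domain socket suitable as the control channel is that it can carry file descriptors as ancillary data, so one process can hand another a shared-memory region or an eventfd. The sketch below shows that mechanism (SCM_RIGHTS) in isolation. It is not the vhost-user message format; send_fd and recv_fd are hypothetical helper names.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Properly aligned buffer for one control message carrying one int. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send descriptor fd over a connected Unix domain socket.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0; /* at least one byte of ordinary data must go along */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd in this process, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    assert(sock >= 0);
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new number in the receiving process but refers to the same underlying object, which is how both sides end up looking at the same memory and the same notification channel.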

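With everything moved to user space and DPDK polling in place of interrupts, context switches and interrupts are avoided and performance improves considerably.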

virtio-user

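According to the official documentation, virtio-user was introduced together with vhost-user, with vhost-user as the back end and virtio-user as the front end. Besides its use in containers alongside vhost-user, virtio-user can also be paired with vhost-kernel to send packets back to the operating system.

Hardware acceleration

HW vDPA (hardware vhost data path acceleration) has the hardware implement the virtio data path itself, and is commonly deployed on top of SR-IOV VF passthrough.

The fastest option is of course to use hardware as the back end and hand operations straight to it. But hardware is more constrained, less feature-rich than the other approaches, and expensive, so dedicated hardware is generally used as the back end only where performance requirements are very high.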

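Exception path options

TAP/TUN

This is the earliest approach. It communicates through the system's TAP/TUN devices using standard system APIs; the drawback is that context switches and interrupts hurt performance.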

KNI Kernel NIC Interface

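Compared with TAP/TUN, KNI reduces data copies and supports Linux management tools (ethtool and others).

Its drawbacks are that it is deprecated, unsafe, and incomplete in functionality.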

virtio user

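virtio-user is the replacement for KNI. Its advantages:

The vhost-net back end it uses is part of the upstream Linux kernel, so no extra module has to be maintained
More complete functionality
Higher performance

The basic flow of virtio-user is as follows.

Packets travel from the NIC to the DPDK PMD, and data and control information are exchanged with the kernel through virtio. That is, packets received by the PMD are sent to the kernel through virtio, with virtio-user as the front end and the kernel's vhost-net as the back end.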

Testing virtio-user with testpmd

build/app/dpdk-testpmd -l 12-15 -a 0000:84:00.0 \
    --vdev=virtio_user0,path=/dev/vhost-net,queues=1,queue_size=1024 -- --numa

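-l 12-15 runs on CPU cores 12 to 15.
-a 0000:84:00.0 selects the NIC port to use; this port must be receiving traffic.
--vdev=virtio_user0,path=/dev/vhost-net,queues=1,queue_size=1024 creates a virtual device named virtio_user0 whose path is /dev/vhost-net (which is what lets packets reach the kernel); queues=1 sets one communication queue and queue_size=1024 sets the queue size to 1024.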
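
When the same argument string has to be generated for several ports, it can be composed programmatically. The helper below is a small sketch; format_virtio_user_vdev is an invented name, not a DPDK function, and it only builds the text using the keys shown in the command above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compose "virtio_user<idx>,path=<path>,queues=<q>,queue_size=<size>".
 * Returns 0 on success, -1 if an argument is invalid or out is too small. */
int format_virtio_user_vdev(char *out, size_t len, unsigned idx,
                            const char *path, unsigned queues,
                            unsigned queue_size)
{
    int n;

    assert(out != NULL);
    if (path == NULL || strlen(path) == 0 || queues == 0)
        return -1;
    n = snprintf(out, len,
                 "virtio_user%u,path=%s,queues=%u,queue_size=%u",
                 idx, path, queues, queue_size);
    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with idx 0, path "/dev/vhost-net", 1 queue and size 1024, it produces exactly the value passed to --vdev above.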

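After startup, ip a shows a new tap0 device. The name virtio_user0 given above is the name used inside DPDK; since no interface name was specified for the kernel side, it defaults to tapN.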

ip a
...
69: tap0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:e0:f5:1f:21:5f brd ff:ff:ff:ff:ff:ff

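The device is created in the DOWN state and has to be brought up. The official example also assigns an IP address, but that is not needed just to check whether packets are being received.

ip link set dev tap0 up

Then check the details with ifconfig: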

ifconfig tap0
tap0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::18e0:f5ff:fe1f:215f  prefixlen 64  scopeid 0x20<link>
        ether 1a:e0:f5:1f:21:5f  txqueuelen 1000  (Ethernet)
        RX packets 1175788  bytes 947947134 (904.0 MiB)
        RX errors 0  dropped 1  overruns 0  frame 0
        TX packets 6  bytes 516 (516.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The RX counters show that packets are arriving.
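
To watch the counter from a program instead of re-running ifconfig, the per-interface statistics files under /sys/class/net/<iface>/statistics/ can be read directly. This is a minimal sketch; read_counter is an invented helper, and the tap0 path in the comment assumes the interface name from the run above.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single decimal counter from a text file, for example
 * /sys/class/net/tap0/statistics/rx_packets.
 * Returns the value, or -1 if the file is missing or unparsable. */
long long read_counter(const char *path)
{
    FILE *f;
    long long value = -1;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Reading the file twice a second apart and subtracting gives a rough packets-per-second figure for the exception path.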

With several NIC ports, pass one -a option per port and one --vdev per virtual device; this gives two virtual devices, tap0 and tap1.

build/app/dpdk-testpmd -l 12-15 -a 0000:84:00.0 -a 0000:84:00.1 \
    --vdev=virtio_user0,path=/dev/vhost-net --vdev=virtio_user1,path=/dev/vhost-net -- --numa

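Start another process with tap0 as its receive device, and it will receive the packets.

build/app/dpdk-testpmd -l 2-5 --vdev=net_af_packet0,iface=tap0 --in-memory --no-pci

An example based on basicfwd that creates the virtual device manually

The --vdev option above is functionality DPDK already implements, and we have no control over it. To create a virtual device dynamically yourself, use the DPDK API rte_eal_hotplug_add, which adds a device at run time. Once added, the device behaves like any physical port: initialize and start it in the usual way. The official documentation also provides sample code.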

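#include <stdint.h>
#include <stdio.h>    /* printf */
#include <stdlib.h>
#include <string.h>   /* memset, strerror */
#include <inttypes.h>
#include <unistd.h>
#include <rte_eal.h>
#include <rte_dev.h>  /* rte_eal_hotplug_add, rte_eal_hotplug_remove */
#include <rte_ethdev.h>
#include <rte_cycles.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_config.h>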
#define RX_RING_SIZE 1024
#define TX_RING_SIZE 1024

#define NUM_MBUFS 8191
#define MBUF_CACHE_SIZE 250
#define BURST_SIZE 32
uint16_t virport[64];
int virportnum = 0;
struct lcore_conf
{
    unsigned n_rx_port;
    unsigned rx_port_list[16];
    int pkts;
} __rte_cache_aligned;

static struct lcore_conf lcore_conf_info[RTE_MAX_LCORE];

static inline int port_init(uint16_t port, struct rte_mempool *mbuf_pool)
{
    uint16_t portid = port;
    struct rte_eth_conf port_conf;
    uint16_t nb_rxd = RX_RING_SIZE;
    uint16_t nb_txd = TX_RING_SIZE;
    int retval;
    uint16_t q;
    struct rte_eth_dev_info dev_info;
    int istx=0;

    if (!rte_eth_dev_is_valid_port(port))
        return -1;
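    // Check whether this port is one of the virtual devices created below.
    // Hot-plugged devices are also visited by the port iteration and need separate handling.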
    for (int i = 0; i < virportnum; i++)
    {
        if (port == virport[i])
        {
            istx=1;
            break;
        }
    }
    uint16_t rx_rings = 0, tx_rings = 0;
    if (istx == 1)
    {
        tx_rings = 1;
    }
    else
    {
        rx_rings = 1;
    }

    memset(&port_conf, 0, sizeof(struct rte_eth_conf));

    retval = rte_eth_dev_info_get(port, &dev_info);
    if (retval != 0)
    {
        printf("Error during getting device (port %u) info: %s\n",
               port, strerror(-retval));
        return retval;
    }

    if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
        port_conf.txmode.offloads |=
                RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

    retval = rte_eth_dev_configure(port, rx_rings, tx_rings, &port_conf);
    if (retval != 0)
        return retval;

    retval = rte_eth_dev_adjust_nb_rx_tx_desc(port, &nb_rxd, &nb_txd);
    if (retval != 0)
        return retval;
    // A hot-plugged virtual device is set up just like a physical one.
    // Physical ports only receive (RX queue); virtual ports only transmit (TX queue).
    if (istx == 0)
    {
        for (q = 0; q < rx_rings; q++)
        {
            retval = rte_eth_rx_queue_setup(port, q, nb_rxd, rte_eth_dev_socket_id(port), NULL, mbuf_pool);
            if (retval < 0)
                return retval;
            retval = rte_eth_dev_set_ptypes(port, RTE_PTYPE_UNKNOWN, NULL, 0);
            if (retval < 0)
            {
                    printf("Port %u, Failed to disable Ptype parsing\n", port);
                    return retval;
            }
        }
    }
    else
    {
        for (q = 0; q < tx_rings; q++)
        {
            retval = rte_eth_tx_queue_setup(port, q, nb_txd, rte_eth_dev_socket_id(port), NULL);
            if (retval < 0)
                return retval;
        }

    }

    retval = rte_eth_dev_start(port);
    if (retval < 0)
        return retval;

    char portname[32];
    char portargs[256];

    struct rte_ether_addr addr;
    retval = rte_eth_macaddr_get(port, &addr);
    if (retval != 0)
        return retval;

    printf("Port %u MAC: %02" PRIx8 " %02" PRIx8 " %02" PRIx8 " %02" PRIx8 " %02" PRIx8 " %02" PRIx8 "\n", port, RTE_ETHER_ADDR_BYTES(&addr));

    // For a physical port, create a matching virtual device.
    if(istx==0)
    {
        snprintf(portname, sizeof(portname), "virtio_user%u", port);
        // Change the last MAC byte so it differs from the physical port.
        addr.addr_bytes[5]=1;
        // Build the device arguments: backend path, interface name, MAC address, etc.
        snprintf(portargs, sizeof(portargs), "path=/dev/vhost-net,queues=1,queue_size=%u,iface=%s,mac=" RTE_ETHER_ADDR_PRT_FMT, RX_RING_SIZE, portname, RTE_ETHER_ADDR_BYTES(&addr));

        // Hot-plug the device into the EAL.
        if (rte_eal_hotplug_add("vdev", portname, portargs) < 0)
            rte_exit(EXIT_FAILURE, "Cannot create paired port for port %u\n", port);

        // UINT16_MAX instead of -1: the variable is unsigned.
        uint16_t virportid = UINT16_MAX;
        // Look up the port id by device name.
        if (rte_eth_dev_get_port_by_name(portname, &virportid) != 0)
        {
            rte_eal_hotplug_remove("vdev", portname);
            rte_exit(EXIT_FAILURE, "cannot find added vdev %s:%s:%d\n", portname, __func__, __LINE__);
        }
        // Remember the virtual port id.
        virport[virportnum] = virportid;
        virportnum++;
    }
    }
    
    // Promiscuous mode must not be enabled on the virtual device.
    if(istx==0)
    {
        retval = rte_eth_promiscuous_enable(port);
        if (retval != 0)
            return retval;
        for (int i = 0; i < RTE_MAX_LCORE; i++)
        {
            if (rte_lcore_is_enabled(i) == 0)
            {
                continue;
            }

            if (i == rte_get_main_lcore())
            {
                continue;
            }

            if (lcore_conf_info[i].n_rx_port > 0)
            {
                continue;
            }

            struct lcore_conf *qconf = &lcore_conf_info[i];
            qconf->rx_port_list[qconf->n_rx_port] = port;
            qconf->n_rx_port++;
            break;
        }
    }

    return 0;
}

static int lcore_main(void *param)
{
    int ret;
    int lcore_id = rte_lcore_id();
    struct lcore_conf *qconf = &lcore_conf_info[lcore_id];

    int master_coreid = rte_get_main_lcore();
    uint16_t port;
    if (qconf->n_rx_port == 0)
    {
        printf("lcore %u has nothing to do\n", lcore_id);
        return 0;
    }

    if (lcore_id == rte_get_main_lcore())
    {
        printf("do not receive data in main core\n");
        return 0;
    }

    RTE_ETH_FOREACH_DEV(port)
    if (rte_eth_dev_socket_id(port) >= 0 &&
        rte_eth_dev_socket_id(port) !=
        (int) rte_socket_id())
        printf("WARNING, port %u is on remote NUMA node to "
               "polling thread.\n\tPerformance will "
               "not be optimal.\n", port);

    printf("\nCore %u forwarding packets. [Ctrl+C to quit]\n", rte_lcore_id());
    uint16_t portid;
    for (;;)
    {
        for (int i = 0; i < qconf->n_rx_port; i++)
        {
            int port = qconf->rx_port_list[i];
            portid = port;
            struct rte_mbuf *bufs[BURST_SIZE];
            uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);

            if (unlikely(nb_rx == 0))
                continue;
            uint16_t nb_tx = 0;
            for (int i = 0; i < virportnum; i++)
            {
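                // Pick a virtual port and transmit on it.
                // There is only one device here, so taking the first is fine.
                // With several, set up a one-to-one port mapping before sending.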
                nb_tx = rte_eth_tx_burst(virport[i], 0, bufs, nb_rx);
                break;
            }

            for (int j = nb_tx; j < nb_rx; j++)
            {
                // Packets that were sent are freed by the driver; unsent ones must be freed here.
                rte_pktmbuf_free(bufs[j]);
            }
        }
    }

    return 0;
}

int main(int argc, char *argv[])
{
    struct rte_mempool *mbuf_pool;
    unsigned nb_ports;
    uint16_t portid;
    memset(lcore_conf_info, 0, sizeof(lcore_conf_info));
    memset(virport, -1, sizeof(virport));

    int ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");

    nb_ports = rte_eth_dev_count_avail();
    mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS * nb_ports, MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    if (mbuf_pool == NULL)
        rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");

    // Note: virtual devices hot-plugged during this loop are also visited by it.
    RTE_ETH_FOREACH_DEV(portid)
    if (port_init(portid, mbuf_pool) != 0)
        rte_exit(EXIT_FAILURE, "Cannot init port %" PRIu16 "\n", portid);

    rte_eal_mp_remote_launch(lcore_main, NULL, SKIP_MAIN);
    int lcore_id;
    RTE_LCORE_FOREACH_WORKER(lcore_id)
    {
        if (rte_eal_wait_lcore(lcore_id) < 0)
        {
            ret = -1;
            break;
        }
    }

    rte_eal_cleanup();

    return 0;
}

Build and run it, then check with ip a:

ip a
...
70: virtio_user0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:e0:f5:1f:21:01 brd ff:ff:ff:ff:ff:ff

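The device is visible, and because a name was specified it is no longer tap0 but the virtio_user0 we chose. The MAC address is also the one we set.

Bring the device up and check again: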

ip link set dev virtio_user0 up

ip a
70: virtio_user0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 1a:e0:f5:1f:21:01 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::92e2:baff:fe85:3d01/64 scope link tentative 
       valid_lft forever preferred_lft forever

Check the interface's received-packet counters:

ifconfig virtio_user0
virtio_user0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::92e2:baff:fe85:3d01  prefixlen 64  scopeid 0x20<link>
        ether 1a:e0:f5:1f:21:01  txqueuelen 1000  (Ethernet)
        RX packets 2899366  bytes 2334954577 (2.1 GiB)
        RX errors 0  dropped 1  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

http://doc.dpdk.org/guides-22.11/howto/virtio_user_as_exception_path.html
https://www.redhat.com/en/topics/virtualization/what-is-a-hypervisor
https://en.wikipedia.org/wiki/Hypervisor
https://www.ibm.com/topics/hypervisors
https://aws.amazon.com/cn/what-is/hypervisor/
https://developer.ibm.com/articles/l-virtio/
https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
https://www.redhat.com/en/blog/deep-dive-virtio-networking-and-vhost-net
https://qemu-project.gitlab.io/qemu/interop/vhost-user.html
https://www.redhat.com/en/blog/journey-vhost-users-realm
https://mp.weixin.qq.com/s/q3qAaMBGyQ5E2_2Dd-IvdA
https://www.cnblogs.com/bakari/p/8971710.html
https://doc.dpdk.org/guides-18.08/sample_app_ug/exception_path.html
https://doc.dpdk.org/guides/prog_guide/kernel_nic_interface.html
