In some of my recent research in NFV, I’ve needed to expose Docker containers to the host’s network, treating them like fully functional virtual machines with their own interfaces and routable IP addresses. This type of exposure is overkill for many applications, but necessary for user space packet processing such as that required for NFV. For example, you might want to give your containerized virtual firewall direct access to a physical interface, bypassing the OVS/bridge and the associated overhead, but without the security vulnerabilities of --net=host.
You have a few options in this kind of situation. You could directly assign the interface, giving the container exclusive access. The number of available physical NICs is limited, though, so a more realistic option for most of us is to virtualize the interface, and let the container think it has a real NIC. Jérôme Petazzoni’s pipework script makes it a breeze to do this; by default, if you assign a physical interface to a container with pipework, it will create a virtual interface, place it under a fast macvlan L2 switch, and assign the virtual interface to the container’s network namespace. This comes with a cost, of course: macvlan is still a software switch, another layer between your container and the NIC.
A third option is to use Single Root IO Virtualization (SR-IOV), which allows a PCI device to present itself as multiple virtual functions to the OS. Each function acts as a device and gets its own MAC, and the NIC can then use its built-in hardware classifier to place incoming packets in separate RX queues based on the destination MAC. Scott Lowe has a great intro to SR-IOV (see the links at the end of this post).
There are a few reasons that you might want to use SR-IOV rather than a macvlan bridge. There are architectural benefits to moving the packet processing from the kernel to user space: it can improve cache locality and reduce context switches (Rathore et al., 2013). It can also make it easier to quantify and control the CPU usage of the user space application, which is critical in multi-tenant environments. On the other hand, there are situations where you would definitely not want to use SR-IOV, primarily when you have containers on the same host that need to communicate with each other. In this case, just as with macvlan passthru mode, packets must be copied to the NIC to be switched back to the host, which carries a pretty dismal performance penalty.
So, let’s dig in and make it happen. You’ll want to make sure that the NIC you intend to virtualize actually supports SR-IOV, and find out how many virtual functions it supports. I’m working with some Dell C8220s with Intel 82599 10G NICs, which support up to 63 virtual functions. Intel publishes a list of its devices with SR-IOV support (linked at the end of this post); other manufacturers should have their own lists.
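If you would rather ask the machine than a vendor list, the kernel exposes a sriov_totalvfs attribute in sysfs for PCI devices that advertise SR-IOV. Here is a minimal sketch that prints each such device with its maximum VF count. The function name and the optional directory argument are my own illustration (the argument only exists so the function can be pointed at a test tree), not part of any existing tool:

```shell
#!/bin/sh
# List PCI devices that advertise SR-IOV, one per line: "<slot> <max VFs>".
# $1: sysfs PCI device directory (hypothetical override for testing;
#     defaults to the real /sys/bus/pci/devices)
list_sriov_devices() {
    root=${1:-/sys/bus/pci/devices}
    for f in "$root"/*/sriov_totalvfs; do
        # If the glob matched nothing, $f is the literal pattern: skip it.
        [ -r "$f" ] || continue
        dev=${f%/sriov_totalvfs}
        printf '%s %s\n' "${dev##*/}" "$(cat "$f")"
    done
}
```

Calling list_sriov_devices with no arguments scans the real sysfs tree. No output means no device on the host exposes the attribute, which can also happen if SR-IOV is disabled in the BIOS or the driver does not support it.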
Creating Virtual NICs
First, get a list of your available NICs. Here’s a handy one-liner:
$ lspci -Dvmm|grep -B 1 -A 4 Ethernet
This gives you the PCI slot, class, and other useful information of your Ethernet devices, like this:
Slot: 0000:02:00.0
Class: Ethernet controller
Vendor: Intel Corporation
Device: I350 Gigabit Network Connection
SVendor: Dell
SDevice: Device 0518
--
Slot: 0000:82:00.0
Class: Ethernet controller
Vendor: Intel Corporation
Device: 82599ES 10-Gigabit SFI/SFP+ Network Connection
SVendor: Inventec Corporation
SDevice: Device 004c
In this case, I’m going to be virtualizing the 10G NIC, so I note the slot: 0000:82:00.0. Next, decide how many virtual functions you’re going to need, in addition to the physical device. I’m going to be assigning interfaces to 2 Docker containers, so I’ll create 2 VFs. Then just write that number into the sriov_numvfs file for the device (this needs a root shell):
$ echo 2 > /sys/bus/pci/devices/0000:82:00.0/sriov_numvfs
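Two things can trip up that write: asking for more VFs than the device's sriov_totalvfs allows, and changing from one nonzero VF count directly to another, which the kernel rejects until the count has been reset to 0. A small wrapper can guard against both. This is only a sketch; the function name set_numvfs and its optional third argument (a sysfs directory override, there for testing) are hypothetical:

```shell
#!/bin/sh
# Set the number of VFs on a PCI device, with two sanity checks.
# $1: PCI slot, e.g. 0000:82:00.0
# $2: desired number of VFs
# $3: sysfs PCI device directory (hypothetical override for testing)
# Note: plain sh has no local variables, so these names are global.
set_numvfs() {
    slot=$1
    want=$2
    dev="${3:-/sys/bus/pci/devices}/$slot"

    # A missing sriov_totalvfs means the device does not advertise SR-IOV.
    max=$(cat "$dev/sriov_totalvfs" 2>/dev/null) || max=0
    if [ "$want" -gt "$max" ]; then
        echo "error: $slot supports at most $max VFs" >&2
        return 1
    fi

    cur=$(cat "$dev/sriov_numvfs" 2>/dev/null) || return 1
    if [ "$cur" -eq "$want" ]; then
        return 0    # nothing to change
    fi
    # The kernel refuses to go from one nonzero count to another,
    # so drop back to 0 first.
    if [ "$cur" -ne 0 ]; then
        echo 0 > "$dev/sriov_numvfs"
    fi
    echo "$want" > "$dev/sriov_numvfs"
}
```

Run as root, set_numvfs 0000:82:00.0 2 has the same effect as the echo above. Keep in mind that resetting to 0 destroys the existing VFs, so any container still holding one loses its interface.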
Now, check ifconfig -a. You should see a number of new interfaces created, starting with “eth” and numbered after your existing interfaces. They’ve been assigned random MACs, and are ready for you to use. Here’s mine:
$ ifconfig -a
...
eth4      Link encap:Ethernet  HWaddr 7e:17:0f:de:7c:05
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

eth5      Link encap:Ethernet  HWaddr ba:ad:ac:43:31:14
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
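If the host already has a lot of interfaces, it can be hard to tell which of them belong to the new VFs. The physical device's sysfs directory contains one virtfnN link per VF, and when a VF driver is bound on the host, each of those has a net/ subdirectory naming its interface. A rough sketch that prints the mapping follows; the function name and the optional directory argument are my own, added for illustration and testing:

```shell
#!/bin/sh
# Print "<virtfnN> <interface> <MAC>" for each VF of a physical function.
# $1: PCI slot of the physical function, e.g. 0000:82:00.0
# $2: sysfs PCI device directory (hypothetical override for testing)
list_vf_ifaces() {
    dev="${2:-/sys/bus/pci/devices}/$1"
    for vf in "$dev"/virtfn*; do
        # No net/ directory means no network driver is bound to this VF.
        [ -d "$vf/net" ] || continue
        for ifdir in "$vf"/net/*; do
            [ -r "$ifdir/address" ] || continue
            printf '%s %s %s\n' "${vf##*/}" "${ifdir##*/}" "$(cat "$ifdir/address")"
        done
    done
}
```

On my setup, list_vf_ifaces 0000:82:00.0 would be expected to list eth4 and eth5. A VF that has already been moved into a container's namespace will not show up, since its net/ entry is only visible from the namespace that owns the interface.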
Plumb It
My preferred tool for adding interfaces to Docker containers is pipework, but in this case, pipework will virtualize the virtual interface with a macvlan bridge. As a workaround, I forked the pipework repository and made it accept --direct-phys as the first argument, to force it to skip the macvlan and bring the interface directly into the container’s network namespace. I’ve submitted the change upstream, and if it makes its way into the original project, I’ll update this post.
First, I’ll make a container for testing:
$ server=$(docker run -d rakurai/netperf netserver -D)
Now, let’s give that container a new virtual NIC, with the modified pipework:
$ sudo pipework --direct-phys eth4 $server 10.10.1.4/24
By default, pipework will name the new interface eth1 inside the container (Note: see bottom of post for one caveat). Just to double check:
$ docker exec -it $server ifconfig
...
eth1      Link encap:Ethernet  HWaddr 7e:17:0f:de:7c:05
          inet6 addr: fe80::7c17:fff:fede:7c05/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:414 (414.0 B)  TX bytes:2406 (2.4 KB)
Note that the MAC is the same one that ifconfig showed on the host, and that the interface is no longer visible in the host’s ifconfig: this is because the interface is now in the container’s network namespace. Now, to try it out with another physical machine on that interface’s network:
$ docker exec -it $server ping 10.10.1.1 -c 1
PING 10.10.1.1 (10.10.1.1) 56(84) bytes of data.
64 bytes from 10.10.1.1: icmp_seq=1 ttl=64 time=0.204 ms
If you like doing things the hard way, here are the steps (run as root) to mimic how pipework puts the interface in the container’s network namespace. Note that ip netns exec only finds namespaces registered under /var/run/netns, so the container’s namespace has to be symlinked there first:
$ IP_ADDR=10.10.1.4/24
$ HOST_IFACE=eth4
$ CONT_IFACE=eth1
$ NAMESPACE=$(docker inspect --format='{{ .State.Pid }}' $server)
$ mkdir -p /var/run/netns
$ ln -s /proc/$NAMESPACE/ns/net /var/run/netns/$NAMESPACE
$ ip link set $HOST_IFACE netns $NAMESPACE
$ ip netns exec $NAMESPACE ip link set $HOST_IFACE name $CONT_IFACE
$ ip netns exec $NAMESPACE ip addr add $IP_ADDR dev $CONT_IFACE
$ ip netns exec $NAMESPACE ip link set $CONT_IFACE up
EDIT:
One issue that you may have with this approach happens when you stop the container. If you’ve renamed the interface to something else, like pipework and my above example do, there will be a conflict when the kernel tries to move the interface back to the host’s namespace. The simplest solution is to avoid renaming the interface, unless it’s critical that the interface be named something specific within the container. This is pretty easy with pipework; just specify the container interface name:
$ sudo pipework --direct-phys eth4 -i eth4 $server 10.10.1.4/24
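If the interface has already been renamed inside a running container, another way around the conflict is to undo the rename and hand the interface back to the host yourself before stopping the container. The following is only a sketch of that idea, not something pipework does for you. The helper names (run, release_iface) and the DRY_RUN switch are hypothetical, and it assumes nsenter from util-linux is installed:

```shell
#!/bin/sh
# Echo the command instead of running it when DRY_RUN is set
# (hypothetical helper, handy for checking what would be executed).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Rename an interface back to its host name and return it to the host.
# $1: PID of the container's main process (from docker inspect)
# $2: interface name inside the container, e.g. eth1
# $3: original interface name on the host, e.g. eth4
release_iface() {
    pid=$1
    cont_if=$2
    host_if=$3
    # Take the link down first: older kernels refuse to rename an
    # interface that is up. Then rename it, and finally move it to the
    # namespace of PID 1, which is the host's network namespace.
    run nsenter -t "$pid" -n ip link set "$cont_if" down &&
    run nsenter -t "$pid" -n ip link set "$cont_if" name "$host_if" &&
    run nsenter -t "$pid" -n ip link set "$host_if" netns 1
}
```

For example, as root: release_iface $(docker inspect --format='{{ .State.Pid }}' $server) eth1 eth4, followed by docker stop $server. Setting DRY_RUN=1 beforehand prints the three commands without touching anything.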
Let me know how it works out for you in the comments below.
Links:
Jérôme Petazzoni’s pipework
Modified pipework with --direct-phys option
Scott Lowe on SR-IOV
Intel devices that support SR-IOV