In virtualization technologies, there are many terms that start with the letter "v". This article gives a very brief introduction to these words from my understanding. Check the links at the end for more details.
Virtio and virtqueue
Virtio1 is a standard2 for implementing virtual devices. The name comes from “virtual IO”. A virtual device is different from an emulated real device. Consider a network interface card. An OS uses drivers to operate a real NIC to send and receive packets. To emulate such a real NIC in a virtual machine, developers have to implement every detail of this NIC so that it appears as a bare-metal NIC to the VM's OS. This reverse-engineering process is troublesome, and the emulated NIC usually does not deliver satisfying performance.
To enhance performance and also simplify a virtual machine, para-virtualization and virtual devices were introduced. Para-virtualization means we do not emulate a full bare-metal machine; instead, the VM's OS is informed that it is running in a virtualized environment. For example, the network packets generated by the VM are usually forwarded to the host OS for further processing, so in the simplest case, the VM can just put the packet data in a buffer and then instruct the host OS to handle it. A fully-emulated NIC is not necessary within the VM.
Virtio established a common framework for implementing a virtual NIC, block device, random number generator, etc. The mechanism for the guest OS and the VMM to share data is the virtqueue, a ring buffer. Virtual machine software (QEMU, VMware, etc.) presents virtio devices to the guest OS, and the guest OS uses special virtio drivers (different from the drivers for a real NIC) to operate them. When data is to be sent out, the guest drivers place the data in the out-queue and trigger a VM exit. Conversely, when data is received, the virtual machine software inserts it into the in-queue and injects an interrupt into the guest OS.
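To make the virtqueue idea more concrete, here is a simplified sketch of the split-ring layout, loosely following the structures in the virtio 1.x spec (field names abridged and endianness annotations dropped; see the spec for the authoritative definitions):

```c
#include <stdint.h>

/* One buffer descriptor. The driver fills these in; the device reads them. */
struct virtq_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint16_t flags;  /* e.g. NEXT (chained) or WRITE (device writes into it) */
    uint16_t next;   /* index of the next descriptor in a chain */
};

/* Driver -> device: indices of descriptor chains being offered. */
struct virtq_avail {
    uint16_t flags;
    uint16_t idx;    /* where the driver will put the next entry */
    uint16_t ring[]; /* descriptor indices */
};

/* Device -> driver: which chains were consumed, and how much was written. */
struct virtq_used_elem {
    uint32_t id;     /* head index of the used descriptor chain */
    uint32_t len;    /* number of bytes the device wrote */
};

struct virtq_used {
    uint16_t flags;
    uint16_t idx;    /* where the device will put the next entry */
    struct virtq_used_elem ring[];
};
```

The driver advances virtq_avail.idx to publish buffers and then notifies the device (the “kick” that causes a VM exit); the device advances virtq_used.idx and raises an interrupt when it is done.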
Vhost
Vhost3 is a mechanism, or a protocol, for off-loading the data plane of virtio devices to another process. Why do we need off-loading? From my understanding, there are two reasons: performance and modularization.
Performance: kernel /dev/vhost-* devices
Some background: as para-virtualization techniques developed, CPU manufacturers added a special set of instructions to realize hardware-assisted virtualization. These instructions are only available in ring 0, while most VMM software (except type-1 hypervisors) runs in user space. Thus OSes usually provide APIs (or system calls) for user space programs to access these CPU features. For example, Linux has kvm, and macOS provides the Hypervisor framework4.
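As a concrete illustration of such an API, here is a hedged sketch of the very first steps a user space VMM takes with Linux's kvm interface: open /dev/kvm, check the API version, and create a VM file descriptor. Everything a real VMM does afterwards (creating vCPUs, mapping guest memory, the vCPU run loop) is omitted.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) {
        perror("open /dev/kvm");
        return 1;
    }

    /* The stable kvm API version is 12. */
    int version = ioctl(kvm, KVM_GET_API_VERSION, 0);
    printf("kvm API version: %d\n", version);

    /* Returns a new file descriptor that represents the virtual machine. */
    int vm_fd = ioctl(kvm, KVM_CREATE_VM, 0);
    printf("vm fd: %d\n", vm_fd);
    return 0;
}
```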
In the kvm case, when a VM wants to send data out, it first triggers a VM exit, so control flow returns to kvm, and kvm further returns to the user space VMM, for example, QEMU. In most cases, QEMU will then make another system call to handle the data. For example, if the data is a network packet, QEMU writes it to a tap device. In these procedures we have two kernel-user space transitions:
- kvm (kernel) -> qemu (user)
- qemu (user) -> write() system call (kernel)
Kernel-user space transitions are expensive. Therefore /dev/vhost-* devices are introduced. These devices run in the host kernel space. User space VMMs delegate most VM data in/out operations to these devices through kvm's irqfd and ioeventfd. This eliminates a lot of unnecessary kernel-user space transitions.
With the example above, the control flow for a network packet now looks like
- kvm (kernel) -> vhost-net (kernel)
- vhost-net (kernel) -> tap device (kernel)
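Below is a hedged sketch of how a user space VMM might hand one virtqueue's data plane to the in-kernel vhost-net driver, using ioctls from <linux/vhost.h>. Feature negotiation, the guest memory table, ring sizing, error handling, and the matching KVM_IOEVENTFD/KVM_IRQFD registrations on the kvm side are all omitted; tap_fd and the ring addresses are assumed to be set up elsewhere in the VMM.

```c
#include <fcntl.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

void attach_vhost_net(int tap_fd, struct vhost_vring_addr *addr)
{
    int vhost_fd = open("/dev/vhost-net", O_RDWR);
    ioctl(vhost_fd, VHOST_SET_OWNER, NULL);

    /* ioeventfd: the guest's virtqueue "kick" lands on this fd, so kvm can
     * signal vhost-net directly instead of exiting to user space. */
    int kick_fd = eventfd(0, EFD_NONBLOCK);
    /* irqfd: vhost-net signals this fd when it has used buffers, and kvm
     * injects the interrupt into the guest directly. */
    int call_fd = eventfd(0, EFD_NONBLOCK);

    struct vhost_vring_file kick = { .index = 0, .fd = kick_fd };
    struct vhost_vring_file call = { .index = 0, .fd = call_fd };
    ioctl(vhost_fd, VHOST_SET_VRING_ADDR, addr);
    ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);
    ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);

    /* Point the in-kernel data plane at the tap device. */
    struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };
    ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);
}
```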
Modularization: vhost-user
The example above off-loads the data processing to the host kernel. Vhost-user, on the other hand, off-loads data processing to another user space program, which is called a backend device. The VMM and the device talk through a unix domain socket. The benefit here is that a backend device can now be used by different VMMs (QEMU, crosvm, Cloud Hypervisor, etc.) as long as the VMM implements the vhost protocol. New VMMs do not need to implement virtio devices all over again.
Another benefit is security. A VMM usually needs to call many different kinds of system calls, while a backend device does not need that many. Developers can thus sandbox the backend device process in a more fine-grained way, i.e., use a more specific seccomp filter.
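As a small illustration of that idea (not the filter any real backend actually ships), a libseccomp allowlist for a backend's data-plane loop could look roughly like this:

```c
/* Link with -lseccomp. Real vhost-user backends use much larger, carefully
 * audited allowlists; the handful of syscalls below is only illustrative. */
#include <seccomp.h>
#include <stddef.h>

int install_backend_filter(void)
{
    /* Kill the process on any syscall that is not explicitly allowed. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return -1;

    /* A data-plane loop mostly needs to read/write fds and wait on them. */
    int allowed[] = {
        SCMP_SYS(read), SCMP_SYS(write), SCMP_SYS(ppoll),
        SCMP_SYS(epoll_wait), SCMP_SYS(close), SCMP_SYS(exit_group),
    };
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0);

    int rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}
```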
On the other hand, some virtio devices have to run as root (for example, virtiofs5). With vhost-user, we only need to give privileges to the device process, while keeping the VMM process non-privileged.
Vsock
Vsock6 stands for virtual sockets. Vsock is a mechanism for programs in the VM to talk to programs in the host OS. Compared with other inter-process communication methods, for example, unix domain sockets, vsock can go across the VM boundary. However, compared with TCP/IP, vsock cannot go beyond a physical machine. Vsock is also much simpler to configure. Vsock uses a context id, which is just a number, to address VMs. A context id is similar to an IP address in TCP/IP, but no routing tables are necessary. The host has context id 2. A client program in a VM can reach a server program listening on vsock port $PORT in the host by dialing 2:$PORT.
On Linux, as far as I know, there are two different kinds of implementations of vsock on the host side.
- vhost-vsock: This implementation relies on the vhost protocol discussed above. Host applications can call socket(AF_VSOCK, socket_type, 0) to create a vsock talking to a guest application. QEMU and crosvm use this model.
- Firecracker model7: vsocks on the host side are mapped to unix domain sockets. Host applications listen on or dial these sockets, and the VMM forwards the data into the VM. Firecracker and Cloud Hypervisor use this model.
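Under the vhost-vsock model, a host-side server is just an ordinary socket program. A minimal sketch (the port 1234 is an arbitrary example):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

int main(void)
{
    int srv = socket(AF_VSOCK, SOCK_STREAM, 0);

    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid = VMADDR_CID_ANY, /* accept connections from any guest CID */
        .svm_port = 1234,          /* hypothetical port */
    };
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 1);

    /* Echo whatever one guest client sends to stdout. */
    int conn = accept(srv, NULL, NULL);
    char buf[256];
    ssize_t n = read(conn, buf, sizeof(buf));
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(conn);
    close(srv);
    return 0;
}
```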
On the guest side, it's the same: programs just need to call socket(AF_VSOCK, socket_type, 0) to create a virtual socket.
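For example, a minimal guest-side client that dials the host server above on port 1234 (again an arbitrary example) might look like this:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

int main(void)
{
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid = VMADDR_CID_HOST, /* context id 2: the host */
        .svm_port = 1234,           /* hypothetical port */
    };

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    const char msg[] = "hello from the guest\n";
    write(fd, msg, sizeof(msg) - 1);
    close(fd);
    return 0;
}
```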