Sunday, September 6, 2009

Nested vmx support coming to kvm

Almost exactly a year ago I reported on nested svm for kvm - a way to run hypervisors as kvm guests, on AMD hosts. I'm happy to follow up with the corresponding feature for Intel hosts - nested vmx.

Unlike the nested svm patchset, which was relatively simple, nested vmx is quite complex. There are several reasons for this:

  • While svm uses a memory region to communicate between hypervisor and processor, vmx uses special instructions -- VMREAD and VMWRITE. kvm must trap and emulate the effect of these instructions, instead of allowing the guest to read and write as it pleases; a sketch of this emulation follows the list.
  • vmx is significantly more complex than svm: vmx uses 144 fields for hypervisor-to-processor communication, while svm gets along with just 91. All of those fields have to be virtualized. Note that nested virtualization must also reconcile the way kvm uses those fields with the way its guest (which is itself a hypervisor) uses them, which increases the complexity even further.
  • The nested vmx patchset implements support for Extended Page Tables (EPT) in the guest hypervisor, in addition to the existing support in the host. This means that kvm must support guest page tables in the 32-bit format, the 64-bit format, and now the EPT format as well.
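
To make the first point concrete, here is a minimal sketch, in C, of how a trapped VMREAD or VMWRITE might be redirected to a software copy of the guest hypervisor's VMCS. The structure, the field-to-index helper and the error handling are illustrative assumptions, not the actual kvm code, which deals with the real Intel field encodings and considerably more state.

    #include <stdint.h>
    #include <stdio.h>

    #define NR_SHADOW_FIELDS 144   /* one slot per virtualized vmcs field */

    /* Software copy of the guest hypervisor's vmcs, held by kvm. */
    struct shadow_vmcs {
        uint64_t fields[NR_SHADOW_FIELDS];
    };

    /* Map a field encoding to a slot in the software copy (a table lookup in reality). */
    static int field_to_index(unsigned long encoding)
    {
        return encoding < NR_SHADOW_FIELDS ? (int)encoding : -1;
    }

    /* Called when the guest hypervisor's VMREAD traps into kvm. */
    static int emulate_vmread(struct shadow_vmcs *vmcs, unsigned long encoding,
                              uint64_t *value)
    {
        int idx = field_to_index(encoding);

        if (idx < 0)
            return -1;              /* the real code signals a VMX failure to the guest */
        *value = vmcs->fields[idx];
        return 0;
    }

    /* Called when the guest hypervisor's VMWRITE traps into kvm. */
    static int emulate_vmwrite(struct shadow_vmcs *vmcs, unsigned long encoding,
                               uint64_t value)
    {
        int idx = field_to_index(encoding);

        if (idx < 0)
            return -1;
        vmcs->fields[idx] = value;  /* applied to the real vmcs only later */
        return 0;
    }

    int main(void)
    {
        struct shadow_vmcs vmcs = { { 0 } };
        uint64_t value;

        emulate_vmwrite(&vmcs, 2, 0x80000021);  /* the guest "programs" field 2 */
        if (emulate_vmread(&vmcs, 2, &value) == 0)
            printf("field 2 = 0x%llx\n", (unsigned long long)value);
        return 0;
    }

The important point is the indirection: the guest hypervisor believes it is programming the processor, while kvm merely records its requests and decides later how to apply them to the real VMCS.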

Support for EPT in the guest deserves special mention, since it is critical for obtaining reasonable performance. Without nested EPT, the guest hypervisor has to trap writes to its guest's page tables, as well as context switches. The guest hypervisor then services those traps by issuing VMREAD and VMWRITE instructions to communicate with the processor. Since those instructions themselves trap to kvm, every trap taken by the guest hypervisor is multiplied by quite a large factor into kvm traps.
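
To put illustrative numbers on that factor: if servicing a single such trap requires the guest hypervisor to issue, say, twenty VMREADs and VMWRITEs to inspect the exit and adjust guest state, then one event in the nested guest becomes roughly twenty exits into kvm instead of one.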

So how does nested EPT work?

Without nesting, EPT provides for two levels of address translation:
  1. The first level is managed by the guest, and translates guest virtual addresses (gva) to guest physical addresses (gpa).
  2. The second level is managed by the host (kvm), and translates guest physical addresses (gpa) into host physical addresses (hpa).

When nesting is introduced, we now have three levels of address translation:
  1. Nested guest virtual address (ngva) to nested guest physical address (ngpa) (managed by the nested guest)
  2. Nested guest physical address (ngpa) to guest physical address (gpa) (managed by the guest hypervisor)
  3. Guest physical address (gpa) to host physical address (hpa) (managed by the host - kvm)

Given that the hardware only supports two levels of address translation, we need to invoke software wizardry. Fortunately, we already have code in kvm that can fold two levels of address translation into one - the shadow mmu.

The shadow mmu, which is used when EPT or NPT are not available, folds the gva→gpa→hpa translation into a single gva→hpa translation, which the hardware does support. We can reuse this code to fold the ngpa→gpa→hpa translation into a single ngpa→hpa translation. Since the hardware supports two levels, it will then happily translate ngva→ngpa→hpa.
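
As a toy model of what this folding means (a conceptual sketch, not the shadow mmu code, which builds its tables lazily as the guest touches memory), think of each translation level as a table mapping page numbers at one level to page numbers at the next; the folded table is simply the composition of the two. The table contents below are made up for illustration.

    #include <stdio.h>

    #define PAGES 4

    /* ngpa page -> gpa page, managed by the guest hypervisor (made-up values). */
    static const int guest_ept[PAGES] = { 2, 0, 3, 1 };

    /* gpa page -> hpa page, managed by kvm (made-up values). */
    static const int kvm_map[PAGES] = { 7, 5, 6, 4 };

    /* ngpa page -> hpa page: the folded table the shadow mmu maintains. */
    static int shadow_ept[PAGES];

    int main(void)
    {
        /* Fold the two levels: compose the guest hypervisor's mapping with kvm's. */
        for (int ngpa = 0; ngpa < PAGES; ngpa++)
            shadow_ept[ngpa] = kvm_map[guest_ept[ngpa]];

        for (int ngpa = 0; ngpa < PAGES; ngpa++)
            printf("ngpa page %d -> hpa page %d\n", ngpa, shadow_ept[ngpa]);
        return 0;
    }

The hardware is then pointed at the folded table, so a nested-guest access needs only the two walks the hardware supports: ngva→ngpa through the nested guest's own page tables, and ngpa→hpa through the folded table. Whenever the guest hypervisor changes its ngpa→gpa mapping, kvm has to notice and refresh the affected folded entries, which is exactly the kind of tracking the shadow mmu already knows how to do.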

But what about performance? Weren't NPT and EPT introduced to solve performance problems with the shadow mmu? Shadow mmu performance depends heavily on the rate of change of the two translation levels folded together. Virtual address translations (gva→gpa or ngva→ngpa) do change very frequently, but physical address translations (ngpa→gpa or gpa→hpa) change only rarely, usually in response to a guest starting up or swapping activity. So, while the code is complex and relatively expensive, it will only be invoked rarely.

To summarize, nested vmx looks to be one of the most complicated features in kvm, especially if we wish to maintain reasonable performance. It is expected that it will take Orit Wasserman and the rest of the IBM team some time to mature this code, but once this work is complete, kvm users will be able to enjoy another unique kvm feature.