[RFC 0012] Declarative virtual machines #12

Ekleog · 2017-04-08T16:29:41Z

Here is a proposition of a NixOS module to declaratively handle virtual machines.

Rendered: https://github.com/Ekleog/nixos-rfcs/blob/virtual-machines/rfcs/0012-declarative-virtual-machines.md

Having started work on it quite a while ago (2017-02-25 acc. to my git history), I have something working that almost fits this RFC in a ~600 lines module file and a few additions of functions to the nixpkgs lib (for auto-generating the IPs, as they could be used elsewhere), so concerns like "this will be hard to implement" should not be too much of an issue :) (that's not saying I'd drop the support if things needed to change, I'm perfectly willing to rework it so that it fits the bill)

Hope this may help!

cc @Nadrieril, and thank you for review and harsh criticism whenever I did something wrong 👍

Edit: Adding a link to the current state of the implementation: NixOS/nixpkgs@master...Ekleog:vms

copumpkin · 2017-05-08T13:48:18Z

I haven't read it in depth, but love the idea so far. Have you considered trying to abstract over containers and VMs using libvirt?

Ekleog · 2017-05-08T17:56:22Z

That's a really good point: we considered using libvirt to abstract over different kinds of virtualizers, but figured out that only one was actually needed, and the module in itself was already a "virtualizer abstraction layer". As it also adds one level of potential failure and, mostly, I don't know anything about libvirt, at the time the project started I went with qemu.

Now, abstracting also over containers would actually bring net value, by more or less merging the two modules together, and as they have different objectives (lightweight vs. heavyweight virtualization) it makes sense to have both.

However, I'm not sure the implementation could share much, exactly because the goals are different. I'll take the features described in the RFC and try to see how it could be merged with a containers implementation:

memorySize, vcpus: same as for containers
diskSize doesn't seem to make sense for containers [1]
Managing the disk of the guest (mostly its /nix/store): containers being lightweight, it makes sense for them to mount directly over the host's store, however this would leak information between VMs if used for VMs, hence the process to mount the store must differ (and is one of the major pain points, at least in my current implementation [2])
Shared directories: same as for containers
Networking (the second major pain point) appears quite different than what is currently proposed with containers to me, especially given containers.<name>.privateNetwork that would be very hard to impossible to replicate with libvirt
Security part (ie. running qemu as non-root) is irrelevant for containers
Nix support is unimplemented for the time being for VMs, I don't know its exact state for containers, could be a great way to share code (but I was thinking nix 1.12 and especially nixos-prepare-root would bring the same improvements)

Besides, I just learned while writing this answer that imperative containers rely on the inner working of the containers module [3], so replacing the containers module with something based on libvirt would be highly nontrivial, I guess.

Just to get an idea, I counted the lines of code in [2] that could be shared with containers using libvirt (assuming it's equally hard to write modules for qemu and libvirt). The answer is (excluding option definitions) 24: 2 for memorySize and vcpus and 22 for shared.

Obviously, this count is biased towards lower values: the current implementation does only the strict minimum and future additional features may be more factorizable than the current ones.

However, I think there is a wide enough gap between containers and VMs to make it maybe even harder to factor the two than to maintain a separate implementation for each. Do you see another place for factorization I missed?

[1] https://libvirt.org/drvlxc.html

[2] NixOS/nixpkgs@master...Ekleog:vms

[3] NixOS/nixpkgs#3021

Ekleog · 2017-05-08T18:32:00Z

Jumping to another subject while I'm at it: I brought this link to the ML [1].

From there two major points emerged.

First one, there are not enough tools to share code between different VM appliances. In my opinion, this one will be in a big part solved by nixos-prepare-root after nix 1.12, the remainder being mostly virtualization-system-specific. Then, it's the same point as the one you are raising, and I wonder if I'm not just not seeing possibilities for code factorization.

The second one was about the format for disk management and boot control. I will take elements back from [2]. The basic requirement is that the host be able to control the guest's boot, mostly for upgrades and downgrades. This can be achieved in three ways:

Having the guest boot, then wait in the initrd for its configuration, have the host push it, then the guest can continue booting. This requires nothing special on the guest's FS.
Having the guest's /nix on a separate .qcow2 image. This way, the host can decide to stop the guest, upgrade the store, then restart the guest.
Having the guest's /nix on a virtfs. This way, the host can upgrade the guest's store in-place.

There are drawbacks to all of these options:

makes the boot scheme complex to understand, and risks duplicating some behaviour between the dropbear in the initrd and the ssh daemon outside of it, in order to handle online upgrades
makes it really complex to handle online upgrades
may be slower and less stable (even though I haven't experienced unstability during the development), and makes it hard to control the guest's store size.

Besides, options 1 and 2 would I think be made way easier by the presence of nixos-prepare-root, which hints for waiting for it to be released with nix 1.12.

As you will have understood, my current favorite is option 3, and after the coming of nixos-prepare-root, option 1 would I think have my vote.

Do you see another decision argument? Another scheme for handling upgrades and downgrades effectively? What would be your choice?

[1] https://www.mail-archive.com/nix-dev@lists.science.uu.nl/msg36307.html

[2] https://www.mail-archive.com/nix-dev@lists.science.uu.nl/msg36354.html

danbst · 2017-05-26T19:19:50Z

What probably is lacking, is support for multiple disks. For example, NixOps project allows this for virtualbox backend, and same is requested for libvirtd backend.

Ekleog · 2017-05-26T20:27:42Z

Hmm, you are right and this feature can be useful/needed (to take back the example you gave on IRC, for having multiple disks on multiple VMs for redundancy for openstack).

However, I think it is more important to have something agreed on on which we can later on build additional features, than trying to have something perfect from the beginning on -- the important point being that all features that are included in a RFC should have a very low probability of changing, as they will eventually enter a stable version and stable versions should be as backwards-compatible as possible.

So I think your point is a great one, and we should definitively investigate handling multiple disks, but maybe after the basics are agreed on? :)

Unless other people would want it right now, I'm putting this as an unresolved question for the time being. This way it will stay on the "future works" step after this RFC, so that choice of the correct interface doesn't slow down having a first working version.

joepie91 · 2017-05-26T23:02:32Z

Just throwing in my two cents concerning libvirt - in my experience of attempting to use it, it's rather temperamental, and has a habit of producing utterly incomprehensible, difficult-to-debug error messages whenever anything goes wrong.

I can't speak for others, but personally I've started just ignoring the existence of libvirt - since for most cases it seems to add more work trying to figure out why it's breaking, than it's saving me effort on any other point.

teh · 2017-05-28T12:39:37Z

Another data point from me: I had some bad experiences with libvirt - daemon deadlocking and becoming unresponsive, weird error messages, VM corruption. The outstanding bugs are supporting evidence. Abstracting over virtualization is hard problem so maybe I'm too hard on libvirt :)

Ekleog · 2017-06-13T20:53:31Z

OK, so maybe I'm being biased, but it has been open for more than two months and has seen, as far as I can see, mostly positive support (with some propositions for refactoring that don't seem to have caught much traction).

Maybe it would be time to get to a final comments period (idea inspired from Rust's RFCs), where last comments could be raised and the decision would be announced after a given time if no game-changing comment turns up? (cc @zimbatm)

0xABAB · 2017-06-30T21:52:53Z

@Ekleog I am not sure what level of quality we are aiming for, but the "Use case" section would never ever be accepted under my watch in a project where I was in charge, because there is no substance. You should not take this personal, but your use case doesn't actually describe a use case; it just describes some qualities of a human (i.e. someone who cares about security). I didn't even read more of the proposal because of that.

Having said that, as long as you don't break anything pre existing, I am not against adding features, since such developments need to be used and refined over periods of years anyway. No first design is going to be great, unless someone has implemented the exact same thing already before, but given the ever changing technology space, that seems unlikely.

In short, my opinion is that we should basically go for a let's throw code at the wall system and see what sticks approach under the condition that the code is documented in the source code as well as in the manual with the understanding that it is always possible to create a new system based on whatever is learned. I.e., deprecation could happen at some time.

Ekleog · 2017-06-30T23:48:54Z

@0xABAB: Thanks for the comment!

For the “Use Case” section, given the summary section I don't really know how to make it more explicit that the described module is “like containers., but with security over speed”... would you have any better idea of wording for this section? I've pushed a tentative change, how do you feel about it?

zimbatm

👍 overall.

I would like to see more in detail how the qemu VM gets passed to the VM. Potentially it could be a path so arbitrary VMs can be run, and leave the incremental NixOS updates for later.

Also one thing that comes to mind is, would it be possible to take the libvirt configuration options as an inspiration? My guess is that they already played the game of common denominator.

zimbatm · 2017-07-02T11:45:17Z