Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/tests/docker: wait for docker service #109418

Merged
1 commit merged into from Jan 15, 2021
Merged

nixos/tests/docker: wait for docker service #109418

1 commit merged into from Jan 15, 2021

Conversation

ghost
Copy link

@ghost ghost commented Jan 15, 2021

Previously the Docker daemon was started by systemd socket activation.
Thus, the Docker test waited for the sockets.target unit.
But when the docker module was changed to start the Docker daemon at
boot instead of by socket activation, the test was left untouched.
With the Docker 20.10 update this lead to a timing issue, where the
docker command is run before the Docker daemon has started and hangs.

Fixes #109416

Motivation for this change
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@ghost
Copy link
Author

ghost commented Jan 15, 2021

Also CC @mikroskeem who created the Docker 20.10 PR

Previously the Docker daemon was started by systemd socket activation.
Thus, the Docker test waited for the sockets.target unit.
But when the docker module was changed to start the Docker daemon at
boot instead of by socket activation, the test was left untouched.
With the Docker 20.10 update this lead to a timing issue, where the
docker command is run before the Docker daemon has started and hangs.

Fixes #109416
@ghost ghost requested a review from roberth as a code owner January 15, 2021 02:31
@mikroskeem
Copy link
Member

LGTM 👍 Test passed locally (and now I know how to run them)

@ghost ghost merged commit f634c14 into NixOS:master Jan 15, 2021
@roberth
Copy link
Member

roberth commented Jan 15, 2021

I don't think this change is a solution to the real problem. We should be able to rely on socket activation, making the system more robust.

The hanging behavior was probably caused by this change #108960 (review) replacing systemd's socket by docker's own socket.

  1. systemd creates sockets
  2. test runs docker client, which waits for its connection to be accepted
  3. docker starts and replaces the socket
  4. client waits infinitely for the original socket and can't use docker's new socket

@roberth roberth mentioned this pull request Jan 15, 2021
10 tasks
@ghost
Copy link
Author

ghost commented Jan 15, 2021

The explanation for why #108960 broke the tests sounds plausible, but now i'm confused as to whether socket activation for the docker daemon is a thing currently, since the module has this line:

    (mkRemovedOptionModule ["virtualisation" "docker" "socketActivation"] "This option was removed in favor of starting docker at boot")

@mikroskeem
Copy link
Member

Socket activation for Docker is working just fine - see #108960 (review)

Just for testing sake:

{ config, pkgs, lib, ... }:
{
  systemd.services.docker.serviceConfig.ExecStart = with lib; let
    cfg = config.virtualisation.docker;
  in [
    ""
            ''
              ${cfg.package}/bin/dockerd \
                --group=docker \
                --host=fd:// \
                --log-driver=${cfg.logDriver} \
                ${optionalString (cfg.storageDriver != null) "--storage-driver=${cfg.storageDriver}"} \
                ${optionalString cfg.liveRestore "--live-restore" } \
                ${optionalString cfg.enableNvidia "--add-runtime nvidia=${pkgs.nvidia-docker}/bin/nvidia-container-runtime" } \
                ${cfg.extraOptions}
            ''
  ];
  virtualisation.docker.enable = true;
  virtualisation.docker.listenOptions = ["/tmp/docker.sock" "/run/docker.sock"];
}

This reverts the rogue change I introduced and passes two unix sockets to Docker daemon. You can see from the logs that:

Jan 15 15:48:49 nixos dockerd[11567]: time="2021-01-15T15:48:49.419707431+02:00" level=info msg="Daemon has completed initialization"
Jan 15 15:48:49 nixos dockerd[11567]: time="2021-01-15T15:48:49.430896238+02:00" level=info msg="API listen on /tmp/docker.sock"
Jan 15 15:48:49 nixos dockerd[11567]: time="2021-01-15T15:48:49.433524303+02:00" level=info msg="API listen on /run/docker.sock"

socket activation actually works.

@ghost
Copy link
Author

ghost commented Jan 15, 2021

Okay, then we should change the warning for virtualisation.docker.socketActivation to something like "Socket activation is now the default and this option no longer has any effect", revert the fd->unix change and the changes from this PR.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nixos.tests.docker was broken by docker 19.03.4 → 20.10.2 upgrade
2 participants