
Deploy Targets: Policy/Behavior-free Deployment Hooks (auto-rollbacks, drain events, etc.) #1245

Closed
wants to merge 7 commits

Conversation

@grahamc (Member) commented Mar 6, 2020

(don't merge yet)

This PR adds trivial targets, without any particular behavior or policy,
to NixOps. These targets allow the system to react to deployment
events, and let users customize how to react for their own
use case.

In particular, this PR is authored with the intent of providing safe
and automatic rollbacks for embedded NixOS systems which are hard to
access and repair. Thank you to Yakkertech for sponsoring this work.

The following targets are added, with a note of when they become
active:

  • deploy-prepare.target - before a new system is activated.
  • deploy-healthy.target - after NixOps reconnects over SSH.
  • deploy-failed.target - if deploy-healthy.target fails.
  • deploy-complete.target - after every system in the given
    deployment (i.e. --include / --exclude) completes successfully.

During a deploy, each system has the following steps executed
independently:

  1. copy the system closure
  2. start deploy-prepare.target. If this fails, fail this machine's
    deploy.
  3. run the new system's activation script
  4. close the SSH connection
  5. open a new SSH connection. If this fails, the next step is skipped.
  6. start deploy-healthy.target. If this fails,
    deploy-failed.target is started automatically, and NixOps sees an
    error.

Once every server has completed these steps, and all of them completed
successfully, NixOps will SSH to each machine and start
deploy-complete.target.
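
A machine participates in these steps by attaching its own units to the
targets. As a rough skeleton of the pattern the use cases below follow
(my-hook is a placeholder name):

{ ... }: {
  systemd.targets.deploy-prepare.unitConfig.Requires = [ "my-hook.service" ];
  systemd.services.my-hook = {
    unitConfig.Before = [ "deploy-prepare.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;

    # If this script exits non-zero, deploy-prepare.target fails to start,
    # which fails this machine's deploy (step 2 above).
    script = "true";
  };
}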

Importantly, none of these targets actually do anything on their
own. NixOps has expressed no preference or policy or behavior.
@adisbladis and I feel the space is too large and complicated for any
one implementation to get it right.

Current Problems and To-Do

  • During activation, NixOS's switch-to-configuration.pl starts
    deploy-prepare.target and its dependents, causing preparation
    steps to run too many times, and at inappropriate times. @adisbladis
    will be working on a solution to this on Monday.
  • The Before, Requires, and other directives are very sensitive.
    We need to provide very precise and accurate tests and documentation
    on exactly how a user can and should hook into these targets.
  • If deploy-prepare gets stuck or refuses a deployment outright,
    NixOps has no way to override it and deploy anyway. This is a
    problem, because there is no way to recover automatically without
    manually SSHing in and rolling back to some other system version.

Future Work

  • Potentially move the targets into NixOS itself, teaching
    nixos-rebuild about these targets.
  • Publish to some well-known location a set of expressions for ways
    these targets can be used to safely implement various use cases.

Use Cases

Draining Phase

Hooking an event into deploy-prepare.target allows the server
itself to delay a deployment. This can be used to, for example:

  • remove itself from a load balancer
  • wait for an important measurement to complete
  • wait for in-progress builds to complete
  • wait for the mail queue to empty

For example, a server removing itself from an ELB before the deployment,
and adding itself back after the system's deployment is healthy:

{ pkgs, ... }: {
  systemd.targets.deploy-prepare.unitConfig.Requires = [ "leave-elb.service" ];
  systemd.services.leave-elb = {
    unitConfig.Before = [ "deploy-prepare.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.awscli pkgs.jq ];

    script = ''
      aws elb deregister-instances-from-load-balancer \
          --load-balancer-name prod-web-traffic-load-balancer \
          --instances my-instance-id | jq .
    '';
  };

  systemd.targets.deploy-healthy.unitConfig.Requires = [ "join-elb.service" ];
  systemd.services.join-elb = {
    unitConfig.After = [ "deploy-healthy.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.awscli pkgs.jq ];

    script = ''
      aws elb register-instances-with-load-balancer \
          --load-balancer-name prod-web-traffic-load-balancer \
          --instances my-instance-id | jq .
    '';
  };
}

Critical Window Protection

Hooking an event into deploy-prepare.target allows the server
itself to prevent a deployment, too. This could be used to protect a
system from deployments during a critical time window. For example, a
system which absolutely must remain undisturbed during a critical
production event could flat out refuse the deployment:

For example, a server which refuses deploys after noon:

{ pkgs, ... }: {
  systemd.targets.deploy-prepare.unitConfig.Requires = [
    "no-afternoon-deploys.service"
  ];
  systemd.services.no-afternoon-deploys = {
    unitConfig.Before = [ "deploy-prepare.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;

    script = ''
      hour=$(date +%H)
      if [ "$hour" -ge 12 ]; then
        echo "Don't deploy during the afternoon!"
        exit 1
      fi
    '';
  };
}

Coordinated Distributed Database Maintenance

Many distributed databases will identify a failed machine and begin
reallocating data, assuming the old machine will not come back. A
graceful, coordinated shutdown can avoid this.

For example, with Elasticsearch it is important to disable shard
allocation during a deployment, and forcing a synced flush will
improve recovery time.

Elasticsearch in particular is interesting, because we can use the
deploy-complete hook to ensure every machine has finished before
re-enabling allocation: something NixOS and NixOps do not easily
support right now.

{ config, pkgs, ... }:
let
  esConfig = config.services.elasticsearch;
  esUrl = "http://${esConfig.listenAddress}:${toString esConfig.port}";
in {
  systemd.targets.deploy-prepare.unitConfig.Requires = [
    "elasticsearch-pre-deploy.service"
  ];
  systemd.services.elasticsearch-pre-deploy = {
    unitConfig.Before = [ "deploy-prepare.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.curl ];

    script = ''
      echo "Disabling shard replication during node shutdown..."

      curl -X PUT "${esUrl}/_cluster/settings?pretty" \
        -H 'Content-Type: application/json' -d'
        {
          "persistent": {
            "cluster.routing.allocation.enable": "primaries"
          }
        }
      '

      echo "Forcing a synced flush to speed up recovery"
      curl -X POST "${esUrl}/_flush/synced?pretty"
    '';
  };

  systemd.targets.deploy-complete.unitConfig.Requires = [
    "elasticsearch-post-deploy.service"
  ];
  systemd.services.elasticsearch-post-deploy = {
    unitConfig.After = [ "deploy-complete.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.curl ];

    script = ''
      echo "Re-enabling allocation"
      curl -X PUT "${esUrl}/_cluster/settings?pretty" \
        -H 'Content-Type: application/json' -d'
        {
          "persistent": {
            "cluster.routing.allocation.enable": null
          }
        }
      '
    '';
  };
}

Automatic Rollback After a Failed Deployment

With the addition of a timer, we can implement an automatic rollback
with very little code. In this example, the automatic rollback
is triggered in two cases:

  • deploy-healthy.target is activated, but fails
  • deploy-healthy.target is not activated within 1 minute of
    deploy-prepare.target, indicating the new system configuration
    broke the network or system in a way preventing the deployment host
    from confirming the new system.

The timer is pulled in by deploy-prepare.target (wantedBy) and conflicts
with deploy-healthy.target, so reaching a healthy state within the minute
stops the timer before it can trigger the rollback.

{ pkgs, ... }:
{
  systemd.services.automatic-rollback = {
    description = "Automatic rollback";
    wantedBy = [ "deploy-failed.target" ];
    script = ''
      echo "Rolling back!"
      nix-env --rollback -p /nix/var/nix/profiles/system
      /nix/var/nix/profiles/system/bin/switch-to-configuration switch
    '';
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = false;
  };

  systemd.timers.automatic-rollback = {
    enable = true;
    wantedBy = [ "deploy-prepare.target" ];

    unitConfig = {
      Conflicts = [ "deploy-healthy.target" ];
    };
    timerConfig = {
      OnActiveSec = "1m";
      RemainAfterElapse = false;
    };
  };
}

note: the examples are just examples and have not been tested

@grahamc changed the title from "Deploy targets start" to "Deploy Targets: Policy/Behavior-free Deployment Hooks (auto-rollbacks, drain events, etc.)" Mar 6, 2020
@infinisil (Member)

During a deploy, each system has the following steps executed
independently:

I think it probably makes sense to have these steps depend on one another for the sake of not having partial deploys: every machine's deploy-prepare.target should be required to finish activating before any activation scripts are called, similar to how a two-phase commit protocol works.

adisbladis added a commit to adisbladis/nixpkgs that referenced this pull request Mar 9, 2020
This is to facilitate units that should _only_ be manually started and
not activated when a configuration is switched to.

More specifically this is to be used by the new Nixops deploy-*
targets created in NixOS/nixops#1245 that are
triggered by Nixops before/after switch-to-configuration is called.
@michaelpj

Potentially move the targets into NixOS itself, teaching nixos-rebuild about these targets.

It seems like for NixOS there are a couple of NixOps-independent stages that could have targets:

  • pre-activation: activated by nixos-rebuild before it begins activating the system
  • post-activation: activated by nixos-rebuild after it finishes activating the system.

I'm not sure if we want to do this, but we could attach some semantics to these that would be useful even outside Nixops. For example, we could have nixos-rebuild do an automatic rollback if post-activation fails. Then a user could add a dependency from post-activation to a particular service and that would automatically roll back any upgrade where that service is failing afterwards.
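
A rough sketch of what that user-side dependency could look like, assuming the hypothetical post-activation.target and a placeholder my-critical-service:

{ ... }: {
  # If my-critical-service fails to start, post-activation.target fails to
  # activate, which nixos-rebuild could then treat as a signal to roll back.
  systemd.targets.post-activation.unitConfig.Requires = [ "my-critical-service.service" ];
  systemd.services.my-critical-service.unitConfig.Before = [ "post-activation.target" ];
}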

If we did this, then deploy-prepare could perhaps be replaced by pre-activation. deploy-healthy is a bit more than post-activation, since it also includes the "external" health check of Nixops being able to reconnect to the system.

@grahamc (Member, Author) commented Mar 9, 2020

@infinisil, I like that idea.

@michaelpj,

Potentially move the targets into NixOS itself, teaching nixos-rebuild about these targets.

It seems like for NixOS there are a couple of NixOps-independent stages that could have targets:

I think you're right, and these sorts of additional targets are exactly why I hesitate to implement this in NixOS directly: there are so many use cases and details to consider. I like the idea of renaming deploy-prepare.target to pre-activation.target, and creating a post-activation.target in addition to deploy-healthy.target.

I'm not sure if we want to do this, but we could attach some semantics to these that would be useful even outside Nixops.

Since the targets in this PR are policy and behavior free, the only value in these targets is if they have defined semantics from the start.

For example, we could have nixos-rebuild do an automatic rollback if post-activation fails.

This gets a little bit tricky, because tools like NixOps don't use nixos-rebuild -- so we need to be 100% certain the behavior and semantics match across the ecosystem.

Additionally, the one thing I've learned researching this implementation is that applying any behavior by default is likely risky: there are so many different use cases and policies. In addition, these specific targets are potentially only the beginning. This is also why I don't yet want to introduce this into NixOS itself.

Consider the case of a more careful process around bootloaders; we may very well want to implement behavior like:

  1. copy closure
  2. test-activate, if failed roll back
  3. validate the connection works
  4. write a bootloader in a "one-time-boot" mode
  5. next boot, if it fails, reboot again causing a bootloader-level rollback
  6. on boot, confirm the new system booted successfully and make it the permanent default in the bootloader

This potentially requires cooperation from something like https://www.intel.com/content/www/us/en/support/articles/000007197/server-products/server-boards.html

My goal for this PR specifically is to make a small nibble of progress towards a more robust confirmed-safe deployment process.

@michaelpj

Since the targets in this PR are policy and behavior free, the only value in these targets is if they have defined semantics from the start.

I take your point, and I agree that adding behaviour complicates things a lot!

pre-activation and post-activation have the advantage that they have a clear meaning: you run them before and after the activation script, regardless of whether that's invoked by nixos-rebuild or nixops.

Perhaps automatic rollbacks are too scary a behaviour, but something simpler like triggering a monitoring endpoint might be a more obviously okay use of post-activation.

@infinisil (Member)

@grahamc The rollback behavior you described there is pretty much how I implemented it here: https://github.com/Infinisil/nixoses/blob/42ae5dd61a7c65c172bb5d27785879fdbd6c3516/scripts/switch, and the accompanying Nix code: https://github.com/Infinisil/nixoses/blob/42ae5dd61a7c65c172bb5d27785879fdbd6c3516/modules/deploy.nix

Has been working out pretty well for me

@grahamc (Member, Author) commented Mar 9, 2020

edited to delete a bunch of copypasta which showed up for reasons I can't explain

Perhaps automatic rollbacks are too scary a behaviour, but something simpler like triggering a monitoring endpoint might be a more obviously okay use of post-activation.

On the contrary, this is exactly what I'm using it for! But it is, importantly, up to the people maintaining the machines what the right use case is :).
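
A minimal sketch of that kind of hook (untested, like the examples above; the endpoint URL and the notify-monitoring name are placeholders):

{ pkgs, ... }: {
  # Wants (rather than Requires) so a monitoring outage cannot fail the deploy.
  systemd.targets.deploy-healthy.unitConfig.Wants = [ "notify-monitoring.service" ];
  systemd.services.notify-monitoring = {
    unitConfig.After = [ "deploy-healthy.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.curl ];

    script = ''
      curl -fsS -X POST "https://monitoring.example.com/api/deploy-succeeded" \
        --data "host=$HOSTNAME"
    '';
  };
}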

@grahamc (Member, Author) commented Mar 9, 2020

I went to implement starting deploy-prepare.target on each host first, and it reminded me of why I chose to not do that in the first place. Since the timer may start when deploy-prepare.target is reached, it is important that the very next step is activation. Otherwise we create a pretty nasty race condition on large networks, or networks with slow links.

I think this hints towards an additional target, deploy-begin.target, and renaming deploy-prepare.target to pre-activation.target.

@grahamc (Member, Author) commented Mar 9, 2020

A question is should nixops start deploy-begin...

I'm inclined to do it right before activation, and after copying the closure.

Moreover, I'm thinking of only doing it if the deployment mode is test or switch, and not in a build-only, copy-only, or dry-activate mode.

dtzWill pushed a commit to dtzWill/nixpkgs that referenced this pull request Mar 10, 2020
This is to facilitate units that should _only_ be manually started and
not activated when a configuration is switched to.

More specifically this is to be used by the new Nixops deploy-*
targets created in NixOS/nixops#1245 that are
triggered by Nixops before/after switch-to-configuration is called.

(cherry picked from commit db6c943)
grahamc and others added 2 commits March 13, 2020 11:38
(Co-Authored-By: adisbladis <adisbladis@gmail.com>)

  • …lure
  • … in the case of a network problem which expired a timer. The positioning
    of this is obvious if we rename deploy-healthy to deploy-commit
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update/6525/1

@grahamc added this to In progress in kanban Apr 23, 2020
@grahamc moved this from In progress to To do in kanban Apr 23, 2020
@Ma27 (Member) commented May 18, 2021

@grahamc @adisbladis may I ask if there's any chance that this will be picked up at some point? I could help out a bit maybe if somebody tells me what's missing / tbd here :)

Comment on lines +483 to +488
if code == 0:
    return True
elif code == 5:
    return True
else:
    return False

Suggested change:

return code == 0 or code == 5

Comment on lines +464 to +469
if code == 0:
    return True
elif code == 5:
    return True
else:
    return False

Suggested change:

return code == 0 or code == 5

Comment on lines +419 to +424
if code == 0:
    return True
elif code == 5:
    return True
else:
    return False


Suggested change:

return code == 0 or code == 5

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/seamless-nixos-rebuild-switch-with-network-restart/29312/3

@grahamc closed this Oct 10, 2023
kanban automation moved this from To do to Done Oct 10, 2023