Deploy Targets: Policy/Behavior-free Deployment Hooks (auto-rollbacks, drain events, etc.) #1245
I think it probably makes sense to have these steps depend on one another for the sake of not having partial deploys: All
This is to facilitate units that should _only_ be manually started and not activated when a configuration is switched to. More specifically this is to be used by the new Nixops deploy-* targets created in NixOS/nixops#1245 that are triggered by Nixops before/after switch-to-configuration is called.
It seems like for NixOS there are a couple of Nixops-independent stages that could have targets:
I'm not sure if we want to do this, but we could attach some semantics to these that would be useful even outside Nixops. For example, we could have If we did this, then |
@infinisil, I like that idea.
I think you're right, and these sorts of additional targets are exactly why I hesitate on implementing this in NixOS directly. I think there are so many use cases and details to consider. I like the idea of renaming
Since these targets are policy and behavior free, their only value is in having defined semantics from the start.
This gets a little bit tricky, because tools like NixOps don't use […]. Additionally, the one thing I've learned researching this implementation is that applying any behavior by default is likely risky: there are so many different use cases and policies. These specific targets are also potentially only the beginning, which is why I don't yet want to introduce this into NixOS itself. Consider a more careful process around bootloaders, where we may well want to implement behavior like:
This potentially requires cooperation from something like https://www.intel.com/content/www/us/en/support/articles/000007197/server-products/server-boards.html My goal for this PR specifically is to make a small nibble of progress towards a more robust confirmed-safe deployment process. |
I take your point, and I agree that adding behaviour complicates things a lot!
Perhaps automatic rollbacks are too scary a behaviour, but something simpler like triggering a monitoring endpoint might be a more obviously okay use of |
@grahamc The rollback behavior you described there is pretty much how I implemented it here: https://github.com/Infinisil/nixoses/blob/42ae5dd61a7c65c172bb5d27785879fdbd6c3516/scripts/switch, and the accompanying Nix code: https://github.com/Infinisil/nixoses/blob/42ae5dd61a7c65c172bb5d27785879fdbd6c3516/modules/deploy.nix Has been working out pretty well for me |
edited to delete a bunch of copypasta which showed up for reasons I can't explain
On the contrary, this is exactly what I'm using it for! But it is, importantly, up to the people maintaining the machines what the right use case is :). |
I went to implement starting I think this hints towards an additional target, |
A question is should nixops start I'm inclined to do it right before activation, and after copying the closure. And moreso, I'm thinking only doing it if the deployment mode is |
This is to facilitate units that should _only_ be manually started and not activated when a configuration is switched to. More specifically this is to be used by the new Nixops deploy-* targets created in NixOS/nixops#1245 that are triggered by Nixops before/after switch-to-configuration is called. (cherry picked from commit db6c943)
Co-Authored-By: adisbladis <adisbladis@gmail.com>
…lure Co-Authored-By: adisbladis <adisbladis@gmail.com>
This reverts commit df56da0.
… to deploy that exact top level.
… in the case of a network problem which expired a timer. The positioning of this is obvious if we rename deploy-healthy to deploy-commit
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: |
@grahamc @adisbladis may I ask if there's any chance that this will be picked up at some point? I could help out a bit maybe if somebody tells me what's missing / tbd here :) |
    if code == 0:
        return True
    elif code == 5:
        return True
    else:
        return False

Suggested change:

    return code == 0 or code == 5
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/seamless-nixos-rebuild-switch-with-network-restart/29312/3 |
(don't merge yet)
This PR adds trivial targets without any particular behavior or policy
to NixOps. These targets allow the system to react to deployment
events, and let users customize the policy for their own use case.
In particular, this PR is authored with the intent of providing safe
and automatic rollbacks for embedded NixOS systems which are hard to
access and repair. Thank you to Yakkertech for sponsoring this work.
The following targets are added, with a note of when they become
active:
- deploy-prepare.target - before a new system is activated.
- deploy-healthy.target - after NixOps reconnects over SSH.
- deploy-failed.target - if deploy-healthy.target fails.
- deploy-complete.target - after every system in the given deployment (ie: --include / --exclude) completes successfully.

During a deploy, each system has the following steps executed
independently:
- deploy-prepare.target. If this fails, fail this machine's deploy.
- deploy-healthy.target. If this fails, deploy-failed.target is started automatically, and NixOps sees an error.
Once every server has completed these steps, if they all completed
successfully NixOps will SSH to each machine and start
deploy-complete.target.

Importantly, none of these targets actually do anything on their
own. NixOps has expressed no preference or policy or behavior.
@adisbladis and I feel the space is too large and complicated for any
one implementation to get it right.
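To make the ordering above concrete, here is a rough sketch of the flow in Python. It is not the code in this PR: `start_target` and `activate` are hypothetical callables standing in for "SSH in and start a systemd target" and "switch to the new configuration", and having the orchestrator start deploy-failed.target itself is a simplification of "started automatically". The text does not say what happens when activation itself fails, so the sketch just reports failure.

```python
from typing import Callable, Iterable

# Hypothetical signatures; neither callable is part of NixOps' real API.
StartTarget = Callable[[str, str], bool]  # (machine, target) -> started OK?
Activate = Callable[[str], bool]          # (machine) -> activation OK?


def deploy_machine(machine: str, start_target: StartTarget, activate: Activate) -> bool:
    """Run the per-machine steps; return True if the machine ends up healthy."""
    if not start_target(machine, "deploy-prepare.target"):
        return False  # preparation failed or the machine refused the deploy
    if not activate(machine):
        return False  # assumption: activation failure is simply reported
    if not start_target(machine, "deploy-healthy.target"):
        # Simplification: here the orchestrator starts the failure target itself.
        start_target(machine, "deploy-failed.target")
        return False
    return True


def deploy_all(machines: Iterable[str], start_target: StartTarget, activate: Activate) -> bool:
    """Deploy every machine, then start deploy-complete.target only if all succeeded."""
    machines = list(machines)
    results = [deploy_machine(m, start_target, activate) for m in machines]
    if not all(results):
        return False
    for m in machines:
        start_target(m, "deploy-complete.target")
    return True
```

The point of the sketch is only the ordering: no machine sees deploy-complete.target unless every machine in the deployment reached deploy-healthy.target.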
Current Problems and To-Do
- switch-to-configuration.pl starts deploy-prepare.target and its dependents, causing preparation steps to run too many times, and at inappropriate times. @adisbladis will be working on a solution to this Monday.
- Before, Requires, and other directives are very sensitive. We need to provide very precise and accurate tests and documentation on exactly how a user can and should hook in to these targets.
- If deploy-prepare gets stuck and will only refuse a deployment, NixOps has no way to ignore that and deploy anyway. This is a problem, because there is no way to automatically recover without manually SSHing in and rolling back to some other system version.
Future Work
- […] nixos-rebuild about these targets.
- […] these targets can be used to safely implement various use cases.
Use Cases
Draining Phase
Hooking an event in to deploy-prepare.target allows the server
itself to delay a deployment. This can be used to, for example:

Example, a server removing itself from an ELB before the deployment,
and adding itself after the system's deployment is healthy:
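As a rough illustration of the "delay" part, the script run by a unit hooked in to deploy-prepare.target could look something like the sketch below. Everything here is an assumption: `deregister` and `in_flight` are hypothetical callables (for an ELB they would wrap whatever client you use), and the timeout values are arbitrary.

```python
import time
from typing import Callable


def drain(
    deregister: Callable[[], None],
    in_flight: Callable[[], int],
    timeout_s: float = 300.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Deregister, then wait until nothing is in flight or the timeout passes.

    Returns True if the server drained in time, False otherwise.
    """
    deregister()  # hypothetical: stop new traffic from arriving
    deadline = clock() + timeout_s
    while in_flight() > 0:
        if clock() >= deadline:
            return False  # still busy: the caller can fail the unit
        sleep(poll_s)
    return True
```

A wrapper script would exit non-zero when `drain` returns False, so the unit fails and the preparation step fails with it. Re-registering would be a separate hook on the healthy side.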
Critical Window Protection
Hooking an event in to deploy-prepare.target allows the server
itself to prevent a deployment, too. This could be used to protect a
system from deployments during a critical time window. For example, a
system which absolutely must remain undisturbed during a critical
production event could flat out refuse the deployment:

Example, a server which refuses deploys after 12:
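A minimal sketch of such a check, written as the script a unit hooked in to deploy-prepare.target might run. It assumes "after 12" means local noon, and the function names are made up for illustration.

```python
import datetime

# Assumption: "after 12" is read as 12:00 local time.
CUTOFF = datetime.time(12, 0)


def deploy_allowed(now: datetime.time, cutoff: datetime.time = CUTOFF) -> bool:
    """Allow deploys strictly before the cutoff, refuse them from the cutoff on."""
    return now < cutoff


def main() -> int:
    """Return a process exit status: 0 allows the deploy, 1 refuses it."""
    return 0 if deploy_allowed(datetime.datetime.now().time()) else 1
```

The script would exit with `main()`'s return value. A non-zero exit fails the unit, which fails deploy-prepare.target and therefore this machine's deploy.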
Coordinated Distributed Database Maintenance
Many distributed databases will identify a failed machine and begin
reallocating data, assuming the old machine will not come back. A
graceful, coordinated shutdown can prevent this.
For example, with Elasticsearch it is important to disable shard
allocation during a deployment, and forcing a synced flush will
improve recovery time.
Elasticsearch in particular is interesting, because we can use the
deploy-complete hook to ensure every machine has finished before
re-enabling allocation: something NixOS and NixOps do not easily
support right now.
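For reference, the Elasticsearch calls involved could be built roughly like this. This is a sketch, not a tested hook: the base URL is an assumption, and the requests are only constructed, not sent. The `cluster.routing.allocation.enable` setting and the `_flush/synced` endpoint are Elasticsearch APIs, but synced flush was deprecated in 7.6 and removed in 8.0, so check your version.

```python
import json
import urllib.request

# Assumption: a node reachable on the default local HTTP port.
ES_URL = "http://localhost:9200"


def allocation_request(mode: str, base_url: str = ES_URL) -> urllib.request.Request:
    """Build a request that sets cluster.routing.allocation.enable to `mode`.

    "none" disables allocation entirely and "all" re-enables it.
    """
    body = json.dumps(
        {"persistent": {"cluster.routing.allocation.enable": mode}}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def synced_flush_request(base_url: str = ES_URL) -> urllib.request.Request:
    """Build a synced flush request (only on versions that still support it)."""
    return urllib.request.Request(f"{base_url}/_flush/synced", method="POST")
```

A prepare-side hook would send `allocation_request("none")` followed by the synced flush, and a hook on deploy-complete would send `allocation_request("all")` once every machine has finished.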
Automatic Rollback After a Failed Deployment
With the addition of a timer, we can implement an automatic rollback
in not very much code. In this example, the automatic rollback
is triggered in two cases:
- deploy-healthy.target is activated, but fails
- deploy-healthy.target is not activated within 1 minute of deploy-prepare.target, indicating the new system configuration broke the network or system in a way preventing the deployment host from confirming the new system.
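The two cases boil down to a small decision, sketched here in Python. This is only the logic, not the timer and unit wiring; the function name and the way the health result is represented are assumptions made for illustration.

```python
from typing import Optional

# One minute, per the description above.
HEALTHY_DEADLINE_S = 60.0


def should_roll_back(
    prepare_started_at: float,
    now: float,
    healthy_result: Optional[bool],
    deadline_s: float = HEALTHY_DEADLINE_S,
) -> bool:
    """Decide whether to roll back.

    healthy_result is None while deploy-healthy.target has not been activated,
    True if it was activated and succeeded, False if it was activated and failed.
    Times are in seconds on the same clock.
    """
    if healthy_result is False:
        return True  # case 1: activated, but failed
    if healthy_result is None:
        # case 2: never activated within the deadline
        return now - prepare_started_at >= deadline_s
    return False  # activated and healthy: keep the new system
```

In a real setup the deadline would most likely be a timer started alongside deploy-prepare.target and cancelled when deploy-healthy.target succeeds.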
note: the examples are just examples and have not been tested