Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support persistent spot #1102

Merged
merged 2 commits into from Apr 22, 2019
Merged

Conversation

PsyanticY
Copy link
Contributor

@PsyanticY PsyanticY commented Feb 20, 2019

This a followup and some fixes to #1011
cc @Chakerbh

@PsyanticY PsyanticY changed the title [WIP] Support persistant spot Support persistant spot Feb 26, 2019
@PsyanticY PsyanticY force-pushed the support_persistant_spot branch 2 times, most recently from 34950b4 to 5d447d4 Compare February 26, 2019 21:51
@PsyanticY PsyanticY force-pushed the support_persistant_spot branch 2 times, most recently from f2aea25 to c59d142 Compare March 21, 2019 17:57
@nh2
Copy link
Contributor

nh2 commented Apr 18, 2019

I’ve just tested to deploy my staging environment with this branch and it worked fine for me.

Note I haven’t tried out any of the new features added, only whether what I’m using right now still works (I also use a spot instance, but not yet the new flags you’ve added).

Copy link
Contributor

@nh2 nh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one odd thing:

https://github.com/NixOS/nixops/pull/1102/commits shows the first commit appear twice, and when I try to rebase this PR on top of something else, it complains about an empty commit.

Why is that / can it be fixed?

@PsyanticY
Copy link
Contributor Author

@nh2 Thanks for the review. i did get @Chakerbh branch and added o ntop of it so i may did something wrong there to cause the commit message to appear twice Ill check and fix that.

@PsyanticY PsyanticY changed the title Support persistant spot Support persistent spot Apr 18, 2019
@PsyanticY PsyanticY force-pushed the support_persistant_spot branch 2 times, most recently from 0bca858 to c685543 Compare April 19, 2019 09:53
@PsyanticY
Copy link
Contributor Author

@nh2 So i fixed the latter commits but i was not able to do for the duplicates ones i got a lot of conflicts when rebasing :( .

@AmineChikhaoui
Copy link
Member

@PsyanticY maybe try rebasing and squashing any extra commit + the 2 first commits

@PsyanticY PsyanticY force-pushed the support_persistant_spot branch 7 times, most recently from 1c85af6 to 587686c Compare April 20, 2019 00:49
This means when the instance gets interrupted, the request is still
valid, and if the termination behaviour is set to "stop", the instance
will start automatically when the request get fulfilled again by AWS.

There's no way to achieve this with Boto2, so I changed the call that
create the spot request to Boto3. I did some refactoring, so the same
call is used when creating both On demand and spot instances.
@PsyanticY
Copy link
Contributor Author

PsyanticY commented Apr 20, 2019

@AmineChikhaoui Done squashing the commits, also did re-base the pr on current master. (man that was a pain)

@AmineChikhaoui
Copy link
Member

@PsyanticY thanks !
Looking at the spot instance requests docs, looks like we can actually test this manually. As the request will remain active in case of persistent unless we cancel it, an aws ec2 terminates-instances call should fake a spot instance termination and the instance should restart at some point (at least this is my understanding).
I'm trying with that, would like to double check the behavior before merging.

@AmineChikhaoui
Copy link
Member

Ok interesting, so here is what happens:

I create a persistent spot instance request sir-foo, that request gets fullfilled with instance id i-first, then I terminate the instance. When following the output of describe-spot-instance-requests, the sir creates another instance i-second after a couple of minutes, now nixops doesn't have any idea about that so the deploy would fail as it doesn't check the output of the sir again to pull out the new vm id.
Even a destroy would fail and barf about the sir being fulfilled unexpectedly with the new instance id, the only way to destroy is to run nixops check followed with nixops destroy as the first will update the state with vmId = None which will allow the cancellation of the sir as we don't have a way to compare vm ids anymore.

So, I believe that this PR needs more work to teach nixops to recover cleanly after a termination behavior of stop. The good part is at least it's possible to fake a spot instance termination :)

@AmineChikhaoui
Copy link
Member

(experimenting with some labels to make filtering PRs easy)

@PsyanticY
Copy link
Contributor Author

I think the way you did test the persistent spot is wrong since persistent spot relies on stopping the instance when we are out of capacity and restart it when we have. since the instance is stopped the instance id is not lost and we can recover cleanly given that. Terminating a persistent spot is not supported since if you even try to deploy a a persistent spot with spotInstanceInterruptionBehavior set to terminate ec2 will complain about InvalidParameterCombination:

master..> got (possibly transient) EC2 error '{'Message': 'The terminate InstanceInterruptionBehavior is not supported when requestType is set to persistent.', 'Code': 'InvalidParameterCombination'}', retrying...
master..> got (possibly transient) EC2 error '{'Message': 'The terminate InstanceInterruptionBehavior is not supported when requestType is set to persistent.', 'Code': 'InvalidParameterCombination'}', retrying...
master..> got (possibly transient) EC2 error '{'Message': 'The terminate InstanceInterruptionBehavior is not supported when requestType is set to persistent.', 'Code': 'InvalidParameterCombination'}', retrying...
^Cerror: interrupted

@AmineChikhaoui
Copy link
Member

Ok I think you're right the annoucement mentioned:

Upon restart, the EBS root device is restored from its prior state, previously attached data volumes are reattached, and the instance retains its instance ID.

@AmineChikhaoui AmineChikhaoui merged commit efcca12 into NixOS:master Apr 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants