
hetzner: do not reboot instance after partitioning failed #433

Closed
wants to merge 1 commit

Conversation

danbst
Contributor

@danbst danbst commented May 12, 2016

There may be errors during the partitioning phase (incorrect configuration, problems with devices, etc.), but they are rarely critical enough to require rebooting the rescue shell.
With this change, the time to redeploy after a partitioning error is reduced by one minute.

I still run the bootstrapping phase, because the nixpart version could have changed.

Also, I'd like to ask the nixops devs:

  1. Is this the correct place to define a new state?
  2. The bootstrapping phase doesn't remove the old bootstrapping tools, it just overwrites them. I see there is a check on whether there is enough space, so with this change a redeploy after an error can fail on very small instances (where the total space in the rescue tmpfs is less than 700 MB). However, this only affects the "partitioning exception" scenario, so no tests would be affected.
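The free-space concern above could be checked up front. A rough sketch (the 700 MB threshold comes from this comment; the mount point and helper names are illustrative, not the actual nixops code):

```python
import os

# Rough sketch: verify the rescue tmpfs has enough free space before
# re-running the bootstrapping phase. The 700 MB figure comes from the
# comment above; the mount point path is illustrative.
REQUIRED_BYTES = 700 * 1024 * 1024

def tmpfs_free_bytes(path="/"):
    # statvfs reports free blocks available to unprivileged users
    # (f_bavail) times the fundamental block size (f_frsize).
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def enough_space_for_bootstrap(path="/"):
    return tmpfs_free_bytes(path) >= REQUIRED_BYTES
```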

part of #429

There may be errors during the partitioning phase (incorrect configuration, problems
with devices, etc.), but they are rarely critical enough to require rebooting the rescue shell.
With this change, the time to redeploy after an error is reduced by one minute.
@rbvermaa
Member

@aszlig Do you have any opinions on this?

@@ -591,7 +597,11 @@ def create(self, defn, check, allow_reboot, allow_recreate):

if not self.vm_id:
self.log("installing machine...")
self.reboot_rescue(install=True, partitions=defn.partitions)
Member

@aszlig aszlig May 18, 2016


The same can be achieved by just checking for the RESCUE state here.
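A minimal sketch of the state check aszlig suggests, i.e. skipping the rescue reboot when the machine is already in the RESCUE state while still re-running bootstrapping (class, state names, and methods here are illustrative stand-ins, not the real nixops backend API):

```python
# Illustrative sketch: skip the reboot into the rescue system when the
# machine is already there (e.g. after a failed partitioning attempt).
# State names and methods are hypothetical, not the actual nixops code.

UP, RESCUE = "up", "rescue"

class HetznerLike:
    def __init__(self, state=UP):
        self.state = state
        self.rebooted = False
        self.bootstrapped = False

    def reboot_rescue(self, install=False, partitions=None):
        # Only reboot into rescue if we are not already in it.
        if self.state != RESCUE:
            self.rebooted = True
            self.state = RESCUE
        # Bootstrapping still runs either way, because the nixpart
        # version may have changed since the last attempt.
        self.bootstrap()

    def bootstrap(self):
        self.bootstrapped = True

m = HetznerLike(state=RESCUE)
m.reboot_rescue(install=True)
print(m.rebooted)  # False: the reboot was skipped, bootstrap still ran
```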

Contributor Author


I was too cautious here =)

@aszlig
Member

aszlig commented May 18, 2016

I originally did this because sometimes when blivet failed it could end up in an unrecoverable state, so I'm not sure whether it makes sense to reuse the rescue system from that state. Maybe it would make sense to add a flag, or even just emit a warning telling the user what to do if they want to throw away the current state.

@danbst
Contributor Author

danbst commented May 20, 2016

@aszlig Thanks for the review!

1. > blivet has failed it could end up in an unrecoverable state

   What kind of unrecoverable state? Did a rescue reboot fix it? I ask because I've hit an "unrecoverable state" that even a rescue reboot didn't fix (caused by manually configured LVM over RAID). Even `nixops destroy` didn't fix it.

2. > or even just emit a warning telling the user what to do if the he/she wants to throw away the current state

   I do like this approach!

hetznerIron> installing machine...
hetznerIron> rescue system is present and accessible
hetznerIron> partitioning...
hetznerIron> Exception: partitioning failed! Try one of these before redeploy:
  1) Check your kickstart script.
  2) Run "destroy" command to restart rescue system.
  3) Login to rescue with password XXXX and try to fix problems manually.

3. This isn't related to my changes, but there is an easy way to brick the NixOps rescue mode: just press Ctrl-C during the rescue reboot.

hetznerIron> installing machine...
hetznerIron> rebooting machine ‘hetznerIron’ (MY_IP) into rescue system
hetznerIron> sending reboot command... done.
hetznerIron> waiting for rescue system...Connection to MY_IP closed by remote host.
[down]....^C
$

Now NixOps cannot reboot the instance (wrong password). I think this is because of these lines: https://github.com/danbst/nixops/blob/7c60f056f2438c58796c9dc5dd4bc1030bfddcdf/nixops/backends/hetzner.py#L357-L358

        self._wait_for_rescue(self.main_ipv4)
        self.rescue_passwd = rescue_passwd

I can file an issue for this if the reboot-on-bootstrap-problems behavior is kept.
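The Ctrl-C race described above stems from persisting the new password only *after* the wait completes, so an interrupt in between leaves the old password in the deployment state. A hedged sketch of the reordering that would avoid it (the class and function names are illustrative, not the real nixops backend):

```python
# Illustrative sketch of the Ctrl-C race: if the new rescue password is
# stored only after waiting for the rescue system, an interrupt during
# the wait strands the state file with a stale (now wrong) password.
# Names are hypothetical, not the actual nixops code.

class StateFile:
    """Stands in for nixops' persisted deployment state."""
    def __init__(self):
        self.rescue_passwd = "old-password"

def reboot_to_rescue(state, new_passwd, wait):
    # Persist the password *before* the interruptible wait, so a
    # Ctrl-C cannot leave a stale password behind.
    state.rescue_passwd = new_passwd
    wait()

def interrupted_wait():
    # Simulates the user pressing Ctrl-C while waiting for rescue.
    raise KeyboardInterrupt

state = StateFile()
try:
    reboot_to_rescue(state, "XXXX", wait=interrupted_wait)
except KeyboardInterrupt:
    pass
print(state.rescue_passwd)  # "XXXX": the new password survived the interrupt
```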

@domenkozar
Member

Maybe it could just be a flag to enable this?

@danbst
Copy link
Contributor Author

danbst commented Dec 6, 2016

@domenkozar unfortunately, I have no free Hetzner machines to further maintain this PR.

@danbst danbst closed this Dec 6, 2016

4 participants