Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow specifying arbitrary SSH configuration for nodes #1024

Closed
wants to merge 12 commits into from

Conversation

grahamc
Copy link
Member

@grahamc grahamc commented Oct 21, 2018

I'd like to deploy to nodes only available behind a bastion server. This patch letsyou do this:

  my-node.deployment.sshConfigOptionsFile = ./ssh-config;

and ./ssh-config:

ProxyCommand ssh 10.5.3.111 -A -W %h:%p
ForwardAgent yes
LogLevel VERBOSE

Initial testing shows this works, I'll do further testing tomorrow.

@grahamc
Copy link
Member Author

grahamc commented Oct 21, 2018

It took one more patch, but I can now provision and deploy nodes behind a bastion. Further testing required.

@grahamc
Copy link
Member Author

grahamc commented Oct 21, 2018

I can now force reboot nodes and have it wait properly for the machine to be up. A couple more TCP waiting patches to make.

@coretemp
Copy link

Are you sure that this also handles the ssh-for-each case? If so, please merge. If not, please try again :)

@grahamc
Copy link
Member Author

grahamc commented Dec 10, 2018

$ nixops ssh-for-each hostname
mac7........> mac7
mac6........> mac6
mac5........> mac5
mac4........> mac4
mac3........> mac3
mac2........> mac2
mac1........> mac1
mac8........> mac8

@AmineChikhaoui AmineChikhaoui requested review from aszlig and removed request for aszlig December 11, 2018 14:58
@grahamc
Copy link
Member Author

grahamc commented Dec 12, 2018

OK I think this is good to go:

$ nixops deploy --force-reboot --include mac1 
building all machine configurations...
mac1........> copying closure...
personal> closures copied successfully
mac1........> rebooting...
mac1........> waiting for the machine to finish rebooting...[down]channel 0: open failed: connect failed: Connection refused
stdio forwarding failed
ssh_exchange_identification: Connection closed by remote host

mac1........> could not connect to ‘root@192.168.2.101’, retrying in 1 seconds...
mac1........> activation finished successfully
personal> deployment finished successfully

Copy link
Member

@aszlig aszlig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the TCP ports check to use SSH instead is something of a mixed bag, so while this certainly makes handling of jump hosts easier, the timeout argument to run_command is only the connection timeout, which could possibly hang ssh on multiple occasions.

So I'd probably add a few tests for the none backend with flaky networking to see whether that really is robust enough. The only race condition I know for sure is when the machine reboots, but I remember seeing SSH hangs even during boot, eg. in something like this, which is a similar scenario.

nixops/ssh_util.py Outdated Show resolved Hide resolved
nixops/ssh_util.py Outdated Show resolved Hide resolved
@@ -650,7 +651,9 @@ def _wait_stop(self):
"""
self.log_start("waiting for system to shutdown... ")
dotlog = lambda: self.log_continue(".") # NOQA
wait_for_tcp_port(self.main_ipv4, 22, open=False, callback=dotlog)
while self.try_ssh():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit racy, because the timeout for run_command is only the SSH connect timeout, so if the machine is no longer reachable after the key exchange try_ssh could hang for minutes.

Btw. the same problem could occur in #857, hence cc @nh2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we'd switch to Python 3.x (I think it's long overdue), we could use subprocess.run with the timeout argument for logged_exec and in SSHMaster, but in the interim it's still possible in Python 3.4 and lower using wait and kill. We already have a timeout argument, so I'd either rename the argument or add another command_timeout argument. Note that we can't simply get rid of the connect timeout and just use the command timeout, because there might be long-running operations, like switching to a new configuration.

nixops/backends/hetzner.py Outdated Show resolved Hide resolved
@asymmetric
Copy link
Contributor

@grahamc are you still planning on working on this?

@grahamc
Copy link
Member Author

grahamc commented Jun 8, 2019

The commits since 0192a61 address almost all the feedback from @aszlig, but I'm not sure what to do about the timeout you raise.

Additionally, I no longer require this PR and am actually not even able to really test it as I don't have any bastion use cases.

@dhess
Copy link
Contributor

dhess commented Jun 8, 2019

I'll give this a try sometime in the next few weeks. I plan to put my new deployments behind a bastion host.

@grahamc grahamc requested a review from aszlig June 8, 2019 18:41
@dhess
Copy link
Contributor

dhess commented Jun 11, 2019

The bastion functionality works perfectly, so that's very nice, as I couldn't get this to work without the patch (NixOps hangs on waiting for SSH...).

However, I did have one case where nixops reboot did not detect that the host had come back up, so this seems to confirm @aszlig's comment about the potential race condition. (I have had at least one other reboot work fine, however.)

@peterhoeg
Copy link
Member

I'd love this functionality as well. Good to merge with @dhess's blessing?

@dhess
Copy link
Contributor

dhess commented Sep 3, 2019

Unfortunately, I think it's not ready. nixops reboot frequently loses track of the rebooted host.

I may have time to look into a fix for this in the next few weeks.

@grahamc
Copy link
Member Author

grahamc commented Mar 26, 2020

Hello!

Thank you for this PR.

In the past several months, some major changes have taken place in
NixOps:

  1. Backends have been removed, preferring a plugin-based architecture.
    Here are some of them:

  2. NixOps Core has been updated to be Python 3 only, and at the
    same time, MyPy type hints have been added and are now strictly
    required during CI.

This is all accumulating in to what I hope will be a NixOps 2.0
release
. There is a tracking issue for that:
#1242 . It is possible that
more core changes will be made to NixOps for this release, with a
focus on simplifying NixOps core and making it easier to use and work
on.

My hope is that by adding types and more thorough automated testing,
it will be easier for contributors to make improvements, and for
contributions like this one to merge in the future.

However, because of the major changes, it has become likely that this
PR cannot merge right now as it is. The backlog of now-unmergable PRs
makes it hard to see which ones are being kept up to date.

If you would like to see this merge, please bring it up to date with
master and reopen it
. If the or mypy type checking fails, please
correct any issues and then reopen it. I will be looking primarily at
open PRs whose tests are all green.

Thank you again for the work you've done here, I am sorry to be
closing it now.

Graham

@grahamc grahamc closed this Mar 26, 2020
@wmertens
Copy link
Contributor

wmertens commented Jul 6, 2020

😆 Fun to see the conversation between you and you @grahamc ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants