Skip to content

Commit

Permalink
nix/ssh-tunnel: Don't set addrgenmode to none
Browse files Browse the repository at this point in the history
This partially reverts 9b382b1.

It turns out that
torvalds/linux@cc9da6c itself isn't the
real issue, but just caused the real bug to surface.

After hours of debugging together with Hannes Frederic Sowa, we finally
found what's the culprit here, so the commit I'm partially reverting is
merely a workaround which doesn't fix anything but just causes the bug
to show up less frequently.

As for the issue itself: OpenSSH polls the file descriptors of the tun
devices and as soon as poll returns the descriptor as being writeable,
the data coming from the SSH stream is written to the tun device.

In this particular case the mentioned kernel commit causes router
solicitations to be sent to the tun device, which in theory shouldn't be
a problem either.

However, if one tun device is up earlier than the counterpart, these
solicitation packets are sent to the tun device and get forwarded over
the SSH connection to the other peer, which then tries to write these
packets into the tun device and fails because the interface is down.

Let's illustrate this so it becomes clear:
                                _  ____
 .-------------. Channel A  ___( )(    )___  Channel B .--------------.
 | SSH Client  | --------> (__ Internet ___) --------> |  SSH Server  |
 |     /|\     |             (____)(____)              |      |       |
 |      |      |                                       |     \|/      |
 | Tun A (UP)  |                                       | Tun B (DOWN) |
 `-------------'                                       `--------------'

So we have tunnel device A here which is already up while the tunnel
device B isn't yet. Tun A now gets a router solicitation packed which
the SSH client sends over to the SSH server over Channel A and B.

The SSH server writes the packet into Tun B, but it is down, so the
write() fails with errno EIO (Input/output error) and that in turn
causes the SSH server to shut down its channel, resulting in follow-up
traffic to end up in the middle of nowhere.

While trying to write a patch for OpenSSH to either avoid writing to the
tun device when it's down or ignore EIO, I figured out that this really
is a kernel bug and also a very longlasting one.

Since torvalds/linux@1bd4978 the
following code is at the start of tun_get_user(), which is called when
user space tries to write() to the device:

  if (!(tun->dev->flags & IFF_UP))
    return -EIO;

Unfortunately tun_chr_poll() doesn't take this into account and
designate the device as being writeable even if it's not UP, so the real
fix here should indicate the file descriptor being able to write when
it's really up.

As an analogy, it's a bit like this:

  tun_chr_poll(): "Hey, you can write to the FD, go on :-)"
  openssh: write(...);
  tun_get_user(): "Haha, I lied, you can't!"

In summary: Setting addrgenmode to none isn't addressing this at all and
it only worked in my tests because the addrgenmode has been set quickly
enough to not cause router solicitations to be sent. It's also a kernel
problem, so it's not the resonsibility of NixOps to work around or try
to fix it.

The reason I'm only doing a partial revert is that using iproute2
instead of nettools makes the Nix expression more readable and along
switching to iproute2 I've also chained the commands via && to catch
failures instead of ignoring them.

I also did more comprehensive tests during the last few days and I could
confirm that the workaround didn't prevent the bug from happening at all
but only caused it to be less frequent.

Here is a VM test I wrote to confirm this:

https://gist.github.com/aszlig/5e8c8462d17f787b7f171c34090fd590

So if all goes well, the real fix should land with the upcoming stable
kernel updates.

Thanks again to Hannes for help on debugging this properly plus getting
this fixed soon in the kernel (I did a first patch, but that one wasn't
sufficient enough and he's obviously the expert in this matter).

Signed-off-by: aszlig <aszlig@redmoonstudios.org>
  • Loading branch information
aszlig committed Mar 10, 2017
1 parent 5f01191 commit 4bbceb2
Showing 1 changed file with 0 additions and 1 deletion.
1 change: 0 additions & 1 deletion nix/ssh-tunnel.nix
Expand Up @@ -79,7 +79,6 @@ with lib;

script = let
mkAddrConf = tun: localIP: remoteIP: concatStringsSep " && " [
"ip link set tun${toString tun} addrgenmode none"
"ip addr add ${localIP}/32 peer ${remoteIP} dev tun${toString tun}"
"ip link set tun${toString tun} up"
];
Expand Down

1 comment on commit 4bbceb2

@edolstra
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great commit message 🍻

Please sign in to comment.