
IPv6 broken again #93

Closed
michicc opened this issue Jul 14, 2019 · 10 comments

@michicc (Member) commented Jul 14, 2019

IPv6 broke again. Please come back.

@TrueBrain (Member) commented

It will keep breaking every time kubernetes upgrades. Until DigitalOcean supports IPv6 on their LoadBalancer, this is the game we will be playing :( Sorry about that.

Fixed now, should be up and running again!

@michicc (Member, Author) commented Jul 14, 2019

Is there a manual change needed, or would a simple cron restart help?

@TrueBrain (Member) commented

As you might imagine, if a cron restart would fix it, it would already be in place ;)

DigitalOcean's LoadBalancer doesn't support IPv6, so there is a Droplet with an IPv6 address acting as one. It redirects traffic on ports 80/443 to the kubernetes cluster, just like the DO LB would. But for this it needs to know the internal IPs of all the kubernetes nodes. Sadly, these change every time you upgrade a node (as the old one is killed and a new one is created).
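
To make the setup concrete, here is a toy sketch of what such a Droplet does: listen on the public IPv6 address and forward the raw TCP stream to a node's internal IP. The backend addresses and the plain Python/asyncio approach are illustrative assumptions only, not how the Droplet is actually implemented (that is far more likely haproxy/nginx or iptables rules):

```python
# Toy sketch of the "Droplet as IPv6 LoadBalancer" idea: accept IPv6
# connections on port 80 and forward the raw TCP stream to a kubernetes
# node InternalIP. Backend IPs are hypothetical.
import asyncio
import itertools

BACKENDS = ["10.133.10.2", "10.133.10.3"]   # hypothetical node InternalIPs
backend_cycle = itertools.cycle(BACKENDS)

async def pipe(reader, writer):
    # Copy bytes from one side of the connection to the other until EOF.
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle(client_reader, client_writer):
    # Pick the next backend and shuffle bytes in both directions.
    backend = next(backend_cycle)
    node_reader, node_writer = await asyncio.open_connection(backend, 80)
    await asyncio.gather(pipe(client_reader, node_writer),
                         pipe(node_reader, client_writer))

async def main():
    # "::" binds the IPv6 wildcard address; port 443 would be handled the same way.
    server = await asyncio.start_server(handle, "::", 80)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```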

You can make kubernetes call some API whenever a new node appears, and this might be a solution, but it requires some nifty scripting on the IPv6 Droplet. So far I haven't found a clean way that doesn't involve writing a custom API handler :D
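
As a rough sketch of what that scripting could look like (not what is actually running), the official kubernetes Python client can watch node events and keep a list of node InternalIPs that the Droplet then forwards to. It assumes a kubeconfig that can reach the cluster; the backends file path and the "rewrite a file" reload mechanism are made up for illustration:

```python
# Sketch: watch kubernetes node events and keep a list of node InternalIPs
# for the IPv6 Droplet to forward to. Assumes a kubeconfig that can reach
# the cluster; the /etc/ipv6-proxy/backends path is hypothetical.
from kubernetes import client, config, watch

def internal_ips(v1):
    # Collect the InternalIP of every node currently in the cluster.
    ips = []
    for node in v1.list_node().items:
        for addr in node.status.addresses:
            if addr.type == "InternalIP":
                ips.append(addr.address)
    return sorted(ips)

def write_backends(ips):
    # Hypothetical hook: rewrite the proxy's backend list so it can reload.
    with open("/etc/ipv6-proxy/backends", "w") as f:
        f.write("\n".join(ips) + "\n")

def main():
    config.load_kube_config()   # or load_incluster_config() when run in-cluster
    v1 = client.CoreV1Api()
    current = internal_ips(v1)
    write_backends(current)
    # Any node add/remove/update: recompute the full list and compare.
    for _event in watch.Watch().stream(v1.list_node):
        new = internal_ips(v1)
        if new != current:
            current = new
            write_backends(current)

if __name__ == "__main__":
    main()
```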

Updates are rare and far between, but possibly I could add something that informs me whenever the Droplet loses its connection to a kubernetes node, and emails me or something. Will look into that.

@LordAro (Member) commented Jul 27, 2019

Might as well reuse the last issue. It's gone sad again :(

@LordAro reopened this Jul 27, 2019
@LordAro (Member) commented Jul 27, 2019

Though maybe this is a slightly different issue - the website itself is still fine on IPv6, but downloads (via proxy) are broken

@TrueBrain (Member) commented Jul 27, 2019

One of the two nodes we run on is down (for some reason). The balancer should switch to the other node, but clearly it did not. Cycling the node now should at least fix the current issue, and I will investigate why it didn't fail over to the other node.

@TrueBrain (Member) commented Jul 27, 2019

The problem still isn't really resolved. One of the two nodes is failing 50% of the healthchecks, and the logs give no indication why. IPv4 traffic is now routed via the working node; IPv6 is still a bit touch-and-go.

Going to disable IPv6 for a bit to get some further information on the issue. Should be back within 10 minutes or so.

Edit: and it is enabled again; still with degraded performance

@TrueBrain (Member) commented

Further investigation shows that it is something deep in kubernetes. We were running on two nodes; now on three. The third node shows exactly the same issue as the second: kube-proxy is dropping connections randomly. The first, however, is working just fine.

I tried various things, but nothing seems to change the situation.

I have now degraded both LoadBalancers (IPv4 and IPv6) to only use the first node for kube-proxy. This seems to be working and stable. I reached out to DigitalOcean to see if they know what is going on.

If I cannot find a solution, I will upgrade kubernetes to 1.14 next week, in the hope that fixes it.

To be continued!

@TrueBrain (Member) commented

More updates, as updates are fun:

Turns out the CNI (flannel) lost its way. One node has a subnet the others don't know about, and the others have one that nobody knows about. So any traffic that needs to go to the first node from any other node gets lost. Luckily the first node does have a full view of everything, which is also why traffic currently arrives where it should.

I updated my ticket with DigitalOcean. I am going to give them some time to look into this too, as it is a problem with the managed service they deliver (and not my/our mistake, basically). Hopefully they know how to resolve this cleanly; otherwise I will rebuild the cluster (which only takes ~15 minutes, so that is okay).

At least I now understand the issue. Just no clue how/what caused it. Hopefully DO can answer that.

@TrueBrain (Member) commented Jul 27, 2019

The problem is resolved. Final recap:

We run Kubernetes on 2 nodes. On the 25th of July, at around 03:30 UTC, the host on which one of the nodes runs had hardware issues, and DigitalOcean wanted to migrate all Droplets away from it. This is pretty common, and what I expect of a managed service. During this migration, the node crashed and rebooted itself. Kubernetes has no problems losing nodes, and recovers nearly instantly.

During the reboot, flannel assigned a new IP range to this node (most likely because a file on disk got corrupted). But docker on the node was still configured for the old range. As a result, services created on that node were bound to the wrong IP. Even reboots didn't fix this.
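
For anyone wanting to check for this kind of drift on their own nodes, a rough diagnostic sketch: compare the subnet flannel wrote to disk with the subnet docker's default bridge is actually using. The file path and bridge name below are the common defaults and may not match how DigitalOcean wires flannel into docker:

```python
# Rough diagnostic: does docker's bridge subnet still match the subnet
# flannel thinks this node owns? Paths/names are the common defaults
# (/run/flannel/subnet.env, the "bridge" network) and may differ per setup.
import ipaddress
import json
import subprocess

def flannel_subnet(path="/run/flannel/subnet.env"):
    # flannel writes e.g. FLANNEL_SUBNET=10.244.1.1/24 for the range this node owns.
    with open(path) as f:
        for line in f:
            if line.startswith("FLANNEL_SUBNET="):
                return line.strip().split("=", 1)[1]
    return None

def docker_bridge_subnet(network="bridge"):
    # "docker network inspect" returns a JSON array with the IPAM config.
    out = subprocess.check_output(["docker", "network", "inspect", network])
    cfg = json.loads(out)[0]["IPAM"]["Config"]
    return cfg[0]["Subnet"] if cfg else None

def same_network(a, b):
    return (a is not None and b is not None and
            ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network)

if __name__ == "__main__":
    fl, dk = flannel_subnet(), docker_bridge_subnet()
    print(f"flannel thinks this node owns: {fl}")
    print(f"docker bridge is configured as: {dk}")
    if not same_network(fl, dk):
        print("MISMATCH: docker is still using an old/different subnet")
```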

Initially the other node was rebooted, as that was the one showing issues. As it turns out, all connections that came in via the first node were routed just fine: it knew its own IP, and could route to the IP of the other node. The other way around, however, was not working. As a result, the healthy node appeared unhealthy, and vice versa.

After a long tracing session (I'm so happy most docker images have curl or wget available; please always add either of them to your Dockerfiles, even if it is just wget via busybox!), it was found that kube-proxy was not forwarding traffic correctly. This led to the conclusion above: flannel and docker were no longer playing nice.

As a temporary mitigation, a third node had already been added to the cluster. This heavily reduced the number of errors, as the odds of hitting a stale connection went from 75% to 22% (I won't bore you with the details of why an additional node caused this). Later we simply told the LoadBalancers to ignore everything but the first (unhealthy) node. At least that node could always route the traffic correctly.

As a permanent solution, the unhealthy node was terminated. Kubernetes recovers from this easily, and all traffic is moving again as it should. The LoadBalancers now route their traffic via the other two nodes. Those two nodes do know how to reach each other, and everything is working as expected again.

Attached is an image showing how the healthchecks on the LoadBalancers were bouncing around: the top line is the unhealthy node, and the other two lines are the two other nodes.
[image: LoadBalancer healthcheck graph]

Any other outages or problems, let me know!
