nixos/kubernetes: stabilize cluster deployment/startup across machines #56789

Merged
merged 24 commits into NixOS:master from calbrecht:upstream-k8s-refactor on Apr 17, 2019

Conversation

@calbrecht (member) commented Mar 3, 2019

Motivation for this change

I had a tough time getting to know Kubernetes with all the errors in the logs caused by services dying because required certificates were missing (they had not been generated yet), or simply because services depended on other services that had not been started yet.

This PR makes all the Kubernetes-related services start in order, even across machines, and ensures the requirements of each service are met before it starts. The Kubernetes tests previously produced about 28K lines of log and ran for about 4m 20s; they are now down to about 13K lines and 3m 40s. All services start cleanly on the first boot.

Please see the commit messages for further information about the changes.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

From the commit messages (excerpts):

… to protect services from crashing and clobbering the logs when certificates are not in place yet, and to make sure services are activated once certificates are ready.

To prevent errors similar to "kube-controller-manager.path: Failed to enter waiting state: Too many open files", fs.inotify.max_user_instances has to be increased.
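
In NixOS terms this is a sysctl setting; a minimal sketch, with an illustrative value rather than the one chosen in this PR:

{
  # bump the per-user inotify instance limit so the many systemd path units
  # watching certificate files do not exhaust it (value is illustrative)
  boot.kernel.sysctl."fs.inotify.max_user_instances" = 524288;
}
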
… by adding targets and curl wait loops to services, to ensure services are not started before the services they depend on are reachable.

Extra targets cfssl-online.target and kube-apiserver-online.target synchronize starts across machines, and node-online.target ensures docker is restarted and ready to deploy containers only after flannel has negotiated the network CIDR with the apiserver.

Since flannel needs to be started before the addon-manager in order to configure the docker interface, it has to have its own RBAC bootstrap service.

The curl wait loops within the other services exist to ensure that, when a service is started, it is able to do its work immediately instead of clobbering the log with failing conditions.
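
As a rough sketch of what such a wait loop looks like (unit name, certificate paths and address are placeholders, not the exact mkWaitCurl helper from this PR):

systemd.services.kube-scheduler.preStart = ''
  # block until the apiserver answers on /healthz, so the service never
  # starts against an unreachable control plane
  until ${pkgs.curl}/bin/curl -sSf -o /dev/null \
      --cacert ${caFile} --cert ${clientCert} --key ${clientKey} \
      ${apiserverAddress}/healthz; do
    sleep 2
  done
'';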

By ensuring kubernetes.target is only reached after starting the
cluster it can be used in the tests as a wait condition.
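
In a NixOS test this boils down to something like the following sketch (the machine name is illustrative; the real tests live under nixos/tests/kubernetes):

testScript = ''
  # with the ordering above, reaching kubernetes.target means the whole
  # cluster is up, so the test only has to wait for this one unit
  $machine1->waitForUnit("kubernetes.target");
'';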

In kube-certmgr-bootstrap, the mkdir is needed so the service does not fail to start.

The following is the relevant part of systemctl list-dependencies

default.target
● ├─certmgr.service
● ├─cfssl.service
● ├─docker.service
● ├─etcd.service
● ├─flannel.service
● ├─kubernetes.target
● │ ├─kube-addon-manager.service
● │ ├─kube-proxy.service
● │ ├─kube-apiserver-online.target
● │ │ ├─flannel-rbac-bootstrap.service
● │ │ ├─kube-apiserver-online.service
● │ │ ├─kube-apiserver.service
● │ │ ├─kube-controller-manager.service
● │ │ └─kube-scheduler.service
● │ └─node-online.target
● │   ├─node-online.service
● │   ├─flannel.target
● │   │ ├─flannel.service
● │   │ └─mk-docker-opts.service
● │   └─kubelet.target
● │     └─kubelet.service
● ├─network-online.target
● │ └─cfssl-online.target
● │   ├─certmgr.service
● │   ├─cfssl-online.service
● │   └─kube-certmgr-bootstrap.service

… to speed up startup time, because it can be parallelized.
… to prevent errors in the log about missing permissions when the addon manager starts the dashboard.
… within the node join script, since certmgr is taking care of restarting services.
@calbrecht (member, author) commented Mar 3, 2019

cc @johanot @globin @fpletz

@johanot (contributor) commented Mar 4, 2019

@calbrecht Thank you for this! Looks good at first glance, and tests pass.
Will do a comprehensive review later today.

@arianvp (member) commented Mar 4, 2019

It would be nice if the Kubernetes components supported socket-based activation ... then we would get correct startup ordering for free if we used systemd.socket. But I'm not sure whether they do.

@calbrecht (author):
@arianvp there is an open issue in the kubernetes repo about socket-based activation, but it does not apply to starting the core components, I figure.

@johanot (contributor) left a comment:

@calbrecht This was what I had for now :-) Quite a bit actually. Feel free to catch me on IRC if you want to discuss some of it in a more informal fashion.

echo "Restarting flannel..." >&1
systemctl restart flannel
''}
if [ y = $do_restart ]; then
Contributor:

Bash nitpick:
Couldn't this be written as just if [ -s "${certmgrAPITokenPath}" ]; then, and then line 378 can be removed.

Author:

My intention was: if the file is present and not empty before running the node-join script, then restart certmgr; else the systemd path unit will take care of starting certmgr.
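
A hedged sketch of that logic as shell, with illustrative variable names (not the exact node-join script):

do_restart=n
# remember whether a token already existed *before* this join run
[ -s "${certmgrAPITokenPath}" ] && do_restart=y

echo "$token" > ${certmgrAPITokenPath}
chmod 600 ${certmgrAPITokenPath}

if [ y = "$do_restart" ]; then
  # certmgr is already running with the old token, so restart it
  systemctl restart certmgr
fi
# on a first join, the certmgr path unit fires once the token file appears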

Contributor:

Fine by me then.

fi
'')
];
serviceConfig = {
RestartSec = "10s";
TimeoutSec = "500";
RestartSec = "1s";
Contributor:

Do we really want this to try and restart every second, without a StartLimit or anything?

Author:

Removed the restart altogether in 5684034, since the service has a preStart wait loop anyway.

# because the network of docker0 has been changed by flannel.
script = let
docker-env = "/run/flannel/docker";
flannel-date = "stat --print=%Y ${docker-env}";
Contributor:

Urgh. This is not very nice, but neither was my restart-dance, and for now I have no prettier suggestion. :-) It is certainly complicated the way flannel negotiates the pod CIDR. We should get rid of flannel as default CNI in the future, I think.

Author:

Ok, I am not experienced in networking; whatever you say, I am in for it :)

Contributor:

Like I said, I have no better suggestion :-) It just seems a little flaky to rely on timestamps for getting the restart order right. Perhaps the only right solution is to replace Flannel with something else, but that's a toy project for another time.
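
For reference, a hedged sketch of how such a timestamp comparison can be completed (the docker-date helper and the exact condition are illustrative, not necessarily the code merged here):

script = let
  docker-env = "/run/flannel/docker";
  flannel-date = "stat --print=%Y ${docker-env}";
  docker-date = ''date +%s --date="$(systemctl show --property=ActiveEnterTimestamp --value docker)"'';
in ''
  # restart docker only if flannel rewrote its env file after dockerd started,
  # i.e. docker0 still carries the pre-flannel network configuration
  if [ "$(${flannel-date})" -gt "$(${docker-date})" ]; then
    systemctl restart docker
  fi
'';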

apiGroup = "rbac.authorization.k8s.io";
kind = "ClusterRole";
name = "flannel";
systemd.services.flannel-rbac-bootstrap = mkIf (top.apiserver.enable && (elem "RBAC" top.apiserver.authorizationMode)) {
Contributor:

May I ask why you changed this? Flannel doesn't need RBAC permissions if it doesn't use kubernetes as its storage backend, which is why the mkIf previously had a guard on the flannel storage backend option. Furthermore, the addonManager bootstrap addons were used to assign privileges, but you created a separate service - why?

Author:

First, the storageBackend was always "kubernetes", as declared in line 10 of this file. I created this because, in my view, the flannel service needs to start before any addons can be deployed to nodes, and nodes can't be found before the flannel service has configured the network.

Author:

In 6e9037f there is now a new service, kube-addon-manager-bootstrap.service, whose purpose is to have the higher permissions needed to invoke kubectl apply, without conflicting with the mkWaitCurl of kube-addon-manager.service, which does not need the clusterAdmin certs but only the addonManager certs.

Contributor:

I think it is my mistake that I forgot to mkDefault the storageBackend in line 40, but that doesn't change the fact that we should always be able to handle/support the situation where someone decides to do storageBackend = mkForce "etcd". This is why the mkIf on storageBackend is needed, for the same reason that we cannot assume that easyCerts is always true.

Anyway, the new version of the diff looks better. I see now why a new service was needed, because of the order in which things start. :-)
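
For clarity, a hedged sketch of the guard being discussed, with the storage-backend check kept (the option path config.services.flannel.storageBackend is an assumption here):

systemd.services.flannel-rbac-bootstrap =
  mkIf (top.apiserver.enable
        && elem "RBAC" top.apiserver.authorizationMode
        && config.services.flannel.storageBackend == "kubernetes") {
    # ... oneshot kubectl apply of the flannel ClusterRole/ClusterRoleBinding ...
  };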

inherit cert key;
})}

kubectl -s ${top.apiserverAddress} --certificate-authority=${top.caFile} --client-certificate=${top.pki.certs.clusterAdmin.cert} --client-key=${top.pki.certs.clusterAdmin.key} apply -f ${concatStringsSep " \\\n -f " files}
Contributor:

This is why it is nice to have the assignment of RBAC rights isolated in the "bootstrap-addons" loop. We don't want too many individual units to have access to the clusterAdmin certificates, or to run kubectl commands against the cluster. In any case, this unit poses a problem for clusters where easyCerts = false: we cannot reference top.pki.certs without guarding with mkIf top.easyCerts. Preferably, everything easyCerts related should just be declared in pki.nix.

@calbrecht (author), Mar 5, 2019:

Ok, I get the point about access to the clusterAdmin certs and about easyCerts.

I just found, by delaying the startup of kube-addon-manager until it can do its work, that flannel needs its roles earlier.

I also created a different oneshot service for the bootstrapAddons set, to activate it as soon as the apiserver was online, but threw it away because I was confused about whether there would be any "pod creation requests" involved.

The bootstrapAddons confuses me when thinking about the startup order; I'd rather think of it as a bootstrapAccessControl or similar that is run only on the master, right after the kube-apiserver is working - which is the intention of bootstrapAddons, right?

> Preferably, everything easyCerts related should just be declared in pki.nix.

Agreed. Are you thinking of the mkCert calls here as well?

Author:

Removed the bootstrapping in 6e9037f

Contributor:

I think mkCert can stay here, because it doesn't require easyCerts to be true, and it fits well in the same file as the component needing the certs.


systemd.services.kube-apiserver-online = mkIf top.flannel.enable {
Contributor:

Why the mkIf on flannel here? What happens if I disable flannel and go with another CNI? Will dependencies on kube-apiserver-online still work?

Author:

Hmm, I see. The mkIf is currently there because the mkWaitCurl is using the certificates of the flannel client to check the apiserver /healthz path. I decided to use the flannel certs because it seemed to me that they will always be present on the master and on the nodes.

So maybe on the master the mkWaitCurl could be called with the cluster-admin cert, and maybe this will work with the kubelet-client cert on the nodes (not sure whether this works, because the node somehow is not present at this point, but I will try).
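
A hedged sketch of that idea; the cert/key parameter names of mkWaitCurl and the pki attribute names are assumptions based on the partial diff visible here:

preStart = mkWaitCurl ({
    path = "/healthz";
  } // (if top.apiserver.enable then {
    # on the master, the clusterAdmin credentials are always available
    cert = top.pki.certs.clusterAdmin.cert;
    key  = top.pki.certs.clusterAdmin.key;
  } else {
    # on worker nodes, fall back to the kubelet client certificate
    cert = top.pki.certs.kubelet.cert;
    key  = top.pki.certs.kubelet.key;
  }));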

Author:

Resolved in 7323b77#diff-8d468cb36655335308a8eb4c3999b59aR372 by either using the clusterAdmin cert or the kubelet cert.

Contributor:

See my comment here: 7323b77#r32635029

systemd.services.kube-apiserver-online = mkIf top.flannel.enable {
description = "apiserver control plane is online";
wantedBy = [ "kube-apiserver-online.target" ];
after = [ "kube-scheduler.service" "kube-controller-manager.service" ];
Contributor:

I guess what's confusing me here is that there is both a kube-apiserver-online.service and a kube-apiserver-online.target. I think (maybe, if I understood the intention correctly) the latter should be renamed to kube-control-plane-online.target, and that target should contain all the master components: kube-apiserver, kube-controller-manager, kube-scheduler and kube-addon-manager.
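
A hedged sketch of the suggested layout (unit names follow the reviewer's suggestion; the final diff may differ):

systemd.targets.kube-control-plane-online = {
  wantedBy = [ "kubernetes.target" ];
};

# each control-plane unit is wanted by the new target
systemd.services.kube-apiserver.wantedBy          = [ "kube-control-plane-online.target" ];
systemd.services.kube-controller-manager.wantedBy = [ "kube-control-plane-online.target" ];
systemd.services.kube-scheduler.wantedBy          = [ "kube-control-plane-online.target" ];
systemd.services.kube-addon-manager.wantedBy      = [ "kube-control-plane-online.target" ];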

Author:

Ok, I will rename it, but I'd prefer not to include the addon-manager and to start that one after node-online.target, for the reasons I gave above.

Author:

Renamed in ff91d58

@@ -72,7 +72,7 @@ in
systemd.services.kube-addon-manager = {
description = "Kubernetes addon manager";
wantedBy = [ "kubernetes.target" ];
after = [ "kube-apiserver.service" ];
after = [ "kube-apiserver-online.target" "node-online.target" ];
Contributor:

The addon manager doesn't require node-online, does it? Isn't it just about submitting something to the apiserver? Whatever is submitted will be scheduled later, when one or more nodes come online.

Author:

The reason I made the addon-manager depend on node-online.target is that I wanted to make sure flannel had already configured and restarted docker by the time new containers are requested to be started.
I thought that if they were requested earlier, they would have to be destroyed and restarted to actually be connected to the newly created subnet. And if this is a non-issue, then it is just a matter of my preference not to start services that cannot function correctly and therefore just spam the logs with warnings about not being able to find the node.

Author:

I mean, the addon-manager cannot do anything if there are no nodes online. Or do I lack a deeper understanding here?

Author:

It activates now within kube-control-plane-online.target ff91d58#diff-e6ede1705ae680d8146d5db307a2a392R74

Contributor:

All the addon manager does is submit stuff to the apiserver (etcd, ultimately). If some of that "stuff" requires scheduling (i.e. pods), the kube-scheduler and kube-controller-manager will handle it from there. If there are no nodes online yet, the pods will simply end up in status Pending until at least one node shows up, but this is not something the addon manager should wait for. :-) All it requires is the apiserver. I see your new version of the diff implements that. Fine.

@@ -354,5 +405,16 @@ in
};
})

{
systemd.targets.kubelet = {
wantedBy = [ "node-online.target" ];
Contributor:

Could it be renamed to kube-node-online.target? Just node-online.target can be confusing on multi-purpose hosts. Also, I think that kube-proxy should be part of this target.

Author:

Sure, I will rename it. And for kube-proxy, I will check.

Author:

Resolved in e148cb0 and the proxy in ff91d58

Contributor:

Looks good!

@@ -272,11 +272,30 @@ in
###### implementation
config = mkMerge [

(mkIf cfg.enable {
(mkIf cfg.enable (let
apiserverPaths = [
Contributor:

I fear there might be an issue regarding compatibility with easyCerts = false here. It should be allowed to run a less secure cluster with AlwaysAllow as the authorizer. I think we should consider putting some of these path lists in pki.nix, to make it possible to opt out of some of these path dependencies.
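
To make the idea concrete, a hedged sketch of the shape such a move into pki.nix could take (the option names and the easyCerts condition are assumptions, not the actual ff382c1 diff):

systemd.services.kube-apiserver = mkIf top.easyCerts {
  # do not even try to start before all certificates exist
  unitConfig.ConditionPathExists = apiserverPaths;
};
systemd.paths.kube-apiserver = mkIf top.easyCerts {
  wantedBy = [ "kube-apiserver.service" ];
  # (re)activate the apiserver once the certificates appear
  pathConfig.PathExists = apiserverPaths;
};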

Author:

Yes, I understand. Will move all occurrences into pki.nix.

Author:

Moved the remaining path units and usage in ff382c1

Contributor:

Looks good. Will have to test without easyCerts to see whether everything is still compatible.

@arianvp (member) commented Mar 5, 2019

There's also this: https://www.freedesktop.org/software/systemd/man/systemd-socket-proxyd.html which retrofits socket activation onto any non-socket-activated service.
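
For illustration, a minimal sketch of what the systemd-socket-proxyd approach could look like in NixOS terms (unit names and ports are made up for the example, not taken from this PR):

systemd.sockets.kube-apiserver-proxy = {
  wantedBy = [ "sockets.target" ];
  socketConfig.ListenStream = 6444;   # clients connect here; the proxy starts on demand
};
systemd.services.kube-apiserver-proxy = {
  requires = [ "kube-apiserver.service" ];
  after    = [ "kube-apiserver.service" ];
  serviceConfig.ExecStart =
    "${pkgs.systemd}/lib/systemd/systemd-socket-proxyd 127.0.0.1:6443";
};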

@@ -73,6 +73,18 @@ let
};
};

mkWaitCurl = { address ? cfg.apiserverAddress, sleep ? 2, path ? "", args ? "-o /dev/null",
Author:

Just noticed by accident: when deploying a different deployment with nixops, nix complains about cfg.apiserverAddress being used but not defined, probably because it has no default value in the options. Have to work around that somehow.

Contributor:

@calbrecht apiserverAddress is mkDefault'ed to the value of masterAddress here: https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/services/cluster/kubernetes/default.nix#L286 . The reason is the assumption that you have only one master node, but it should still be possible to override this. In our production cluster, for example, we have 3 master nodes and a load-balanced address that divides requests between them all. In that case, we want a setup where masterAddress != apiserverAddress.
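
A hedged sketch of that kind of override (addresses are illustrative):

services.kubernetes = {
  masterAddress = "master-1.example.org";
  # load balancer in front of three apiservers; overrides the mkDefault
  apiserverAddress = "https://k8s-api.example.org:6443";
};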

@calbrecht (author), Mar 7, 2019:

@johanot Ok, so you have the cfssl service running on the masterAddress, right?

However, the issue here was that I defined the kube-control-plane-online service and target within apiserver.nix but outside of the mkIf cfg.enabled set, which probably led to this function declaration being evaluated in every other nix-build. Fixed this in 154356d.

@globin merged commit 7dc6e77 into NixOS:master on Apr 17, 2019
@globin (member) commented Apr 17, 2019

Looks good!

@johanot (contributor) commented Apr 22, 2019

@calbrecht Sorry, I went kind of offline on this one and didn't help see it through. Anyway, now I'm "awake" again :-)

It seems, unfortunately, that the DNS and RBAC test cases stopped working on master after merging this. It looks like kube-control-plane-online-pre-start is hanging. Have you observed this?

@johanot (contributor) commented Apr 22, 2019

kubectl get --raw=/healthz prompts for username, because the kubeconfig passed to kube-control-plane-online.service has errors.

@calbrecht (member, author) commented Apr 24, 2019

@johanot, what errors do you mean? When reverting the kubernetes version upgrades back to 1.13.4, the tests run successfully. Does this have to do with restoring the --user and --password flags in the changes from 1.13.4 -> 1.13.5?

@johanot (contributor) commented Apr 24, 2019

@calbrecht Looks like your assumption is right. It does indeed break with 1.13.4 -> 1.13.5. It is sometimes baffling to me what is apparently allowed in a minor kubernetes release. I mean... it clearly breaks the API, but that's somehow ok. :) It just confirms to me that we need to be careful even with a revision bump of kubernetes.

The "error" I was talking about is that the client cert appears to be null. For some reason, this was ok in 1.13.4, but not in 1.13.5. I understand the prompt for the username, since the kubeconfig doesn't contain proper user identification. Will you look into this, or shall I?

@calbrecht (author):
@johanot it would be great if you could look into it, because I am unsure how to proceed: whether to generate a new cert or take an existing one, and then whether it's ok to rely on the PKI within default.nix, etc.

johanot pushed a commit to johanot/nixpkgs that referenced this pull request Aug 24, 2019
…factor"

This reverts commit 7dc6e77, reversing
changes made to bce47ea.
srhb pushed a commit to johanot/nixpkgs that referenced this pull request Sep 4, 2019
…factor"

This reverts commit 7dc6e77, reversing
changes made to bce47ea.

Motivation for the revert in NixOS#67563
@nixos-discourse:

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/kubernetes-network-malfunction-after-upgrading-to-19-09/4620/1

dtzWill pushed a commit to dtzWill/nixpkgs that referenced this pull request Jan 27, 2020
…factor"

This reverts commit 7dc6e77, reversing
changes made to bce47ea.

Motivation for the revert in NixOS#67563

(cherry picked from commit 00975b5)
@calbrecht deleted the upstream-k8s-refactor branch on January 4, 2024