Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable nscd caching #50316

Merged
merged 8 commits into from Dec 12, 2018
Merged

Disable nscd caching #50316

merged 8 commits into from Dec 12, 2018

Conversation

arianvp
Copy link
Member

@arianvp arianvp commented Nov 13, 2018

Motivation for this change

Nscd caching was interferring with systemd's nss module. Disable caching as
caching is not the reason why we use nscd. We use it to resolve nss modules correctly. See commit message for more
detailed information.

Fixes #50273

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Fits CONTRIBUTING.md.

@arianvp arianvp force-pushed the fix-dynamic-user branch 3 times, most recently from 67b7153 to f667a18 Compare November 13, 2018 16:15
@arianvp arianvp changed the title Fix dynamic user Disable nscd caching Nov 13, 2018
@arianvp
Copy link
Member Author

arianvp commented Nov 13, 2018

Supersedes #50042 if we also disable host caching

@arianvp
Copy link
Member Author

arianvp commented Nov 13, 2018

CC @Mic92 . We could also replace it with unscd while we're at it.

@arianvp
Copy link
Member Author

arianvp commented Nov 13, 2018

I haven't actually tried setting enable-cache no but just took @dezgeg 's word for it that it only works if we set ttls to 0 . We do the same when we enable sssd. I also elaborated this in the commit message but haven't actually tested this hypothesis. I could also try fully disabling the caches.

@Mic92
Copy link
Member

Mic92 commented Nov 13, 2018

I think we should first disable negative-time-to-live caches for hosts:

negative-time-to-live hosts 0

This should fix problems people have when they cannot reach a site, when the hostname was resolved while having no connectivity.
And then test unscd for a while. It could have even funnier bugs.

@lheckemann
Copy link
Member

I think this is worth mentioning in the release notes.

@peterhoeg
Copy link
Member

I think we should first disable negative-time-to-live caches

We might as well disable the positive TTL too and say that nscd is not for caching. Whenever something unexplained happens with name resolution, my first step is always just to restart nscd which solves it nine times out of 10.

@edolstra
Copy link
Member

Hm, what's the advantage of disabling caching of positive results?

@arianvp
Copy link
Member Author

arianvp commented Nov 14, 2018

Basically I wanted to be 100% sure nothing fishy is happening. I disabled everything to be 100% sure no funny stuff is happening. Systemd project acknowledged that they are not nscd-aware and they might look into fixing that. This means there might be more subtle bugs lurking in the presence of caching. I'm not sure if the positive cache is invalidated when a systemd unit with Dynamic users stops. If it doesn't, this might cause funny bugs when another systemd service with Dynamic users recycles that uid and discovers there's already a name associated to the uid.

user lookups are basically instantaneous and the only moment we would want to cache them is in the presence of LDAP to improve performance. But sssd already does its own caching and so we already disable nscd caching when sssd is enabled in NixOS and Red Hat (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/usingnscd-sssd)

So the only moment we seem to have caching is when the passwd database is backed by a file. I think this is a bit overkill and will cause more issues than gains.

Edit: @edolstra oh I think I now just realised you were talking about caching of hosts, not the other caches.

@arianvp
Copy link
Member Author

arianvp commented Nov 14, 2018

I've disabled negative caching for hosts for now. and left positive caching in.

I'm not very familiar with the release process of NixOS. What should a release notes entry look like, and where do I put it?

@Mic92
Copy link
Member

Mic92 commented Nov 14, 2018

@arianvp
Copy link
Member Author

arianvp commented Nov 14, 2018

Now that sssd and nscd ship basically the same config, is it worth to remove the settings from sssd ?

@Mic92
Copy link
Member

Mic92 commented Nov 14, 2018

@arianvp when they are 100% the same, we can drop the sssd one.

suggested-size group 211
check-files group yes
persistent group no
shared group yes

enable-cache hosts yes
positive-time-to-live hosts 600
negative-time-to-live hosts 5
negative-time-to-live hosts 0
Copy link
Member

@Mic92 Mic92 Nov 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we at least should backport this one, since it can be quiet annoying if websites do not load correctly after connecting to a hotspot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I’ve had this problem for more than a month now, so the release must have it as well.

@peterhoeg
Copy link
Member

Hm, what's the advantage of disabling caching of positive results?

We don't gain anything by having it enabled and disabling it removes one potential source of problems.

<listitem>
<para>
The <literal>nscd</literal> now disables all caching of
<literal>passwd</literal> and <literal>group</litral> databases by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be </literal> the closing tag in group

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nix build -f my-nixpkgs/nixos config.system.build.manual --arg configuration {} to make sure the manual builds :)

@Mic92
Copy link
Member

Mic92 commented Nov 15, 2018

@peterhoeg we gain a dns cache if it is enabled. I don't see how this would cause problems if it caches valid entries. DNS is designed around for those caches and uses TTLs itself to control invalidation.

@arianvp
Copy link
Member Author

arianvp commented Dec 5, 2018

@edolstra I reverted your revert, as I don't think it fixes anything. I'm 100% sure that nscd already invalidates the cache when /etc/resolv.conf is changed, through inotify.

@arianvp
Copy link
Member Author

arianvp commented Dec 5, 2018

After reading the discussion at #42569 perhaps we should only unrevert the commit Eelco did, after we fully disable caching in nscd. I added Eelco's revert to the patchset again ...

I think we should merge this as is, and then in a next PR:

  1. introducde systemd-resolved
  2. Disable positive caching of domains in nscd
  3. Remove all forms of restart triggers / cache invalidations of nscd

@flokli flokli mentioned this pull request Dec 6, 2018
14 tasks
<literal>libnss_systemd.so</literal> module which is used by
<literal>systemd</literal> to manage uids and usernames in the presence
of <literal>DynamicUser=</literal> in systemd services.
The was already the default behaviour in presence of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The was already the default behaviour in presence of
This was already the default behaviour in presence of

Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more doc feedback

@arianvp
Copy link
Member Author

arianvp commented Dec 12, 2018

I have addressed the doc issues. I think this is ready for merge. I have created a follow-up ticket with the rest of the tasks that popped up in this discussion #51911

@@ -4,25 +4,41 @@ paranoia no
debug-level 0

enable-cache passwd yes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to disable caches where we set a zero ttl for both positive and negative? Seems less confusing to me :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As eelco mentioned, this would have a serious performance penalty. so I don't want to change it until #51911 is implemented. I think this is a good compromise where people do not get failed lookups when switching networks, but do also have performance. Until we figure out the resolved business

Copy link
Member Author

@arianvp arianvp Dec 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I misread: See the original commit message why we can't do this:

Note that we can not just put in /etc/nscd.conf:
enable-cache passwd no

As this will actually cause glibc to _not_ forward the call to nscd
at all, and thus never reach the nss modules. Instead we set
the negative and positive cache ttls  to 0 seconds as a workaround.
This way, Glibc will always forward requests to nscd, but results
will never be cached.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this as a comment to nscd.conf to clarify

Systemd provides an option for allocating DynamicUsers
which we want to use in NixOS to harden service configuration.
However, we discovered that the user wasn't allocated properly
for services. After some digging this turned out to be, of course,
a cache inconsistency problem.

When a DynamicUser creation is performed, Systemd check beforehand
whether the requested user already exists statically. If it does,
it bails out. If it doesn't, systemd continues with allocating the
user.

However, by checking whether the user exists,  nscd will store
the fact that the user does not exist in it's negative cache.
When the service tries to lookup what user is associated to its
uid (By calling whoami, for example), it will try to consult
libnss_systemd.so However this will read from the cache and tell
report that the user doesn't exist, and thus will return that
there is no user associated with the uid. It will continue
to do so for the cache duration time.  If the service
doesn't immediately looks up its username, this bug is not
triggered, as the cache will be invalidated around this time.
However, if the service is quick enough, it might end up
in a situation where it's incorrectly reported that the
user doesn't exist.

Preferably, we would not be using nscd at all. But we need to
use it because glibc reads  nss modules from /etc/nsswitch.conf
by looking relative to the global LD_LIBRARY_PATH.  Because LD_LIBRARY_PATH
is not set globally (as that would lead to impurities and ABI issues),
glibc will fail to find any nss modules.
Instead, as a hack, we start up nscd with LD_LIBRARY_PATH set
for only that service. Glibc will forward all nss syscalls to
nscd, which will then respect the LD_LIBRARY_PATH and only
read from locations specified in the NixOS config.
we can load nss modules in a pure fashion.

However, I think by accident, we just copied over the default
settings of nscd, which actually caches user and group lookups.
We already disable this when sssd is enabled, as this interferes
with the correct working of libnss_sss.so as it already
does its own caching of LDAP requests.
(See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/usingnscd-sssd)

Because nscd caching is now also interferring with libnss_systemd.so
and probably also with other nsss modules, lets just pre-emptively
disable caching for now for all options related to users and groups,
but keep it for caching hosts ans services lookups.

Note that we can not just put in /etc/nscd.conf:
enable-cache passwd no

As this will actually cause glibc to _not_ forward the call to nscd
at all, and thus never reach the nss modules. Instead we set
the negative and positive cache ttls  to 0 seconds as a workaround.
This way, Glibc will always forward requests to nscd, but results
will never be cached.

Fixes NixOS#50273
It was the last database that wasn't listed.
@flokli
Copy link
Contributor

flokli commented Dec 12, 2018

LGTM.

@Mic92 Mic92 merged commit 5feba45 into NixOS:master Dec 12, 2018
@Mic92
Copy link
Member

Mic92 commented Dec 12, 2018

I backported @edolstra revert in 25d0880, which fix the biggest pain point we had at the moment. Before backporting a change like this one, I would give a test period on unstable. We might not need to backport at all since the problems with DynamicUser should not affect many people.

@flokli
Copy link
Contributor

flokli commented Feb 4, 2019

@arianvp Had a discussion at FOSDEM about how we (ab)use nscd to pass nss modules, as described in 1d5f4cb.

This seems to be a bad idea, due to glibc falling back to local resolving (with broken nss modules) after very quick timeouts.

Can we get glibc to load nss modules from some global /run/nss-modules path instead, pointing to something in the nix-store?

@Mic92
Copy link
Member

Mic92 commented Feb 4, 2019

@flokli This can break if a program was linked against a different version of glibc
since. This is why we use nscd in the first place.

@arianvp
Copy link
Member Author

arianvp commented Feb 5, 2019

Only in the case where glibc breaks abi right? We do something similar for OpenGL/Vulkan iirc

@flokli
Copy link
Contributor

flokli commented Feb 5, 2019

I opened #55276 about that.

@Mic92
Copy link
Member

Mic92 commented Feb 5, 2019

@arianvp This could make it impossible to mix releases, but not just for OpenGL applications but every application.

flokli added a commit to flokli/nixpkgs that referenced this pull request May 31, 2020
This effectively disables nscd's built-in hosts cache, which turns out
to be erratic in some cases.

We only use nscd these days as a more ABI-neutral NSS dispatcher
mechanism.

Local caching should still be possible with local resolvers in
/etc/resolv.conf (via the `dns` NSS module), or without local resolvers
via systemd-networkd (via the `resolve` nss module)

We don't set enable-cache to no due to
NixOS#50316 (comment).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DynamicUser is broken
10 participants