Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #2162: use getaddrinfo instead of curl to preload NSS #2224

Merged
merged 1 commit into from Jun 12, 2018
Merged

Fix #2162: use getaddrinfo instead of curl to preload NSS #2224

merged 1 commit into from Jun 12, 2018

Conversation

yorickvP
Copy link
Contributor

Currently, a curl download of an invalid domain is used to trigger NSS library preload (for the sandbox).
This is used without a trailing . (and url strips out the trailing .s), meaning this domain could resolve. In my case, it would take 1.5 minutes to time out.
Hopefully this retains the preloading functionality while removing the delays.

@yorickvP
Copy link
Contributor Author

@vcunat @copumpkin @dtzWill

@copumpkin
Copy link
Member

How sure are you that this is equivalent? I really should've constructed a test when I first put together the "fix" 😄but now I can't remember the exact circumstances with it.

I think it has something to do with turning sandboxes on, while using builtins.fetchurl with an uncached value. I can try to dig up more specifics.

In principle though, this looks pretty good and seems better than what I did.

@yorickvP
Copy link
Contributor Author

Well, I can't reproduce the bug with my patch.
I tried
nix-build '<nix/fetchurl.nix>' --argstr url https://example.com --argstr sha256 e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 --option sandbox true

@copumpkin
Copy link
Member

I guess I just mean you were seeing the bug without your patch and without the curl invocation? If so, that sounds great.

@yorickvP
Copy link
Contributor Author

No, I didn't try reproducing the bug without the patch. I'll try.
meanwhile, better test:
nix-build '<nix/fetchurl.nix>' --argstr url https://example.com?$(cat /dev/urandom | tr -dc 'a-z0-9' | head --bytes 4) --argstr sha256 08y4734bm2zahw75b16bcmcg587vvyvh0n11gwiyir70divwp1rm --option sandbox true

@yorickvP
Copy link
Contributor Author

I also can't reproduce this without the preloadNSS function.

@copumpkin
Copy link
Member

Oh well, let's just merge and see.

@copumpkin copumpkin merged commit 169e147 into NixOS:master Jun 12, 2018
@yorickvP yorickvP deleted the getaddrinfo-preload branch June 12, 2018 13:18
@dtzWill
Copy link
Member

dtzWill commented Jun 12, 2018

Details on the original issue (for preloadNSS) would be useful.

I'm concerned about this fix--there's a reason networking libraries and such have caches for DNS lookups and perform these operations asynchronously...

Notably getaddrinfo is a blocking call, and is essentially unbounded in how long it might take...
sending queries over the network if can't be resolved locally. Delays of 5 seconds or more are not unusual. This seems unfortunate price to pay when we don't even want the DNS lookup but are hoping that indirectly achieves some other ends...?

I don't have a great answer for what should be done instead, as it's unclear what this should be solving anyway.

I trust there is, or was, an issue that we should fix but:

  • fetchurl isn't the only builtin
  • this will likely send packets on every nix invocation, sometimes taking significant time if a resolver is missing or slow (among other reasons, Google around)

If we can identify what the problem is, perhaps we can find a better way to solve it...?

(If not it seems this should just be deleted, and either everything will be fine or it will identify what needs fixing)

@copumpkin
Copy link
Member

copumpkin commented Jun 12, 2018

Not sure why I didn't file an issue at the time, but here's some more context I dug up from some IRC logs:

[2018-01-23T13:06:06-0500] <copumpkin> niksnut: still curious about the other thing I was asking the other day: why is builtin:fetchurl even run in a child if it's purely nix code? it seems like we can trust that it'll do the right thing
[2018-01-23T13:06:19-0500] <copumpkin> can't it just enqueue a download in the parent process and be done with it?
[2018-01-23T13:06:34-0500] <copumpkin> that would avoid this issue and maybe simplify the whole thing
[2018-01-23T13:13:12-0500] <copumpkin> niksnut: I don't see anything in the NSS machinery to preload stuff, except by actually performing a resolution. I could ask gethostbyname("nixos.org") at startup and not pay attention to the result, I suppose :P
[2018-01-23T13:20:21-0500] <dtz> aw man I remember doing that WAY back to get some stuff working on iOS lol
[2018-01-23T13:21:16-0500] <dtz> http://www.saurik.com/id/3 <-- I'm the "Will Dietz" referred to lol
[2018-01-23T13:21:29-0500] <dtz> oh I suppose I just issued dummy HTTP requests, not dummy gethostbyname() calls, lol
[2018-01-23T13:21:36-0500] <dtz> anyway really can't imagine it's the same but :D
[2018-01-23T13:21:49-0500] <copumpkin> nice! fellow ex-jailbreakers unite
[2018-01-23T13:28:08-0500] <copumpkin> ooh niksnut  nss_load_all_libraries might be good
[2018-01-23T13:29:09-0500] <copumpkin> I wonder if we're supposed to call that though
[2018-01-23T13:29:25-0500] <copumpkin> technically I don't think we are, but how bad is it if we do?
[2018-01-23T13:32:31-0500] <Dezgeg> nss_load_all_libraries is static
[2018-01-23T13:32:48-0500] <copumpkin> yeah seeing that now
[2018-01-23T13:32:49-0500] <copumpkin> sigh
[2018-01-23T13:33:29-0500] <copumpkin> as is nss_load_library
[2018-01-23T13:34:13-0500] <copumpkin> __nss_disable_nscd could work if we went and undid all the global variables it set
[2018-01-23T13:34:56-0500] <copumpkin> super hacky though
[2018-01-23T13:34:58-0500] <Dezgeg> I think you really just need to do the dummy hostname lookup
[2018-01-23T13:34:58-0500] <copumpkin> probably easier just to resolve something and let NSS do its thing
[2018-01-23T13:35:28-0500] <Dezgeg> it's not like builtin:fetchurl is going to be very common
[2018-01-23T13:35:32-0500] <copumpkin> niksnut: that isn't beautiful, but would you accept it?
[2018-01-23T13:35:55-0500] <copumpkin> it seems like not sticking builtin:fetchurl into a child might be a cleaner long-term solution
[2018-01-23T13:36:06-0500] <copumpkin> or if we do, just disabling its sandbox
[2018-01-23T13:44:24-0500] <niksnut> copumpkin: well, I thought about doing builtin:fetchurl in a thread, but we would have to be careful about filtering out file:// etc.
[2018-01-23T13:44:35-0500] <niksnut> in general, running it as root is not a great idea
[2018-01-23T13:44:36-0500] <copumpkin> ah
[2018-01-23T13:44:44-0500] <copumpkin> I see
[2018-01-23T13:44:51-0500] <copumpkin> that makes sense, I guess
[2018-01-23T13:45:08-0500] <copumpkin> so I'll just resolve nixos.org early on and ignore the output?
[2018-01-23T13:45:18-0500] <copumpkin> can't think of a nicer way to do this that isn't horribly fragile
[2018-01-23T13:49:06-0500] <niksnut> just download http://invalid.invalid
[2018-01-23T13:49:45-0500] <copumpkin> you mean enqueue a download of it using the downloader on the parent process?
[2018-01-23T13:50:22-0500] <copumpkin> I guess invalid to force it to cycle through all possible host resolvers?
[2018-01-23T13:50:53-0500] <niksnut> try { getDownload()->download("http://invalid.invalid"); } catch (...) { }
[2018-01-23T13:51:10-0500] <niksnut> and do it in build.cc, not at startup
[2018-01-23T13:52:15-0500] <copumpkin> like at the top of startBuilder()?
[2018-01-23T13:53:18-0500] <copumpkin> gimme a bit to adjust my setup to use a custom Nix
[2018-01-23T13:53:33-0500] <niksnut> yeah
[2018-01-23T13:53:46-0500] <niksnut> conditional even on isBuiltin()
[2018-01-23T13:54:12-0500] <niksnut> and wrapped in std::call_once
[2018-01-23T13:59:26-0500] <copumpkin> cool, will try
[2018-01-23T14:04:41-0500] <gchristensen> it sort of looks like mac1 is trying to ask permission to do something, like install updates and reboot
[2018-01-23T14:15:22-0500] <copumpkin> niksnut: running a test with https://github.com/copumpkin/nix/commit/7ca2bd74e1c25a77a50a39dee15404518b194736 now, hold your breath :P
[2018-01-23T14:25:42-0500] <copumpkin> niksnut: I think that worked, but now I just need to get it to shut up about being unable to resolve the invalid URL :P
[2018-01-23T14:26:06-0500] <copumpkin> can I suppress those warnings?
[2018-01-23T14:26:45-0500] <copumpkin> also, it retries a few times, which is probably unnecessary
[2018-01-23T14:37:34-0500] <niksnut> copumpkin: maybe put that in a separate function, like preloadNSS()
[2018-01-23T14:37:45-0500] <niksnut> I think you can specify a retry count
[2018-01-23T14:38:01-0500] <niksnut> in request
[2018-01-23T14:38:07-0500] <copumpkin> yeah, I'm turning .tries = 1
[2018-01-23T14:38:15-0500] <copumpkin> but the warning is a bit less clear
[2018-01-23T14:38:33-0500] <niksnut> maybe request needs a flag to suppress warnings
[2018-01-23T14:38:56-0500] <copumpkin> I proposed calling it `bool stfu` in the PR :P
[2018-01-23T14:39:17-0500] <copumpkin> feels a bit awkward but I can do that if you think it's right
[2018-01-23T14:39:20-0500] <copumpkin> (obviously kidding about the name)
[2018-01-23T14:41:16-0500] <niksnut> where does the warning come from anyway?
[2018-01-23T14:42:08-0500] <copumpkin> https://github.com/NixOS/nix/blob/master/src/libstore/download.cc#L329 followed by https://github.com/NixOS/nix/blob/master/src/libstore/download.cc#L335
[2018-01-23T14:42:19-0500] <copumpkin> so I get
[2018-01-23T14:42:20-0500] <copumpkin> warning: unable to download 'http://this.pre-initializes.the.dns.resolvers.invalid': Couldn't resolve host name (6); retrying in 285 ms
[2018-01-23T14:42:40-0500] <niksnut> well, if tries = 1 it shouldn't print that
[2018-01-23T14:43:09-0500] <copumpkin> true
[2018-01-23T14:46:47-0500] <copumpkin> how's this? https://github.com/NixOS/nix/pull/1813/files
[2018-01-23T14:46:52-0500] <copumpkin> gonna run another test to make sure it's quiet
[2018-01-23T14:53:00-0500] <copumpkin> niksnut: looks good and quiet now :D
[2018-01-23T14:57:00-0500] <copumpkin> I suppose you might prefer the once_flag being a static variable inside preloadNSS?
[2018-01-23T15:05:11-0500] <niksnut> copumpkin: thanks!
[2018-01-23T15:06:03-0500] <copumpkin> thank you (and dezgeg) for all the help tracking it down :)
[2018-01-23T15:06:48-0500] <copumpkin> I feel very important to this project: it's essential to have someone very unlucky bug testing stuff :P
[2018-01-23T15:27:30-0500] <copumpkin> I'm gonna trigger a hydra nix eval and then update nixUnstable in nixpkgs

Trying to dig up more context on when this arises from the logs.

@copumpkin
Copy link
Member

Okay, here's much more context:

[2018-01-23T09:07:48-0500] <copumpkin> niksnut: I'm making progress on my mysterious sandbox bug. It only shows up with --check! Still no clean repro unfortunately, but I think I'm honing in on it
[2018-01-23T09:07:50-0500] <copumpkin> homing
[2018-01-23T09:08:00-0500] <niksnut> weird
[2018-01-23T09:08:26-0500] <copumpkin> yeah, can't explain it yet based on the code
[2018-01-23T09:11:40-0500] <Dezgeg> maybe without --check it manages to read nsswitch.conf from before entering the sandbox and caches that?
[2018-01-23T09:11:52-0500] <Dezgeg> and/or resolv.conf
[2018-01-23T09:14:19-0500] <copumpkin> now trying to get it to fail in a builder that I control rather than builtin:fetchurl
[2018-01-23T09:15:42-0500] <copumpkin> maybe that'll clarify what's going on
[2018-01-23T09:16:46-0500] <niksnut> copumpkin: have you tried strace?
[2018-01-23T09:17:24-0500] <copumpkin> yup, nothing obvious jumped out at me, but I'll look more carefully soon if this path shows me nothing
[2018-01-23T09:19:21-0500] <niksnut> I would look for connect() calls
[2018-01-23T09:19:30-0500] <niksnut> and grep for nscd
[2018-01-23T09:20:22-0500] <copumpkin> yeah I did, it kept trying to connect to the nscd socket but it wasn't there, iirc
[2018-01-23T09:20:22-0500] <copumpkin> will look again soon
[2018-01-23T09:46:28-0500] <copumpkin> okay, so it doesn't seem to happen for my non-builtin builder with other conditions identical
[2018-01-23T09:46:30-0500] <copumpkin> so I guess strace it is
[2018-01-23T09:54:43-0500] <copumpkin> so I see various messages about trying to connect to nscd socket, but ENOENT
[2018-01-23T09:54:44-0500] <copumpkin> e.g.,
[2018-01-23T09:54:44-0500] <copumpkin> [pid 1347] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[2018-01-23T09:55:09-0500] <copumpkin> same in the child
[2018-01-23T09:55:45-0500] <copumpkin> although it's broken up with other nonsense due to concurrency
[2018-01-23T09:56:09-0500] <copumpkin>  
[2018-01-23T09:56:10-0500] <copumpkin> [pid 1360] stat("/etc/resolv.conf", <unfinished ...>
[2018-01-23T09:56:11-0500] <copumpkin>  [pid 1360] <... open resumed> ) = -1 ENOENT (No such file or directory)
[2018-01-23T09:56:11-0500] <copumpkin> that's also the child I think
[2018-01-23T09:56:25-0500] <copumpkin> shouldn't there be a resolv.conf in the sandbox?
[2018-01-23T09:56:39-0500] <Dezgeg> so no resolv.conf + no nscd = no dns in sandbox then
[2018-01-23T09:57:11-0500] <copumpkin> trying now builtin:fetchurl against direct IP
[2018-01-23T09:57:17-0500] <copumpkin> to see if it's general networking or just DNS
[2018-01-23T09:57:38-0500] <copumpkin> so yeah, just DNS
[2018-01-23T09:59:48-0500] <Dezgeg> is there a resolv.conf outside sandbox?
[2018-01-23T10:01:14-0500] <Dezgeg> is it a symlink?
[2018-01-23T10:11:47-0500] <copumpkin> Dezgeg: it exists and is a regular file :/
[2018-01-23T10:12:50-0500] <copumpkin> not sure why the bot didn't look at it, actually
[2018-01-23T10:12:59-0500] <copumpkin> gchristensen: any idea?
[2018-01-23T10:13:05-0500] <copumpkin> https://github.com/NixOS/nixpkgs/pull/34178
[2018-01-23T10:13:14-0500] <gchristensen> whats up?
[2018-01-23T10:13:34-0500] <copumpkin> https://github.com/NixOS/nixpkgs/pull/34193
[2018-01-23T10:13:53-0500] <copumpkin> gchristensen: just wondering why ofborg didn't notice 34178
[2018-01-23T10:14:01-0500] <gchristensen> hmm
[2018-01-23T10:14:22-0500] <gchristensen> I'm not sure :/ I'll dig in to that shortly
[2018-01-23T10:15:34-0500] <niksnut> copumpkin: maybe you have a non-standard nsswitch.conf?
[2018-01-23T10:16:26-0500] <copumpkin> niksnut: it probably is, yeah, just trying to figure out what's weird about it. This is on AWS CodeBuild, FWIW
[2018-01-23T10:16:40-0500] <copumpkin> so it's inside a docker container
[2018-01-23T10:17:10-0500] <niksnut> https://github.com/NixOS/nix/issues/687
[2018-01-23T10:18:05-0500] <Dezgeg> actually, now that I see that those lines don't match: [pid 1360] stat("/etc/resolv.conf", <unfinished ...> [pid 1360] <... open resumed> ) = -1 ENOENT (No such file or directory)
[2018-01-23T10:18:08-0500] <niksnut> probably we shouldn't bind-mount the host's nsswitch.conf, but install a minimal one
[2018-01-23T10:18:27-0500] <Dezgeg> stat() is unfinished but open() gets ENOENT
[2018-01-23T10:18:55-0500] <copumpkin> niksnut: in my case I had to avoid nscd on purpose because it was somehow going out to the host (as far as I could tell) and getent would fail for nix looking up the build user groups
[2018-01-23T10:19:00-0500] <niksnut> edolstra, We found a potential security vulnerability in a repository for which you have been granted security alert access.
[2018-01-23T10:19:11-0500] <copumpkin> I could probably be more precise about that
[2018-01-23T10:19:15-0500] <niksnut> github complaining about gemfile.lock files again
[2018-01-23T10:19:18-0500] <copumpkin> niksnut: nice!
[2018-01-23T10:19:52-0500] <copumpkin> lol
[2018-01-23T10:20:11-0500] <copumpkin> Dezgeg: yeah the lines might not match up properly, because I'm getting junk from all the concurrent processes with strace -f
[2018-01-23T10:20:21-0500] <copumpkin> is tehre a good way to disentangle those?
[2018-01-23T10:20:25-0500] <Dezgeg> you can split it to one file per process with some strace flag
[2018-01-23T10:24:33-0500] <copumpkin> ah, cool
[2018-01-23T10:24:34-0500] <copumpkin> didn't know about that one
[2018-01-23T10:50:54-0500] <domenkozar> otherwise it can't know?
[2018-01-23T10:52:06-0500] <copumpkin> Dezgeg: okay I now have a per-PID breakdown :P
[2018-01-23T11:18:36-0500] <copumpkin> Dezgeg, niksnut: so I'm looking for write(2, "\1\n", 2) as what happens right before builtinFetchurl
[2018-01-23T11:20:56-0500] <copumpkin> https://pastebin.com/aKf6Ti8t is all I see
[2018-01-23T11:21:57-0500] <copumpkin> there's no nix-daemon involved so I wouldn't expect much IPC, but perhaps the downloader stuff is weird?
[2018-01-23T11:22:41-0500] <copumpkin> I guess it must be in a different process
[2018-01-23T11:22:53-0500] <copumpkin> because I don't see any writes with the warnings about resolving the host name
[2018-01-23T11:28:59-0500] <copumpkin> niksnut: oh, so the actual download happens in the parent process? we seem to fork, then builtinFetchurl in the child somehow gets the parent process to run the actual download? definitely seeing the resolution messages coming from the parent in strace
[2018-01-23T11:30:02-0500] <niksnut> copumpkin: no
[2018-01-23T11:30:09-0500] <niksnut> download is done in the child
[2018-01-23T11:30:21-0500] <copumpkin> well, then this strace is super confusing
[2018-01-23T11:31:06-0500] <copumpkin> because the write(2, "warning: unable to download" is coming from the parent
[2018-01-23T11:31:08-0500] <copumpkin> whereas the final error: download failed comes from teh child
[2018-01-23T11:31:31-0500] <copumpkin> to be clear, I get a few of these: "
[2018-01-23T11:31:31-0500] <copumpkin> warning: unable to download 'http://tarballs.nixos.org/sha256/fd9ca16de1052aac899ad3495ad20dfa906c27b4a5070102a2ec35ca3a4740c1': Couldn't resolve host name (6); retrying in 2608 ms" followed by "
[2018-01-23T11:31:32-0500] <copumpkin> error: unable to download 'http://tukaani.org/xz/xz-5.2.3.tar.bz2': Couldn't resolve host name (6)"
[2018-01-23T11:31:45-0500] <niksnut> copumpkin: isn't it just passing on a message from the child?
[2018-01-23T11:32:06-0500] <copumpkin> I see no corresponding writes on the child process
[2018-01-23T11:32:08-0500] <copumpkin> let me double check
[2018-01-23T11:32:51-0500] <copumpkin> yeah, no writes containing that on the child. I suppose they might be hidden in a longer string that strace truncated
[2018-01-23T11:32:59-0500] <copumpkin> but the child does comparatively little as I showed above
[2018-01-23T11:33:01-0500] <copumpkin> in that paste
[2018-01-23T11:33:14-0500] <copumpkin> you can see the write(2, "\1\n" indicating that we're about to do childish things
[2018-01-23T11:33:19-0500] <copumpkin> and then it just exits
[2018-01-23T11:33:26-0500] <copumpkin> after very little work
[2018-01-23T11:35:53-0500] <niksnut> I see write(2, "error: unable to download 'http:"..., 99) = 99
[2018-01-23T11:36:09-0500] <niksnut> at the end of the trace
[2018-01-23T11:36:24-0500] <copumpkin> yeah
[2018-01-23T11:36:58-0500] <copumpkin> but that's the final error message. But where did those warnings come from?
[2018-01-23T11:37:51-0500] <copumpkin> it retries 8 times (4x tarballs.nixos.org and 4x the original tukaani host) and warns for each of them before giving up with that error
[2018-01-23T11:38:12-0500] <niksnut> probably they came from the download thread
[2018-01-23T11:38:15-0500] <niksnut> try tracing with -f
[2018-01-23T11:39:02-0500] <copumpkin> this is with -f
[2018-01-23T11:39:11-0500] <copumpkin> or rather, -ff splitting all the processes into different output files
[2018-01-23T11:39:20-0500] <copumpkin> because it was impossible to follow otherwise
[2018-01-23T11:39:46-0500] <copumpkin> but the only place I see those warnings come from is the parent process
[2018-01-23T11:39:50-0500] <copumpkin> that's why I'm confused
[2018-01-23T11:40:06-0500] <copumpkin> as I said, they might be buried in a long string that strace is truncating
[2018-01-23T11:40:19-0500] <copumpkin> is there a distinctive syscall I should look for that might indicate the downloader process?
[2018-01-23T11:40:27-0500] <niksnut> clone()
[2018-01-23T11:40:33-0500] <copumpkin> I was using "\1\n" to find the child
[2018-01-23T11:40:36-0500] <copumpkin> ok
[2018-01-23T11:40:50-0500] <niksnut> unfortunately, it's pid 2 because of the pid namespafe
[2018-01-23T11:40:54-0500] <niksnut> namespace
[2018-01-23T11:41:08-0500] <niksnut> so you'll have to map that back to the host pid somehow
[2018-01-23T11:41:33-0500] <copumpkin> I don't think strace pays any attention to that
[2018-01-23T11:41:36-0500] <copumpkin> none of my pids are 2
[2018-01-23T11:41:45-0500] <copumpkin> or you mean the return value :/
[2018-01-23T11:41:50-0500] <niksnut> clone(child_stack=0x7f4252cdaf70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f4252cdb9d0, tls=0x7f4252cdb700, child_tidptr=0x7f4252cdb9d0) = 2
[2018-01-23T11:42:30-0500] <niksnut> oh there's another thread
[2018-01-23T11:42:31-0500] <niksnut> clone(child_stack=0x7f4252cdaf70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f4252cdb9d0, tls=0x7f4252cdb700, child_tidptr=0x7f4252cdb9d0) = 8
[2018-01-23T11:42:34-0500] <copumpkin> I see
[2018-01-23T11:43:16-0500] <niksnut> not sure what that is...
[2018-01-23T11:43:52-0500] <copumpkin> I see a lot of tiny processes that do nothing but set_robust_list(...) and then a bunch of futez calls and then die
[2018-01-23T11:43:56-0500] <copumpkin> futex
[2018-01-23T11:45:04-0500] <copumpkin> so I'm guessing that those clones must be the next processes in order (there's almost nothing happening on this machine)
[2018-01-23T11:45:22-0500] <copumpkin> yeah, that makes sense
[2018-01-23T11:45:45-0500] <copumpkin> https://pastebin.com/e0zDjGf3
[2018-01-23T11:45:54-0500] <copumpkin> I think that's the clone = 2 from above
[2018-01-23T11:46:06-0500] <copumpkin> niksnut: I'm guessing the 8 might be trying tukaani after tarballs.nixos.org failed?
[2018-01-23T11:46:53-0500] <copumpkin>  
[2018-01-23T11:46:54-0500] <copumpkin> openat(AT_FDCWD, "/nix/store/1zv5dwifxg5fh08gif8ld3h9f40y8czh-glibc-2.26-115/lib/libnss_dns.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[2018-01-23T11:47:01-0500] <copumpkin> this looks like an actual downloader
[2018-01-23T11:47:19-0500] <copumpkin> https://pastebin.com/EJ1ur8V2
[2018-01-23T11:47:39-0500] <copumpkin> so it tries nscd socket, looks in resolv.conf, looks in hosts, then tries libnss_dns
[2018-01-23T11:47:42-0500] <copumpkin> and then dies
[2018-01-23T11:49:03-0500] <copumpkin> no attempts to look at nsswitch, it seems
[2018-01-23T11:49:42-0500] <Dezgeg> what's the hosts: line in nsswitch?
[2018-01-23T11:49:55-0500] <niksnut> copumpkin: that file is supposed to exist...
[2018-01-23T11:50:04-0500] <copumpkin> yeah
[2018-01-23T11:50:17-0500] <Dezgeg> not in the chroot
[2018-01-23T11:50:29-0500] <Dezgeg> if you mean libnss_dns.so.2
[2018-01-23T11:50:31-0500] <niksnut> hm
[2018-01-23T11:50:42-0500] <Dezgeg> it's the library used by the executing nix
[2018-01-23T11:51:36-0500] <gchristensen> https://github.com/strace/strace/issues/14
[2018-01-23T11:51:40-0500] <copumpkin> running another build to dump contents of nsswitch.conf
[2018-01-23T11:51:56-0500] <copumpkin> although why would it matter if the child isn't reading it:
[2018-01-23T11:52:00-0500] <Dezgeg> isn't it just the hosts nsswitch?
[2018-01-23T11:52:18-0500] <Dezgeg> probably it got cached before the fork for whatever reason
[2018-01-23T11:52:46-0500] <copumpkin> could something in the parent be reading nsswitch so the child doesn't bother?
[2018-01-23T11:52:51-0500] <copumpkin> yeah maybe
[2018-01-23T11:53:29-0500] <Dezgeg> but if libnss_dns.so.2 is the one that actually connects to the dns servers, then it makes sense why it doesn't work
[2018-01-23T11:55:03-0500] <copumpkin> nsswitch just says "hosts: files dns"
[2018-01-23T11:55:18-0500] <Dezgeg> I suppose the 'files' one then just looks in /etc/hosts
[2018-01-23T11:55:44-0500] <copumpkin> yeah
[2018-01-23T11:55:44-0500] <copumpkin> no compat
[2018-01-23T11:55:53-0500] <copumpkin> but then again I'm not running nscd
[2018-01-23T11:56:06-0500] <Dezgeg> yes
[2018-01-23T11:56:13-0500] <Dezgeg> if you were it'd connect to it and work
[2018-01-23T11:56:27-0500] <copumpkin> so the issue is that the sandbox doesn't have enough stuff in it to allow DNS resolution?
[2018-01-23T11:56:39-0500] <copumpkin> what's unclear to me is why this seems to only happen during --check
[2018-01-23T11:57:01-0500] <copumpkin> maybe it's just a flawed test on my side and the non-check version is getting substituted
[2018-01-23T11:57:15-0500] <Dezgeg> something caused it to be loaded in the parent, like substituter doing dns lookups
[2018-01-23T11:57:34-0500] <copumpkin> oh
[2018-01-23T11:57:38-0500] <copumpkin> so this is fork weirdness
[2018-01-23T11:57:44-0500] <Dezgeg> no
[2018-01-23T11:57:47-0500] <copumpkin> I wish I had a clean repro
[2018-01-23T11:58:01-0500] <copumpkin> it's such a PITA to trigger CodeBuild runs for each test I run
[2018-01-23T12:00:01-0500] <Dezgeg> which way it is? fails when run _with_ --check?
[2018-01-23T12:01:03-0500] <copumpkin> yeah
[2018-01-23T12:01:05-0500] <copumpkin> and only seems to fail when run with --check
[2018-01-23T12:01:19-0500] <Dezgeg> how about without check but with substitutes disabled?
[2018-01-23T12:02:02-0500] <copumpkin> trying it, but IIRC that still worked
[2018-01-23T12:02:41-0500] <Dezgeg> I bet if you strace that, someone loads libnss_dns.so.2 in the parent and the child in the sandbox never loads libnss_dns.so.2
[2018-01-23T12:02:56-0500] <Dezgeg> strace the working build
[2018-01-23T12:03:34-0500] <copumpkin> yeah, give it a few minutes to run :D
[2018-01-23T12:04:54-0500] <copumpkin> although
[2018-01-23T12:05:04-0500] <copumpkin> in one of my pastes above
[2018-01-23T12:05:18-0500] <copumpkin> I pasted the parent process and that opens libnss_dns successfully
[2018-01-23T12:05:29-0500] <copumpkin> before it ever spawns a child
[2018-01-23T12:06:08-0500] <copumpkin> aha, the --check was a red herring, and you're right, if I run it with no --check and no substitutes, I get the same error
[2018-01-23T12:06:10-0500] <copumpkin> that's good
[2018-01-23T12:06:37-0500] <copumpkin> so it seems like builtin:fetchurl just can't resolve hostnames in this environment
[2018-01-23T12:06:48-0500] <Dezgeg> which paste shows that?
[2018-01-23T12:08:15-0500] <copumpkin> sorry, I must've forgotten to paste it
[2018-01-23T12:08:28-0500] <copumpkin> I'll dig up another trace without --check or any confounding factors
[2018-01-23T12:14:19-0500] <copumpkin> aha, I can repro it outside of CodeBuild
[2018-01-23T12:14:21-0500] <copumpkin> this will make experimenting easier
[2018-01-23T12:14:22-0500] <copumpkin> !
[2018-01-23T12:17:47-0500] <copumpkin> okay this time I don't see libnss_dns.so, maybe because I disabled substitutes
[2018-01-23T12:17:49-0500] <copumpkin> in the parent
[2018-01-23T12:31:18-0500] <copumpkin> haha
[2018-01-23T12:31:44-0500] <copumpkin> so now my repro
[2018-01-23T12:31:52-0500] <copumpkin> involves builtins.seq (builtins.fetchurl "https://google.com") ...
[2018-01-23T12:32:05-0500] <copumpkin> if the seq is there, it works
[2018-01-23T12:32:12-0500] <copumpkin> if instead of builtins.fetchurl, I seq `5`, then it fails
[2018-01-23T12:32:19-0500] <copumpkin> niksnut: :P
[2018-01-23T12:33:04-0500] <copumpkin> so I think this is actually a more general bug but is almost always masked by nix doing DNS resolution of its own prior to calling a builder
[2018-01-23T12:33:16-0500] <copumpkin> in my case, being a super narrow build, it didn't do that and failed
[2018-01-23T12:33:54-0500] <copumpkin> it probably also relies on nscd being turned off
[2018-01-23T12:34:38-0500] <Dezgeg> yes
[2018-01-23T12:35:32-0500] <copumpkin> so basically, I'm extremely unlucky :)
[2018-01-23T12:35:41-0500] <copumpkin> but I'm not imagining things
[2018-01-23T12:38:07-0500] <niksnut> copumpkin: hm, I wonder why it's doing DNS resolution. Probably checking the binary cache.
[2018-01-23T12:38:28-0500] <copumpkin> yeah, or in my case (when I disable substitutes), builtins.fetchurl
[2018-01-23T12:38:39-0500] <niksnut> ah right
[2018-01-23T12:39:51-0500] <copumpkin> so the hacky fix would be to just insert a dummy resolution :P
[2018-01-23T12:39:53-0500] <copumpkin> always
[2018-01-23T12:39:54-0500] <copumpkin> but that doesn't feel ideal
[2018-01-23T12:41:38-0500] <copumpkin> maybe always insert nix's glibc into the sandbox?
[2018-01-23T12:41:54-0500] <copumpkin> I think the key issue is that if you're using a builtin builder, you could in theory use all of nix's own closure
[2018-01-23T12:42:03-0500] <copumpkin> in practice builtin:fetchurl is the only one and that doesn't do much
[2018-01-23T12:42:19-0500] <niksnut> the whole point of builtins:fetchurl was that it wouldn't require any sandbox configuration
[2018-01-23T12:42:30-0500] <copumpkin> but if someone were to add another builtin builder, that might be very confusing if it used any more of nix's dependencies
[2018-01-23T12:42:39-0500] <niksnut> we may as well get rid of it if that's not the case
[2018-01-23T12:42:49-0500] <copumpkin> well, I just mean in theory, if builder.isBuiltin then sandbox += nix.closure
[2018-01-23T12:42:54-0500] <copumpkin> hmm
[2018-01-23T12:43:00-0500] <niksnut> there is no nix closure necessarily
[2018-01-23T12:43:10-0500] <copumpkin> oh, because it might be installed in /usr
[2018-01-23T12:43:16-0500] <niksnut> right, or another nix store
[2018-01-23T12:43:30-0500] <copumpkin> hmm
[2018-01-23T12:43:39-0500] <niksnut> but maybe we can just force nss modules to be loaded
[2018-01-23T12:43:46-0500] <niksnut> in the parent
[2018-01-23T12:43:49-0500] MichaelRaskin (~MichaelRa@147.210.22.169) joined the channel
[2018-01-23T12:43:58-0500] <copumpkin> yeah, that might be easiest
[2018-01-23T12:45:38-0500] <copumpkin> nastiest solution of all: whenever the evaluator evaluates something, it implicitly prepends "builtins.sec (buitlins.fetchurl ...)"
[2018-01-23T12:45:38-0500] <copumpkin> :P
[2018-01-23T12:45:49-0500] <copumpkin> seq

@copumpkin
Copy link
Member

copumpkin commented Jun 12, 2018

So it boils down to:

If you use builtins.fetchurl from inside a sandboxed build before you've forced the NSS machinery to load its various libraries into memory, it will fail to resolve any DNS names, because once you're in the sandbox, the relevant libraries (as specified presumably by nsswitch.conf) are not available over the filesystem anymore. Furthermore, we can't really put those libraries into the sandbox easily, so preloading seemed like our best bet.

@dtzWill
Copy link
Member

dtzWill commented Jun 12, 2018

Thank you very much!

I also found this: https://nixos.org/nix-dev/2008-April/000684.html
and this: https://bugzilla.redhat.com/show_bug.cgi?id=481796

@vcunat
Copy link
Member

vcunat commented Jun 12, 2018

Hmm, sounds reasonable. I can't see any mention of reacting to the final dot in man 3 getaddrinfo or man 5 resolv.conf, but I guess that doesn't matter, as the whole hack is meant just for one implementation (glibc).

@dtzWill
Copy link
Member

dtzWill commented Jun 13, 2018

If I understand correctly this is for the case where nscd is disabled, yes?

Finally reproduced this (when removing preloadNSS) using variation of previous command:

$ ./nix-nopreload/bin//nix-build '<nix/fetchurl.nix>' --argstr url "https://example.com?$(cat /dev/urandom | tr -dc 'a-z0-9' | head --bytes 4)" --argstr sha256 08y4734bm2zahw75b16bcmcg587vvyvh0n11gwiyir70divwp1rm --option sandbox true --store $HOME/test-store -vvvvv --option builders "" --option substituters ""

It looks like glibc itself in its testsuite does this: https://sourceware.org/git/?p=glibc.git;a=blob;f=resolv/tst-resolv-res_init-skeleton.c;h=a5061e6d4fb98311e2412d845b2aa01c6ae16202;hb=HEAD#l320
So if using getaddrinfo gives us trouble, we can do something like this which is a bit more "to the point":

dtzWill@ee147d5

Tested with/without to confirm it works. I consider it a feature that this only preloads DNS and not anything else, but maybe I'm missing some use-cases.

@copumpkin
Copy link
Member

The issue with what you're doing is that I don't think the set of libraries is necessarily fixed, and is instead determined by nsswitch.conf. For example, if you use systemd-machined, you can have an nsswitch module for doing easy name resolution against your machine names, implemented by another loadable module.

My preferred solution would be to make the nss_load_all_libraries method non-static upstream, explaining why. In the meantime we could patch it in nixpkgs, but that still won't be great because Nix needs to build against a bunch of glibc versions, so it would be ages until we could assume that function was available. We could also do some fugly hackery to approximate or even replicate its logic here, but none of these options make me very happy.

cc @edolstra in case you have any ideas, since it might not be obvious that this conversation is continuing in this closed ticket.

Speaking of that, @dtzWill perhaps we should open another ticket explaining the situation and soliciting suggestions?

@vcunat
Copy link
Member

vcunat commented Jun 13, 2018

Any requirement to modify glibc probably won't solve this completely anytime soon. I think we need this to work even on various distro glibc versions and NSS configurations.

@copumpkin
Copy link
Member

Yeah, I acknowledged that 😄

that still won't be great because Nix needs to build against a bunch of glibc versions, so it would be ages until we could assume that function was available

I welcome better suggestions though!

@edolstra
Copy link
Member

BTW, having a .invalid name that resolves seems like a violation of RFC 2606, so we could consider a local (mis)configuration issue.

@copumpkin
Copy link
Member

Don't some shady ISPs resolve everything to their "helpfu"l (ad-laden) search page? I remember I had to opt out of it a while back on my Verizon connection 😦

@vcunat
Copy link
Member

vcunat commented Jun 13, 2018

On DNS level it attempts to resolve invalid.yorick.cc. or such, so the (new) RFC doesn't really apply.

EDIT: shady ISPs that put ads or whatever instead of non-existing names are a completely different level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants