Skip to content

Commit

Permalink
nixos xkb-layouts-exist: try to debug on Hydra
Browse files Browse the repository at this point in the history
I hate having to do this.  We're unable to reproduce the problem locally.
  • Loading branch information
vcunat committed Jun 28, 2017
1 parent 607650a commit 2014db3
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions nixos/modules/services/x11/xserver.nix
Expand Up @@ -651,6 +651,7 @@ in
system.extraDependencies = singleton (pkgs.runCommand "xkb-layouts-exist" {
inherit (cfg) layout xkbDir;
} ''
set -x
sed -n -e ':i /^! \(layout\|variant\) *$/ {
:l; n; /^!/bi; s/^ *\([^ ]\+\).*/\1/p; tl
}' "$xkbDir/rules/base.lst" | grep -qxF "$layout" && exec touch "$out"
Expand Down

23 comments on commit 2014db3

@pbogdan
Copy link
Member

@pbogdan pbogdan commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW below would appear to fix the failure for me:

diff --git a/nixos/modules/services/x11/xserver.nix b/nixos/modules/services/x11/xserver.nix
index 01e1659b30..309449afb1 100644
--- a/nixos/modules/services/x11/xserver.nix
+++ b/nixos/modules/services/x11/xserver.nix
@@ -653,8 +653,12 @@ in
     } ''
       sed -n -e ':i /^! \(layout\|variant\) *$/ {
         :l; n; /^!/bi; s/^ *\([^ ]\+\).*/\1/p; tl
-      }' "$xkbDir/rules/base.lst" | grep -qxF "$layout" && exec touch "$out"
+      }' "$xkbDir/rules/base.lst" | grep -qxF "$layout"

+      if [ $? -eq 0 ]; then
+        touch $out
+        exit 0
+      fi
       cat >&2 <<-EOF

       The selected keyboard layout definition does not exist:

From my observations the failure here is non-deterministic as I also had trouble reproducing this initially. After many attempts I managed to reproduce it somewhat reliably however it still behaves somewhat erratically with build randomly succeeding or failing.
For reasons I don't fully understand the above patch makes the build consistent for me, I just completed 100 runs without a failure:

$ while [ $count -le 100 ]; do nix-store --delete /nix/store/vk2m4vpw6janf1hnn80j69c6hzj2w460-xkb-layouts-exist; nix-store -r /nix/store/a81lfnqvb2zn0803ai0snvvlmn49c365-xkb-layouts-exist.drv; count=$((count +1)); done  2>&1 | grep "signal" | wc -l
0

With patch reverted:

$ while [ $count -le 100 ]; do nix-store --delete /nix/store/649fx18rd5kqz83mmvb4ba46a2dpzzr8-xkb-layouts-exist; nix-store -r /nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv; count=$((count +1)); done  2>&1 | grep "signal" | wc -l
31

It seems this might be an issue in nix itself as in the failing cases it appears to be killing the builder process before it exits on its own, hence the signal 9 (Killed) error. The process gets killed after producing full output (touch $out) but before it fully finishes executing. It is unclear to me as to why that occurs and why the behavior is random.
EDIT I'm testing on top of current unstable + xkb commit.

@vcunat
Copy link
Member Author

@vcunat vcunat commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the exec in top-level process confuses nix daemon.

@pbogdan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exec confusing things was my hypothesis as well but again not sure as to why :-(. Based on output produced this could be related to this comment and the assumption that the process has already exited, that for some reason doesn't hold in this case.

@vcunat
Copy link
Member Author

@vcunat vcunat commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't see why exec should cause EOF, but I'm not so skilled in POSIX/bash details.

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess @pbogdan is on the right track here, this looks like some kind of race condition.

In theory this should simulate something similar what @pbogdan has tried but I couldn't reproduce it :-/

for i in $(seq 1000); do
  nix-build -E 'with import <nixpkgs> {}; runCommand "test" { foo = __currentTime; } "foo=\"$(for i in $(seq 4000); do echo bleh; done; echo yep; for i in $(seq 4000); do echo bleh; done)\"; echo \"$foo\" | grep -qxF yep && exec touch $out"' || break;
done

@pbogdan: Which Nix version did you test this on?

@vcunat
Copy link
Member Author

@vcunat vcunat commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it doesn't seem deterministic even on Hydra.

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vcunat: The patch from @pbogdan however has the problem that if the grep fails the build process is just failing without the error message. I'm going to push a similar patch in a minute.

@pbogdan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aszlig this is on nix-env (Nix) 1.12pre5413_b4b1f452

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pbogdan: Okay, I've tested it on the some version, maybe my machine is not loaded enough to trigger the race condition. Could you try whether you can also reproduce it with the one-liner I mentioned?

@vcunat
Copy link
Member Author

@vcunat vcunat commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early exit with exec wasn't exactly readable to me, too, though perhaps it's about getting used to such stuff.

@pbogdan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aszlig :

$ for i in $(seq 1000); do                                                                                                                                                             ~/nixpkgs
  nix-build -E 'with import <nixpkgs> {}; runCommand "test" { foo = __currentTime; } "foo=\"$(for i in $(seq 4000); do echo bleh; done; echo yep; for i in $(seq 4000); do echo bleh; done)\"; echo \"$foo\" | grep -qxF yep && exec touch $out"' || break;
done
these derivations will be built:
  /nix/store/vask0r01xnph1ign153qnq7zb7v0z926-test.drv
building path(s) ‘/nix/store/cj1nsycipqafs8bl6ywnss9bsirzgvbq-test’
/nix/store/cj1nsycipqafs8bl6ywnss9bsirzgvbq-test
/nix/store/cj1nsycipqafs8bl6ywnss9bsirzgvbq-test
these derivations will be built:
  /nix/store/v6pav5d8k20hhdicfsl6xlj3vnrm3fgj-test.drv
building path(s) ‘/nix/store/a213rhq4anqivvlrb2dgy3xcb9js187i-test’
builder for ‘/nix/store/v6pav5d8k20hhdicfsl6xlj3vnrm3fgj-test.drv’ failed due to signal 9 (Killed)
error: build of ‘/nix/store/v6pav5d8k20hhdicfsl6xlj3vnrm3fgj-test.drv’ failed

Behavior is random, I can get few iterations to complete but eventually it fails before completing a full run.

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed in e82d126.

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this clearly was just a workaround. I really think we should fix this in Nix, but we first need to figure out why exactly this happens. @pbogdan: Please let me know whether you could reproduce this.

@vcunat
Copy link
Member Author

@vcunat vcunat commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cooperation ❤️

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pbogdan: Ah, sorry, didn't see your comment.

This is an strace -e trace=close of the test case:

building path(s) ‘/nix/store/6vnppd42h815gj24x4sybi2bcd2n9db0-test’
close(3)                                = 0
close(3)                                = 0
/nix/store/c79jgqd54690bz00ira4ciskn98ndxcp-strace-4.17/bin/strace: Process 38 attached
[pid    37] close(4)                    = 0
[pid    37] close(4)                    = -1 EBADF (Bad file descriptor)
/nix/store/c79jgqd54690bz00ira4ciskn98ndxcp-strace-4.17/bin/strace: Process 39 attached
[pid    37] close(3)                    = 0
[pid    39] close(3)                    = 0
[pid    38] close(3)                    = 0
[pid    38] close(4)                    = 0
/nix/store/c79jgqd54690bz00ira4ciskn98ndxcp-strace-4.17/bin/strace: Process 40 attached
[pid    38] close(4)                    = 0
[pid    40] close(4)                    = 0
[pid    40] close(3)                    = 0
[pid    39] close(3)                    = 0
/nix/store/c79jgqd54690bz00ira4ciskn98ndxcp-strace-4.17/bin/strace: Process 41 attached
[pid    39] close(3)                    = 0
[pid    39] close(3)                    = 0
[pid    41] close(3)                    = 0
[pid    41] close(3)                    = 0
[pid    41] close(3)                    = 0
[pid    41] close(3)                    = 0
[pid    41] close(3)                    = 0
[pid    41] close(1)                    = 0
[pid    41] close(2)                    = 0
[pid    41] +++ exited with 0 +++
[pid    40] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=41, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid    40] +++ exited with 0 +++
[pid    38] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=40, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid    38] close(3)                    = 0
[pid    39] close(1)                    = 0
[pid    39] close(2)                    = 0
[pid    38] +++ exited with 0 +++
[pid    39] +++ exited with 0 +++
close(3)                                = -1 EBADF (Bad file descriptor)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=38, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
close(3)                                = 0
close(3)                                = 0
close(3)                                = 0
close(3)                                = 0
close(3)                                = 0
close(3)                                = 0
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
+++ exited with 0 +++
/nix/store/6vnppd42h815gj24x4sybi2bcd2n9db0-test

So it seems that fd 1/2 have been closed before the actual process substitution, but the close was within a subprocess so in theory it shouldn't have any effect on the builder.

@pbogdan
Copy link
Member

@pbogdan pbogdan commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tested and the original commit + patch passes 100 iterations 😄 . Here's what I was seeing with just the original commit in case it helps:

Success:

20082 prctl(PR_SET_NAME, "touch")       = 0
20082 prctl(PR_SET_KEEPCAPS, 140737488348680) = -1 EINVAL (Invalid argument)
20082 brk(NULL)                         = 0x780000
20082 brk(0x7a1000)                     = 0x7a1000
20082 open("/nix/store/649fx18rd5kqz83mmvb4ba46a2dpzzr8-xkb-layouts-exist", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
20082 dup2(3, 0)                        = 0
20082 close(3)                          = 0
20082 utimensat(0, NULL, NULL, 0)       = 0
20082 close(0)                          = 0
20082 close(1)                          = 0
20082 close(2)                          = 0
20075 <... select resumed> )            = 1 (in [12])
20082 exit_group(0)                     = ?
20075 clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
20082 +++ exited with 0 +++
20075 <... clock_gettime resumed> {tv_sec=15798, tv_nsec=15589833}) = 0
20075 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=20082, si_uid=30001, si_status=0, si_utime=2, si_stime=3} ---
20075 read(12, "", 4096)                = 0
20075 write(2, "building of \342\200\230/nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv\342\200\231: got EOF\n", 93) = 93
20075 write(2, "building of \342\200\230/nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv\342\200\231: woken up\n", 94) = 94
20075 write(2, "building of \342\200\230/nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv\342\200\231: build done\n", 96) = 96
20075 write(2, "killing process 20082\n", 22) = 22
20075 kill(-20082, SIGKILL)             = 0
20075 wait4(20082, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 20082
20075 write(2, "builder process for \342\200\230/nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv\342\200\231 finished\n", 101) = 101
20075 close(12)                         = 0
20075 write(11, "BZh9\27rE8P\220\0\0\0\0", 14) = 14
20075 close(11)                         = 0
20075 munmap(0x7f53216c6000, 3600384)   = 0
20075 munmap(0x7f5321357000, 3600384)   = 0
20075 munmap(0x7f5327092000, 266240)    = 0
20075 write(2, "killing all processes running under uid \342\200\23030001\342\200\231\n", 52) = 52
20075 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f53271486d0) = 20115
20075 wait4(20115,  <unfinished ...>
20115 set_robust_list(0x7f53271486e0, 24) = 0
20115 ioctl(2, TCGETS, {c_iflags=0x4500, c_oflags=0x5, c_cflags=0xbf, c_lflags=0x8a3b, c_line=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0
20115 prctl(PR_SET_PDEATHSIG, SIGKILL)  = 0
20115 setuid(30001)                     = 0
20115 kill(-1, SIGKILL)                 = 0
20115 exit_group(0)                     = ?
20115 +++ exited with 0 +++

Failure:

19820 prctl(PR_SET_NAME, "touch")       = 0
19820 prctl(PR_SET_KEEPCAPS, 140737488348680) = -1 EINVAL (Invalid argument)
19820 brk(NULL)                         = 0x780000
19820 brk(0x7a1000)                     = 0x7a1000
19820 open("/nix/store/649fx18rd5kqz83mmvb4ba46a2dpzzr8-xkb-layouts-exist", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
19820 dup2(3, 0)                        = 0
19820 close(3)                          = 0
19820 utimensat(0, NULL, NULL, 0)       = 0
19820 close(0)                          = 0
19820 close(1)                          = 0
19820 close(2 <unfinished ...>
19813 <... select resumed> )            = 1 (in [12])
19813 clock_gettime(CLOCK_MONOTONIC, {tv_sec=15721, tv_nsec=329841157}) = 0
19813 read(12, "", 4096)                = 0
19813 write(2, "building of \342\200\230/nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv\342\200\231: got EOF\n", 93) = 93
19813 write(2, "building of \342\200\230/nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv\342\200\231: woken up\n", 94) = 94
19813 write(2, "building of \342\200\230/nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv\342\200\231: build done\n", 96) = 96
19813 write(2, "killing process 19820\n", 22) = 22
19813 kill(-19820, SIGKILL)             = 0
19813 wait4(19820,  <unfinished ...>
19820 <... close resumed>)              = ?
19820 +++ killed by SIGKILL +++
19813 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 19820
19813 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=19820, si_uid=30001, si_status=SIGKILL, si_utime=3, si_stime=3} ---
19813 write(2, "builder process for \342\200\230/nix/store/q41q3pnmbjp0z8q9484vrrsbcvpv8k46-xkb-layouts-exist.drv\342\200\231 finished\n", 101) = 101
19813 close(12)                         = 0
19813 write(11, "BZh9\27rE8P\220\0\0\0\0", 14) = 14
19813 close(11)                         = 0
19813 munmap(0x7f6cb0272000, 3600384)   = 0
19813 munmap(0x7f6cabc91000, 3600384)   = 0
19813 munmap(0x7f6cb5c3e000, 266240)    = 0
19813 write(2, "killing all processes running under uid \342\200\23030001\342\200\231\n", 52) = 52
19813 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f6cb5cf46d0) = 19853
19813 wait4(19853,  <unfinished ...>
19853 set_robust_list(0x7f6cb5cf46e0, 24) = 0
19853 ioctl(2, TCGETS, {c_iflags=0x4500, c_oflags=0x5, c_cflags=0xbf, c_lflags=0x8a3b, c_line=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0
19853 prctl(PR_SET_PDEATHSIG, SIGKILL)  = 0
19853 setuid(30001)                     = 0
19853 kill(-1, SIGKILL)                 = 0
19853 exit_group(0)                     = ?
19853 +++ exited with 0 +++

EDIT above is a build without sandboxing. I verified sandboxing fails in the same way but didn't capture strace output for that.

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pbogdan: Oh nice! So it seems Nix only cares about FD 1 and because it is closed before FD 2, it kills the process.

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but it doesn't seem to have to do with this, because stderr is dup()ed to stderr. I'll write up a Nix issue.

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vcunat
Copy link
Member Author

@vcunat vcunat commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So touch presumably closes stdout immediately because it doesn't need it?

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vcunat: Not immediately, but on exit.

@aszlig
Copy link
Member

@aszlig aszlig commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vcunat
Copy link
Member Author

@vcunat vcunat commented on 2014db3 Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I probably don't want to follow that fd juggling with fd_reopen, etc. It's enough here to know that it's closed "prematurely".

Please sign in to comment.