Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/tests/cloud-init: fix the erroring out #105469

Merged
merged 1 commit into from Dec 3, 2020

Conversation

vikanezrimaya
Copy link
Member

Motivation for this change

The test was broken for a 1.5 months apparently? Well, now it passes. I wonder how this happened...

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@vikanezrimaya
Copy link
Member Author

cc @nlewo as maintainer

@nlewo
Copy link
Member

nlewo commented Nov 30, 2020

@GrahamcOfBorg test cloud-init

@nlewo
Copy link
Member

nlewo commented Nov 30, 2020

@kisik21 the test is actually flaky.
It timeouts on the CI and locally, sometimes it passes and sometimes it timeouts.
Could you try to run it several times on your machine (you can use nix-build --check)?

@vikanezrimaya
Copy link
Member Author

I think I've noticed it hanging once before, but disregarded that. Thought it's a minor glitch.

What do we usually do with flaky tests? How could we stabilize them?

@nlewo
Copy link
Member

nlewo commented Nov 30, 2020

We would need to understand why it is flaky and then fix it.
I would first try to reproduce interactively (nix-build -A the-test.driver). If it is not possible to reproduce interactively, I would then try to put some sleep during the test to see if it solves the issue. If it solves the issue, I would then identify which component needs to be waited and then find a way to know when this component is actually ready.

I didn't really look at the failure, but it could be because the cloud-init process is still initializing while the test is already running.

@vikanezrimaya
Copy link
Member Author

@nlewo I think I caught it! It hangs on the line: systemd[1]: Startup finished in xxxxs (kernel) + yyyys (userspace) = zm z's. I wonder what's happening inside of the VM?

@vikanezrimaya
Copy link
Member Author

Ok I grokked how to debug these with the interactive runners

@vikanezrimaya
Copy link
Member Author

@nlewo apparently it's not flaky anymore, after I slapped a time.sleep in a place that felt right after some inspection. Tested at least 10 times in a row. Maybe around a 100 times, but my terminal window magically closed, trashing all the test results with it - but I saw it happening.

@nlewo
Copy link
Member

nlewo commented Dec 2, 2020

Ok, that's a nice starting point: now, we know we have to wait for a service! Do you know which one (cloud-init or sshd)?
Because we would be much better to avoid a sleep in the test (it consumes unneeded seconds). Instead, we should find something to detect the service is ready.

@vikanezrimaya
Copy link
Member Author

Maybe we should wait for multi-user.target instead of sshd? I'll try that once I checkout that branch again locally :3

@vikanezrimaya
Copy link
Member Author

Initial testing results are promising, it doesn't fail.

@vikanezrimaya
Copy link
Member Author

Hundred iterations and it didn't fail once. I think we're clear

@nlewo
Copy link
Member

nlewo commented Dec 2, 2020

@GrahamcOfBorg test cloud-init

@nlewo
Copy link
Member

nlewo commented Dec 2, 2020

@kisik21 I actually think waiting for multi-user.target works by chance. I think the issue is that cloud-init is writing sshd keys while the ssh connection is creating.
I added a statement ensuring cloud-init terminates its run before continuing.
I run it several time on my laptop and I have not been able to reproduce the issue.

Could you try on your machine as well?

@GrahamcOfBorg test cloud-init

@vikanezrimaya
Copy link
Member Author

I launched it. Will take a while.

@vikanezrimaya
Copy link
Member Author

@nlewo It works, no timeouts on my machine. I squashed the commits and credited you as a co-author.

nixos/tests/cloud-init.nix Outdated Show resolved Hide resolved
unnamed.wait_for_unit("cloud-init.service")

# To wait until cloud-init terminates its run
unnamed.wait_until_succeeds("ls /run/cloud-init/result.json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit hacky. I'd really prefer if we could wait for the the service that's supposed to be creating this. Is this part of cloud-init.service (On which we already wait in L54), or what other unit handles the write_files part?

Maybe we just need to wait on another one of the services? cloud-final.service maybe, or just the entire cloud-config.target?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could work, but I'm not a cloud-init pro. I just wanted to fix a crashing test script and maaaaybe learn a bit of cloud-init?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, you can take a look at nixos/modules/services/system/cloud-init.nix, and how we invoke with various parameters in different units, then see where the write_files modules is handled inside cloud-init.

It's probably easier to just wait on cloud-config.target, which should include all of them (I think)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like cloud-config.target includes write_files. I'm launching 200 iterations of the unit test with timeout of 120 seconds (it usually completes in 70 seconds on my machine) to test it. I guess I'll report results the next morning!

This comment was marked as outdated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, it looks like the test became more flaky. I guess I'll have to order after cloud-final - it includes things not included in cloud-config.target!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after ordering the test after cloud-final.service it looks like everything's fine. I'll leave it to run overnight to see if there are any failures. If they are, I shall be greeted with a huge error on my console.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, would be better to be able to wait on an unit.

@nlewo
Copy link
Member

nlewo commented Dec 3, 2020

Oh, and i actually learnt that for oneshot services, systemd waits for the process termination before continuing! That could explain why it was working when you were waiting multi-user.target!

@vikanezrimaya
Copy link
Member Author

vikanezrimaya commented Dec 3, 2020

It also works when waiting for cloud-final.service. But now I'm confused as to who to credit in Co-Authored-By trailers: you both or one of you? i wanna do the most sensible thing but am confused 😄 We edited the same lines three times, and now the authorship is unclear. Anyway, I'll push for now

The test was broken for a 1.5 months apparently? Well, now it passes.
Also apparently it's not flaky anymore.
@flokli
Copy link
Contributor

flokli commented Dec 3, 2020

This is fine :-D Thanks a lot for fixing it!

@flokli flokli merged commit 84f417d into NixOS:master Dec 3, 2020
@vikanezrimaya vikanezrimaya deleted the fix-cloud-init-test branch December 3, 2020 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants