New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nixos/tests/cloud-init: fix the erroring out #105469
Conversation
cc @nlewo as maintainer |
@GrahamcOfBorg test cloud-init |
@kisik21 the test is actually flaky. |
I think I've noticed it hanging once before, but disregarded that. Thought it's a minor glitch. What do we usually do with flaky tests? How could we stabilize them? |
We would need to understand why it is flaky and then fix it. I didn't really look at the failure, but it could be because the cloud-init process is still initializing while the test is already running. |
@nlewo I think I caught it! It hangs on the line: |
Ok I grokked how to debug these with the interactive runners |
90ae8fd
to
6af088e
Compare
@nlewo apparently it's not flaky anymore, after I slapped a time.sleep in a place that felt right after some inspection. Tested at least 10 times in a row. Maybe around a 100 times, but my terminal window magically closed, trashing all the test results with it - but I saw it happening. |
Ok, that's a nice starting point: now, we know we have to wait for a service! Do you know which one ( |
Maybe we should wait for multi-user.target instead of sshd? I'll try that once I checkout that branch again locally :3 |
Initial testing results are promising, it doesn't fail. |
6af088e
to
071993f
Compare
Hundred iterations and it didn't fail once. I think we're clear |
@GrahamcOfBorg test cloud-init |
@kisik21 I actually think waiting for Could you try on your machine as well? @GrahamcOfBorg test cloud-init |
I launched it. Will take a while. |
a8317d6
to
b5d713b
Compare
@nlewo It works, no timeouts on my machine. I squashed the commits and credited you as a co-author. |
nixos/tests/cloud-init.nix
Outdated
unnamed.wait_for_unit("cloud-init.service") | ||
|
||
# To wait until cloud-init terminates its run | ||
unnamed.wait_until_succeeds("ls /run/cloud-init/result.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit hacky. I'd really prefer if we could wait for the the service that's supposed to be creating this. Is this part of cloud-init.service
(On which we already wait in L54), or what other unit handles the write_files
part?
Maybe we just need to wait on another one of the services? cloud-final.service
maybe, or just the entire cloud-config.target
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could work, but I'm not a cloud-init pro. I just wanted to fix a crashing test script and maaaaybe learn a bit of cloud-init?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want, you can take a look at nixos/modules/services/system/cloud-init.nix
, and how we invoke with various parameters in different units, then see where the write_files
modules is handled inside cloud-init.
It's probably easier to just wait on cloud-config.target
, which should include all of them (I think)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like cloud-config.target includes write_files. I'm launching 200 iterations of the unit test with timeout of 120 seconds (it usually completes in 70 seconds on my machine) to test it. I guess I'll report results the next morning!
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no, it looks like the test became more flaky. I guess I'll have to order after cloud-final - it includes things not included in cloud-config.target!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, after ordering the test after cloud-final.service
it looks like everything's fine. I'll leave it to run overnight to see if there are any failures. If they are, I shall be greeted with a huge error on my console.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, would be better to be able to wait on an unit.
Oh, and i actually learnt that for oneshot services, systemd waits for the process termination before continuing! That could explain why it was working when you were waiting |
It also works when waiting for |
The test was broken for a 1.5 months apparently? Well, now it passes. Also apparently it's not flaky anymore.
b5d713b
to
bcc196c
Compare
This is fine :-D Thanks a lot for fixing it! |
Motivation for this change
The test was broken for a 1.5 months apparently? Well, now it passes. I wonder how this happened...
Things done
sandbox
innix.conf
on non-NixOS linux)nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
./result/bin/
)nix path-info -S
before and after)