NixOS EC2 AMI: Support IMDSv2 #104193
Conversation
These two APIs have diverged over time and are no longer compatible.
Nice! Thank you for digging into this 👍
When you are done can you ensure that the commit messages have an equal amount of information as the PR description has right now? This will make it much easier for someone to debug this in the future as most editors support showing commit messages inline.
Force-pushed from 6badc53 to b1c1087
Done, @andir. I've tested this with real AWS machines and it works correctly, fixing the issue that prevented our AMI from working with IMDSv2.
I didn't run this, but the solution looks really straightforward. Given that this is for AWS instances and not for my kettle's IoT interface, the additional 30 MB does not seem too bad. If, in fact, this becomes an issue, we could still use a slimmer version of curl or revisit our options here.
There are a few comments that I'd like to see clarified before I can "approve" this.
Oh, I see my review was requested. Is the goal to have this in the stable distro?
Yeah, I want to backport it so NixOS-on-AWS users can improve the security of their instances.
Got a +1 from @andir on IRC.
FYI @endgame and @jonringer, this implements #96593 a bit differently, I think.
I provided Amine a copy of the AMI to test with NixOps, so let's merge once that is ready.
AWS's metadata service has two versions. Version 1 allowed plain HTTP requests to get metadata. However, this was frequently abused when a user could trick an AWS-hosted server into proxying requests to the metadata service. Since the metadata service is frequently used to generate AWS access keys, this is pretty gnarly. Version 2 is identical except it requires the caller to request a token and provide it on each request.

Today, starting a NixOS AMI in EC2 where the metadata service is configured to only allow v2 requests fails: the user's SSH key is not placed, and the configuration provided via user-data is not applied. The server is useless. This patch addresses that.

Note that the dependency on curl is not a joyful one, and it expands the initrd by 30 MB. However, see the added comment for more information about why this is needed. Note that the idea of using `echo` and `nc` is laughable. Don't do that.
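The token flow described above can be sketched in shell. This is a hedged illustration, not the patch itself: the endpoint and header names follow AWS's published IMDSv2 interface, and 169.254.169.254 only answers from inside an EC2 instance, so the functions are shown but not called here.

```shell
# Hedged sketch of the IMDSv2 "token dance". The link-local address only
# resolves inside EC2, so this is illustration rather than a runnable probe.
imds_base="http://169.254.169.254"

# Step 1: a PUT to the token endpoint, with a mandatory TTL header,
# returns a short-lived session token.
get_imds_token() {
  curl --silent --fail --request PUT \
    --header "X-aws-ec2-metadata-token-ttl-seconds: 60" \
    "$imds_base/latest/api/token"
}

# Step 2: every subsequent metadata GET must present that token.
# $1 = token, $2 = metadata path (e.g. "public-keys/0/openssh-key").
get_metadata() {
  curl --silent --fail \
    --header "X-aws-ec2-metadata-token: $1" \
    "$imds_base/latest/meta-data/$2"
}
```

Without the token header, an IMDSv2-only instance answers the GET with 401 Unauthorized, which is exactly why the v1-style fetcher fails.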
Force-pushed from e8551fd to ca95acf
try=1
while [ $try -le 10 ]; do
  echo "(attempt $try/10) validating the EC2 instance metadata service v2 token..."
  preflight_imds_token && break
  try=$((try + 1))
  sleep 1
done
I added a preflight check of the IMDS token, as apparently sometimes the instance metadata service takes a second to figure itself out.
I hope we do not need more of these fallbacks… This will soon grow beyond the scope where a shell script seems like the right tool to me.
The commit message indicates it can take minutes for the metadata service to respond properly, yet we are only waiting up to 30 s for a response. Is that a trade-off between slow boot and unusable machines?
I wonder if the metadata service might also fail to hand out the "current" token during startup. Since we are already retrieving the instance-id, shouldn't we validate something?
Ok, we talked on IRC and it is moments, not minutes. I guess 30 s is fine then?!
"minutes" was a mistake on my part. I updated it to say "moments". I too hope we don't need more, it is already growing to the limits of comfort. I think once we get evidence of a working token, we're good to go, but I'm happy to be shown wrong.
@andir I'm not sure there's a "current" token:
https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/ under "what's new in IMDSv2":
The token is never stored by IMDSv2 and can never be retrieved by subsequent calls, so a session and its token are effectively destroyed when the process using the token terminates. There’s no limit on the number of requests within a single session, and there’s no limit on the number of IMDSv2 sessions.
So I guess it generates some cryptographic doohickey each time you PUT?
According to Freenode's ##AWS, the metadata server can sometimes take a few moments to get its shoes on, and the very first boot of a machine can see failed requests for a few moments.
Force-pushed from ca95acf to 0d87ce6
preflight_imds_token() {
  # retry-delay of 1 selected to give the system a second to get going,
  # but not add a lot to the bootup time
  ${curl}/bin/curl \
At this point we might consider setting PATH=${curl}/bin:$PATH or something. This will save us ~100 bytes or so ;-)
LGTM! Thanks @grahamc
I appreciate the ping. There are a couple of small things that I think need to change. Fixing IMDSv2 on NixOS was actually a yakshave: I need the public IPv4 address from IMDS for something else I'm working on, and tried to fix IMDS while I was there.
Also, the metadata dir should probably be cleared on each boot, because otherwise it can become outdated w.r.t. actions people can take in the management console.
I think there's enough difference in #96593 that it's probably worth @grahamc having a look and seeing if there's anything he wants to lift: I constructed the PUT request using busybox nc, so I don't pull additional packages into the initramfs, and the fetcher is parameterised so the EC2/OpenStack common parts have been DRYed up. Some of the comments might also be worth pulling across. I don't have a preflight check on the IMDS, though.
Sidenote: I must confess that I'm a little frustrated by what feels like a process failure here: I'd been trying to get help testing my PR for months, and there's so much to do on NixOS that duplicating work isn't great. But at the end of the day, progress is excellent and I'm glad to see this about to land 👍 🎉
if ! [ -e "$metaDir/ami-manifest-path" ]; then
    wget ${wgetExtraOptions} -O "$metaDir/ami-manifest-path" http://169.254.169.254/1.0/meta-data/ami-manifest-path
fi

if ! [ -e "$metaDir/user-data" ]; then
    wget ${wgetExtraOptions} -O "$metaDir/user-data" http://169.254.169.254/1.0/user-data && chmod 600 "$metaDir/user-data"
fi

if ! [ -e "$metaDir/hostname" ]; then
    wget ${wgetExtraOptions} -O "$metaDir/hostname" http://169.254.169.254/1.0/meta-data/hostname
fi

if ! [ -e "$metaDir/public-keys-0-openssh-key" ]; then
    wget ${wgetExtraOptions} -O "$metaDir/public-keys-0-openssh-key" http://169.254.169.254/1.0/meta-data/public-keys/0/openssh-key
Can you please start using http://169.254.169.254/2009-04-04 and download the public IPv4 address? This is a version that OpenStack and EC2 both support (I checked the docs), and it supports the http://169.254.169.254/2009-04-04/public-ipv4 endpoint, which I need for some other stuff.
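A minimal sketch of what fetching that value could look like, in the style of the existing wget-based fetcher. The comment quotes /2009-04-04/public-ipv4; EC2's documented layout nests such values under meta-data/, so the exact path is worth verifying against both EC2 and OpenStack, and the call only resolves from inside an instance:

```shell
# Hedged sketch: fetch the public IPv4 address from the versioned metadata
# tree. The path is an assumption based on the EC2 metadata layout
# (meta-data/public-ipv4 under the 2009-04-04 version); verify before use.
# wgetExtraOptions mirrors the option the surrounding script already uses.
fetch_public_ipv4() {
  wget ${wgetExtraOptions:-} -O - "http://169.254.169.254/2009-04-04/meta-data/public-ipv4"
}
```

Pinning a dated version like 2009-04-04 rather than "latest" is what makes the fetcher portable across EC2 and OpenStack, since both freeze that snapshot of the API.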
    wget ${wgetExtraOptions} -O "$metaDir/ami-manifest-path" http://169.254.169.254/1.0/meta-data/ami-manifest-path
fi

if ! [ -e "$metaDir/user-data" ]; then
For both EC2 and OpenStack, I think these calls should not be guarded, and that it is correct to unconditionally replace fetched files. It might even be correct to clear the contents of metaDir on boot and re-fetch everything.
Example: currently, I can stop an EC2 instance in the management console, change its user data, and reboot. Then the updated user data is not applied. I think it's also possible to change image key pairs, and if I change the machine's subnet to start or stop assigning public IP addresses, then those files could appear or disappear.
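The "clear on boot" idea amounts to a few lines. This is a hedged sketch, not the PR's code: $metaDir stands for the cache directory the fetcher writes into, and a demo path is substituted here so the snippet is self-contained.

```shell
# Minimal sketch: wipe the cached metadata directory before fetching, so
# stale user-data, key files, or a vanished public-ipv4 cannot linger after
# console-side changes. $metaDir defaults to a throwaway demo path here.
metaDir="${metaDir:-/tmp/demo-metadata}"
rm -rf "$metaDir"
mkdir -p "$metaDir"
chmod 700 "$metaDir"   # metadata can contain secrets (user-data is chmod 600)
```

With the directory freshly emptied each boot, the per-file existence guards become redundant and every value reflects what the console currently reports.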
{ curl, targetRoot, wgetExtraOptions }:
# Note: be very cautious about dependencies, each dependency grows
# the closure of the initrd. Ideally we would not even require curl,
# but there is no reasonable way to send an HTTP PUT request without
This might not be "reasonable", but you can construct a PUT with busybox nc:

echo -ne 'PUT /latest/api/token HTTP/1.1\r\nHost: 169.254.169.254\r\nX-aws-ec2-metadata-token-ttl-seconds: 60\r\n\r\n' | nc 169.254.169.254 80 | tail -n 1
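For clarity, the raw request that one-liner sends can be built with printf, which makes the HTTP/1.1-mandated CRLF line endings explicit. This is an illustration of the same bytes, not code from either PR:

```shell
# Build the raw IMDSv2 token request by hand. $1 is the token TTL in
# seconds. Piping the output into `nc 169.254.169.254 80` inside EC2
# would return the token in the response body.
build_put_request() {
  printf 'PUT /latest/api/token HTTP/1.1\r\nHost: 169.254.169.254\r\nX-aws-ec2-metadata-token-ttl-seconds: %s\r\n\r\n' "$1"
}
```

The fragility concern is visible here: the nc approach must also parse the response (the `tail -n 1` above) and gets no help with timeouts, retries, or non-200 statuses, all of which curl handles for free.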
I wonder why it needs to be part of the very early boot process, and not fetched as part of a service pre-start or something? We could add it, but I fear growing what is already a fairly gnarly script in a gnarly and unfortunate location.
This sounds like a good idea, but I'm a bit unsure about backporting to stable. It would be a pretty big change in behavior, potentially breaking workflows. Maybe an additional PR which only hits master, plus release notes. What do you think?
I think the HTTP protocol is complicated enough to deserve using curl, and that implementing it with `echo` and `nc` would be too fragile.
Since they are different APIs with different implementations, I don't like the idea of trying to make it "dry" -- they're doing different things with APIs that look similar but evolve differently. I thought it made sense to split out the two, creating a less dynamic script, as opposed to a highly parameterized script in stage-1. I'd rather a bit of duplication here, as opposed to risking subtle bugs from dryness.
Yes. I agree with you.
Thanks for the quick response, @grahamc.
I have a service that needs to know its public IP address, and was setting it up as a NixOS module. So I want that file present on disk so that I can refer to it in
Agree with your suggestion - this should happen later and go into release notes, so we don't blow up someone's cloud deployment by accident. I don't think that's that likely, but likelihood × severity = risk, and it's high enough that it's worth being cautious. I can do a separate PR for this.
OpenStack explicitly advertises their IMDS as EC2-compatible, though I'm not sure whether they have a token service on their IMDS; maybe we'd say that means they've failed in that goal? In any case, maybe add a comment to keep the two fetchers in "reasonable" sync?
re the IP address: would it be possible to define your service like this? :
If you can send that PR shortly after this one, we can merge it soon while we're all on top of it. BTW, I've started #nixos-cloud on Freenode if you're interested in that sort of thing. W.r.t. OpenStack, I've found it really hit or miss whether a deployment has an up-to-date set of components, or has even configured and deployed all the components. The variability there leaves me grayer than I was before ... 😬
Pending the discussion here, a backport PR: #104302
Probably; I hadn't thought about doing it that way. It's a low-pressure side project, so I'll give it a go at some point. It definitely should not hold up this PR. I'll look at clearing out the metadata directory on boot this weekend, hopefully. That's all my concerns addressed, I think. Should we ship this, then?
Thanks, @endgame! Let's go!
fixup breakage from NixOS#104193 (cherry picked from commit 1ef139f)
Motivation for this change
AWS's metadata service has two versions. Version 1 allowed plain HTTP requests to get metadata. However, this was frequently abused when a user could trick an AWS-hosted server into proxying requests to the metadata service. Since the metadata service is frequently used to generate AWS access keys, this is pretty gnarly. Version 2 is identical except it requires the caller to request a token and provide it on each request.
Today, starting a NixOS AMI in EC2 where the metadata service is configured to only allow v2 requests fails: the user's SSH key is not placed, and the configuration provided via user-data is not applied. The server is useless. This patch addresses that.
OpenStack "borrowed" this metadata fetcher when it happened to be the same API, but this new version is different enough, imo, to warrant undoing this borrow.
This PR is not yet tested in an actual deployment, and I'll mark it as ready once I have. It should be backported to stable and new AMIs should be built.

Things done

- Tested using sandboxing (the sandbox option in nix.conf, or nix-sandbox on non-NixOS Linux)
- nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
- Tested execution of all binary files (usually in ./result/bin/)
- Determined the impact on package closure size (by running nix path-info -S before and after)