cpython: Use optimizations, for a 25% speedup. #84072
Conversation
The ./configure script prints a warning when passed this flag, starting with 3.7:

```
configure: WARNING: unrecognized options: --with-threads
```

The reason is that there's no longer such a thing as a build without threads. Eliminate the warning by only passing the flag on the older releases that accept it.

Upstream change and discussion:
python/cpython@a6a4dc816
https://bugs.python.org/issue31370
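The version gate described above can be sketched in shell. This is a hedged illustration of the logic only -- the real change lives in the Nix expression, and `flags_for` is an invented helper name, not part of nixpkgs:

```sh
# Hedged sketch: pass --with-threads only to CPython releases older than 3.7,
# since newer ./configure scripts warn that the option is unrecognized.
# "flags_for" is an invented name, not a function in the actual expression.
flags_for() {
  case "$1" in
    2.*|3.[0-6]|3.[0-6].*) echo "--with-threads" ;;
    *)                     echo "" ;;
  esac
}

flags_for 2.7.18   # prints --with-threads
flags_for 3.7.7    # prints an empty line
```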
@GrahamcOfBorg build python2

How much time does it take to build python with and without?

@jonringer cpython 3.7 is used for bootstrapping, ofborg won't like that 😛
On my desktop, building takes measurably longer. That's with significant parallelism available, and so the majority of the additional wall-clock time is for running the profiling task, which runs without parallelism; that step contributed 133s (2m13s).
Earlier PR #43442.

Ah, yeah -- I saw that one, and should have linked to it here. That PR was to turn on one of these flags. It looks like there weren't any benchmarks done in that thread; I think the benchmark results here make a strong case that these optimizations are valuable. The impact on build time was also much steeper in that version, because the upstream work to develop the new short profile task hadn't yet happened then.
Is the memory consumption for this build still reasonable?
What is "reasonable"? I wouldn't have noticed unless it had been more than a few GB; you'd need to re-measure for an accurate figure (adding the …)
I would say if it is still buildable on a machine with 4 GB it should be fine.
I tried adding LTO, and the result surprised me! That's what I found the other day on an early draft of this PR, too: a Python built with PGO + LTO runs a bit slower than one with PGO alone, which is why I left LTO out of this PR. But I measured it again with a fresh build just now, and got the same surprising result. Detailed table:
So 7/60 benchmarks get faster; 21/60 no change; 32/60 get slower. The median is 1% slower, and the quartiles 3% slower to 0%. The extremes are 9% slower and 4% faster. OTOH without PGO, starting from a baseline of the expression in nixpkgs master, LTO is quite helpful:
Median 4% faster; quartiles 3% to 7% faster; extremes 1% slower to 17% faster (ignoring one benchmark reading >4x slower, where the timing clearly hiccupped). So if you were comparing against a build without PGO, LTO looks like a clear win; if you were comparing against a PGO build like this PR's, it's a small net loss.
@gnprice: Many thanks for the detailed (!) response! Let's just go with this PR as it stands, without LTO. I'd love to post more detailed data, but unfortunately it will take me a while to access the machine where I ran the benchmarks, due to lockdown-related silliness.
> Call the linker with -O1 -Bsymbolic-functions.
>
> Based on debian/patches/link-opt.diff in the Debian package `python3`.
Can we get a link to salsa and list the changes from the Debian patch?
Sure. Here's the patch on salsa (as of latest master):
https://salsa.debian.org/cpython-team/python3/-/blob/02735af38/debian/patches/link-opt.diff
This patch has exactly the same content -- the only change is in the surrounding context.
Similarly, within Debian, the patch's content has been unchanged (the only changes to the patch file have been to update the context and line numbers) since LDCXXSHARED was introduced to the upstream version of this code in Python 3.2 (and 2.7). Here's that patch as it appeared for python3.2 in Debian 7 Wheezy, released in 2013:
https://sources.debian.org/src/python3.2/3.2.3-7/debian/patches/link-opt.diff/
And here it is from the python2.4 package in Debian 5 Lenny, released in 2009:
https://sources.debian.org/src/python2.4/2.4.6-1+lenny1/debian/patches/link-opt.dpatch/
That one already has the same content except for the absence of LDCXXSHARED.
I'll add the Salsa link into the patch files, and clarify "based on" to "has the same content".
(Oh, and the Debian versions of the patch apply to `configure.ac`, or on older CPython versions `configure.in`, because Debian makes a point of regenerating `configure` from those.)
Regenerating `configure` is not a bad idea either, just add the hook.
Oh neat -- searching a bit, I guess that'd be `autoreconfHook`? And we could enable it by just adding it to `nativeBuildInputs`, IIUC. Cool.
Happy to add a commit making that change too. I won't do that tonight, only because testing properly causes rebuilds of GCC etc. that take over an hour, and it's too late in the evening here for another round of that.
If the patches from Debian apply, it is better to fetch them using `fetchpatch`.
OK, can do.
Any preference on fetching from Salsa vs. sources.debian.org? (The latter being a record of actual releases.)
I think I can find all 4 of the needed versions of the patch (the context keeps changing slightly) on Salsa, and 3 of the 4 on sources -- Debian skipped Python 3.6.
Salsa is better since it is a git repository – we can pin it to a commit and it will always be there. Sources might remove older debian releases after they are no longer supported.
Mmm, that's a good consideration. On that score, I think actually sources is more dependable -- it isn't the live repository for active releases, rather is meant as a permanent archive for studying long-term history. It has all Debian releases going back to 2.0 "hamm", released in 1998:
https://sources.debian.org/stats/
These patches are probably pretty stable in their URLs on Salsa, too. But they could go away -- the package's maintainer, or a future new maintainer, could move its development to some other repo and remove this one. If it's simply moved within Salsa there'd probably be a redirect, but they could also move it to some other Git hosting service or even something else. Anything the particular maintainer finds convenient for ongoing development would be seen as 100% OK from a Debian perspective.
IIRC, sources only carry the latest revision they have for a given Debian release:
https://sources.debian.org/src/python3.2/
Maybe there is some archive site but I do not see revision `-6` in the above link.
Salsa should be stable enough, we can always deal with breakage like we did when they switched from Alioth to Salsa. And Hydra caches the sources so even if they switch, users will still get it from there.
> IIRC, sources only carry the latest revision they have for a given Debian release:

Hmm, true. I'd missed that point, thanks. That means nothing will change for any release that's no longer supported, but it can for a release that is (or is still in development).

> https://sources.debian.org/src/python3.2/
>
> Maybe there is some archive site but I do not see revision `-6` in the above link.
For an even sharper example:
https://sources.debian.org/src/python3.5/
The version available is 3.5.3-1+deb9u1, from Debian 9 "stretch". The "+debXuY" suffix means that was an update after the release was stable, which means the previous version 3.5.3-1 was the one in the original stable release of "stretch". But that original version doesn't appear on sources. If there's another update to that package in "stretch" in its remaining months before EOL, then presumably the +deb9u1 version will disappear too.
There is a comprehensive archive site which has all versions:
http://snapshot.debian.org/package/python3.2/
But that one only provides whole package files, not individual files from within the package, so it's not suitable for fetchpatch.
I'll use Salsa. Thanks for the discussion!
> And Hydra caches the sources so even if they switch, users will still get it from there.
Oh, good to know. That's quite reassuring for the impact of temporary outages, too.
Without this flag, the configure script prints a warning at the end, like this (reformatted):

```
If you want a release build with all stable optimizations active (PGO, etc),
please run ./configure --enable-optimizations
```

We're doing a build to distribute to people for day-to-day use, doing things other than developing the Python interpreter. So that's certainly a release build -- we're the target audience for this recommendation.

---

And, trying it out, upstream isn't kidding! I ran the standard benchmark suite that the CPython developers use for performance work, "pyperformance". Following its usage instructions:
https://pyperformance.readthedocs.io/usage.html
I ran the whole suite, like so:

```
$ nix-shell -p ./result."$variant" --run '
    cd $(mktemp -d); python -m venv venv; . venv/bin/activate
    pip install pyperformance
    pyperformance run -o ~/tmp/result.'"$variant"'.json
  '
```

and then examined the results with commands like:

```
$ python -m pyperf compare_to --table -G \
    ~/tmp/result.{$before,$after}.json
```

Across all the benchmarks in the suite, the median speedup was 16%. (Meaning 1.16x faster; 14% less time.) The middle half of them ranged from a 13% to a 22% speedup. Each of the 60 benchmarks in the suite got faster, by speedups ranging from 3% to 53%.

---

One reason this isn't just the default to begin with is that, until recently, it made the build a lot slower. What it does is turn on profile-guided optimization, which means: first build for profiling, then run some task to get a profile, then build again using the profile. And, short of further customization, the task it would use would be nearly the full test suite, which includes a lot of expensive and slow tests, and can easily take half an hour to run.

Happily, in 2019 an upstream developer did the work to carefully select a more appropriate set of tests to use for the profile:
python/cpython@4e16a4a31
https://bugs.python.org/issue36044
This suite takes just 2 minutes to run. And the resulting final build is actually slightly faster than with the much longer suite, at least as measured by those standard "pyperformance" benchmarks.

That work went into the 3.8 release, but the same list works great if used on older releases too. So, start passing that --enable-optimizations flag; and backport that good-for-PGO set of tests, so that we use it on all releases.
Sorry about this bit -- I accidentally rebased the branch onto current master. Undid that, but it seems the code-owners review requests persist.

Thanks @FRidh and @jtojnar for the reviews! Pushed a version that makes all the suggested changes from the last round of review. Please take a look. Tested this new version of this PR on all Python versions -- attributes python39, python38, python37, python36, python35, python27 -- and the build logs show the intended changes are present. I haven't re-run benchmarks, because the changes from the previous version of this PR are only in how the patches are organized and applied, and the resulting build is unchanged.
Diff LGTM. Tested by running `nixpkgs-review pr 84072 -p python37`, and got the following error (seems to be some artifact of my local PC, hence the approval):
```
/build/ccy2ZovP.s: Assembler messages:
/build/ccy2ZovP.s: Fatal error: can't write 62 bytes to section .text of tree-ssa-phiopt.o: 'No space left on device'
/build/ccy2ZovP.s: Fatal error: can't close tree-ssa-phiopt.o: No space left on device
make[3]: *** [Makefile:1116: tree-ssa-phiopt.o] Error 1
make[3]: *** Waiting for unfinished jobs....
../../gcc-9.3.0/gcc/tree-ssa-math-opts.c:3865:1: fatal error: error writing to /build/ccmoyYZ3.s: No space left on device
 3865 | }
      | ^
compilation terminated.
make[3]: *** [Makefile:1116: tree-ssa-math-opts.o] Error 1
../../gcc-9.3.0/gcc/tree-ssa-reassoc.c:6355:1: fatal error: error writing to /build/ccxL3Bnc.s: No space left on device
 6355 | }
      | ^
compilation terminated.
make[3]: *** [Makefile:1116: tree-ssa-reassoc.o] Error 1
rm gcc.pod
make[3]: Leaving directory '/build/build/gcc'
make[2]: *** [Makefile:4830: all-stageprofile-gcc] Error 2
make[2]: Leaving directory '/build/build'
make[1]: *** [Makefile:22762: stageprofile-bubble] Error 2
make[1]: Leaving directory '/build/build'
make: *** [Makefile:23016: profiledbootstrap] Error 2
builder for '/nix/store/z272zdw8hqcii4bl8l07vzd60qf6y87y-gcc-9.3.0.drv' failed with exit code 2
```
Thanks! Yeah, I find that the rebuild of GCC requires something over 5 GB of free disk space. The bulk of that gets freed as soon as that build is done, it doesn't accumulate; but between that and the stuff that does accumulate, I had to spend some time freeing up space elsewhere on my disk when I started working on this PR.

Yeah, I have ~500 GB free, so no idea what's causing this. Maybe a swap partition or something. Oh well. Not really relevant.
@drewrisinger Perhaps you have …

I was building on Ubuntu 18.04 LTS, not NixOS. That could cause some differences.
In particular this will let us use patches that apply to configure.ac.
I took a close look at how Debian builds the Python interpreter, because I noticed it ran substantially faster than the one in nixpkgs and I was curious why.

One thing that I found made a material difference in performance was this pair of linker flags (passed to the compiler):

```
-Wl,-O1 -Wl,-Bsymbolic-functions
```

In other words, effectively the linker gets passed the flags:

```
-O1 -Bsymbolic-functions
```

Doing the same thing in nixpkgs turns out to make the interpreter run about 6% faster, which is quite a big win for such an easy change. So, let's apply it.

---

I had not known there was a `-O1` flag for the *linker*! But indeed there is.

These flags are unrelated to "link-time optimization" (LTO), despite the latter's name. LTO means doing classic compiler optimizations on the actual code, at the linking step when it becomes possible to do them with cross-object-file information. These two flags, by contrast, cause the linker to make certain optimizations within the scope of its job as the linker.

Documentation is here, though sparse:
https://sourceware.org/binutils/docs-2.31/ld/Options.html

The meaning of -O1 was explained in more detail in this LWN article:
https://lwn.net/Articles/192624/
Apparently it makes the resulting symbol table use a bigger hash table, so the load factor is smaller and lookups are faster. Cool.

As for -Bsymbolic-functions, the documentation indicates that it's a way of saving lookups through the symbol table entirely. There can apparently be situations where it changes the behavior of a program, specifically if the program relies on linker tricks to provide customization features:
https://bugs.launchpad.net/ubuntu/+source/xfe/+bug/644645
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637184#35
But I'm pretty sure CPython doesn't permit that kind of trick: you don't load a shared object that tries to redefine some symbol found in the interpreter core.

The stronger reason I'm confident using -Bsymbolic-functions is safe, though, is empirical. Both Debian and Ubuntu have been shipping a Python built this way since forever -- it was introduced for Python 2.4 and 2.5 in Ubuntu "hardy" and Debian "lenny", released in 2008 and 2009. In those 12 years they haven't seen a need to drop this flag; and I've been unable to locate any reports of trouble related to it, either on the Web in general or on the Debian bug tracker. (There are reports of a handful of other programs breaking with it, but not Python/CPython.) So that seems like about as thorough testing as one could hope for.

---

As for the performance impact: I ran CPython upstream's preferred benchmark suite, "pyperformance", in the same way as described in the previous commit. On top of that commit's change, the results across the 60 benchmarks in the suite are:

The median is 6% faster. The middle half (aka interquartile range) is from 4% to 8% faster. Out of 60 benchmarks, 3 come out slower, by 1-4%. At the other end, 5 are at least 10% faster, and one is 17% faster.

So, that's quite a material speedup! I don't know how big the effect of these flags is for other software; but certainly CPython tends to do plenty of dynamic linking, as that's how it loads extension modules, which are ubiquitous in the stdlib as well as popular third-party libraries. So perhaps that helps explain why optimizing the dynamic linker has such an impact.
Thanks @FRidh and @jtojnar for the further reviews! Just pushed a version which I believe makes all the suggested changes. I've tested this version in the same way as the previous version:
One of the patches (for 3.7) comes via plain
From the discussion in this thread about a similar error:

Thanks. Right,

Can you describe the scenario in which you're concerned that the checksum might change? As @jtojnar observed upthread, we're downloading here from a Git repository. It's not asking the server to generate a Git diff between two files (or trees) -- rather it's downloading the blob that's at a specific path in a specific commit's tree. That blob happens to be a diff, which is of interest to us, but it's not something the server (or the Git repo) knows at all; this isn't a case where the server could decide to tweak the diff format.
Usually I don't bump, but I can't wait to see the effects of this PR 😋.
Thanks @FRidh and everyone for the reviews and helping push this through! I am curious about the trouble. I figure that new thread is the best place to discuss that, so I'll ask there.

PR that, for reproducibility, disables optimizations again: #107965.

This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/why-is-the-nix-compiled-python-slower/18717/13
Motivation for this change
This PR adds two different kinds of flags to enable different optimizations in building the CPython interpreter. One makes the resulting interpreter run about 16% faster, and the other makes it about 6% faster; the combined speedup is 25%.
One of the flags is `./configure --enable-optimizations`. Upstream recommends this by having ./configure print a warning if omitted. We're doing a build to distribute to people for day-to-day use, doing things other than developing the Python interpreter. So that's certainly a release build -- this means us.
And it turns out they aren't kidding! Running upstream's own preferred benchmarking suite, pyperformance, before and after adding this flag finds a 16% speedup, as the median across the 60 benchmarks in the suite.
This one makes the Python build take somewhat longer, because PGO means first building an instrumented version, running that through some task to get a profile, then building again using the profile.
Fortunately in 2019 a good, short profile task was developed: it takes about 2min to run, and produces good results. That appears upstream in 3.8, but it takes just a line to backport the same task (it's just a list of files in the test suite) to each older version, so we do that. That keeps the effect on build time moderate, and well worthwhile for a 16% speedup to users' Python applications.
The two other flags are `-Wl,-O1 -Wl,-Bsymbolic-functions`, passed to the compiler at each linking step. These cause the linker to get the flags `-O1 -Bsymbolic-functions`, and those turn on certain optimizations. (Details in commit message.)

I learned about these flags from a close look at how Debian builds the Python interpreter. Both have been used in every Debian and Ubuntu release since 2008; `-Wl,-O1` has been there even longer. Adding them on top of `--enable-optimizations` gives an additional 6% median speedup.

Benchmark results for the combined changes: out of the 60 benchmarks in the suite, the median speedup is 25% (meaning 1.25x faster, taking 20% less time). The middle half range from 18% to 32% faster. All get at least a little faster; two of them by less than 10%, and two of them by over 50%.
Those results are all for Python 3.7, as that's the default `python3`. Python 3.9, 3.8, and 3.6 see similar, slightly smaller results, with median speedups of 22%, 22%, and 23% respectively. All results are on my Linux desktop, with an Intel Core i7-8700K (Coffee Lake) CPU.

For numbers on each individual benchmark in the suite, see:
https://gist.github.com/gnprice/e218f23d97e19a9dd1a15fe27bc61b5f
Things done

- Tested using sandboxing (`sandbox` in `nix.conf` on non-NixOS linux)
- Tested compilation of all pkgs that depend on this change using `nix-shell -p nixpkgs-review --run "nixpkgs-review wip"`
- Tested execution of all binary files (usually in `./result/bin/`)
- Determined the impact on package closure size (by running `nix path-info -S` before and after)