
cpython: Use optimizations, for a 25% speedup. #84072

Merged (4 commits into NixOS:staging, Jun 4, 2020)

Conversation


@gnprice gnprice commented Apr 2, 2020

Motivation for this change

This PR adds two kinds of flags that enable different optimizations when building the CPython interpreter. One makes the resulting interpreter run about 16% faster, and the other makes it about 6% faster; the combined speedup is 25%.


One of the flags is ./configure --enable-optimizations. Upstream recommends this by having ./configure print a warning if omitted:

If you want a release build with all stable optimizations active
(PGO, etc), please run ./configure --enable-optimizations

We're doing a build to distribute to people for day-to-day use, doing things other than developing the Python interpreter. So that's certainly a release build -- this means us.

And it turns out they aren't kidding! Running upstream's own preferred benchmarking suite, pyperformance, before and after adding this flag finds a 16% speedup, as the median across the 60 benchmarks in the suite.
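
For concreteness, this half of the change can be pictured as a small nixpkgs overlay (a minimal sketch, not the PR's actual diff, which edits the cpython expression itself; the only names assumed are the stock overrideAttrs and configureFlags):

  # overlay.nix -- illustration only
  self: super: {
    python37 = super.python37.overrideAttrs (old: {
      configureFlags = (old.configureFlags or [ ]) ++ [ "--enable-optimizations" ];
    });
  }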

This one makes the Python build take somewhat longer, because PGO means first building an instrumented version, running that through some task to get a profile, then building again using the profile.

Fortunately in 2019 a good, short profile task was developed: it takes about 2min to run, and produces good results. That appears upstream in 3.8, but it takes just a line to backport the same task (it's just a list of files in the test suite) to each older version, so we do that. That keeps the effect on build time moderate, and well worthwhile for a 16% speedup to users' Python applications.
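
To picture the backport: it amounts to overriding the Makefile's PROFILE_TASK variable with the curated test list. A hedged sketch, inside the cpython derivation -- the handful of test names here are an abridged, illustrative selection, not the real upstream list:

  # makeFlagsArray, because the value contains spaces
  preBuild = ''
    makeFlagsArray+=(
      "PROFILE_TASK=-m test.regrtest test_array test_base64 test_binascii test_json"
    )
  '';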


The two other flags are -Wl,-O1 -Wl,-Bsymbolic-functions, passed to the compiler at each linking step. These cause the linker to get the flags -O1 -Bsymbolic-functions, and those turn on certain optimizations. (Details in commit message.)

I learned about these flags from a close look at how Debian builds the Python interpreter. Both have been used in every Debian and Ubuntu release since 2008; -Wl,-O1 has been there even longer. Adding them on top of --enable-optimizations gives an additional 6% median speedup.
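
The PR applies the flags through CPython's own build configuration (via the Debian patch discussed below). Purely as an illustration of what they do, the same flags could be injected through the Nix cc-wrapper, whose NIX_LDFLAGS variable takes raw linker flags (so no -Wl, prefix):

  # extending the overlay sketch above -- not the PR's actual approach
  self: super: {
    python37 = super.python37.overrideAttrs (old: {
      NIX_LDFLAGS = (old.NIX_LDFLAGS or "") + " -O1 -Bsymbolic-functions";
    });
  }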


Benchmark results for the combined changes: out of the 60 benchmarks in the suite, the median speedup is 25% (meaning 1.25x faster, taking 20% less time). The middle half ranges from 18% to 32% faster. All get at least a little faster; two of them by less than 10%, and two of them by over 50%.

Those results are all for Python 3.7, as that's the default python3. Python 3.9, 3.8, and 3.6 see similar, slightly smaller results, with median speedups of 22%, 22%, and 23% respectively. All results are on my Linux desktop, with an Intel Core i7-8700K (Coffee Lake) CPU.

For numbers on each individual benchmark in the suite, see:
https://gist.github.com/gnprice/e218f23d97e19a9dd1a15fe27bc61b5f

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
      • python39
      • python38
      • python37
      • python36
      • python35
      • python27
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

The ./configure script prints a warning when passed this flag,
starting with 3.7:

  configure: WARNING: unrecognized options: --with-threads

The reason is that there's no longer such a thing as a build
without threads.

Eliminate the warning by passing the flag only on the older releases
that accept it.

Upstream change and discussion:
  python/cpython@a6a4dc816
  https://bugs.python.org/issue31370
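
A minimal sketch of that version-conditional flag, assuming `lib` and `version` are in scope in the cpython expression:

  configureFlags = [
    "--enable-optimizations"
  ] ++ lib.optional (lib.versionOlder version "3.7") "--with-threads";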
@jonringer (Contributor):

@GrahamcOfBorg build python2
@GrahamcOfBorg build python3
@GrahamcOfBorg build python35
@GrahamcOfBorg build python36
@GrahamcOfBorg build python38
@GrahamcOfBorg build python37.pkgs.requests
@GrahamcOfBorg build azure-cli


FRidh commented Apr 2, 2020

How much time does it take to build python with and without?


FRidh commented Apr 2, 2020

@jonringer cpython 3.7 is used for bootstrapping, ofborg won't like that 😛


gnprice commented Apr 2, 2020

How much time does it take to build python with and without?

On my desktop, building python3 with these changes takes 295s (4m55s). On master (at commit 1992768), it takes 101s (1m41s).

That's with significant parallelism available, and so the majority of the additional wall-clock time is for running the profiling task, which runs without parallelism; that step contributed 133s (2m13s).


FRidh commented Apr 2, 2020

Earlier PR #43442.


gnprice commented Apr 2, 2020

Earlier PR #43442.

Ah, yeah -- I saw that one, and should have linked to it here. That PR was to turn on one of these flags, --enable-optimizations.

It looks like there weren't any benchmarks done in that thread; I think the benchmark results turn out to make a strong case that these optimizations are valuable.

The impact on build time was also much steeper in that version, because the upstream work to develop the new short profile task hadn't yet happened then.

@dnadlinger (Contributor):

--with-lto might be worth a look as well; in my tests (custom Nix 3.8 build) I saw quite a big improvement (> ~15%) from LTO on some particular project I cared about.


Mic92 commented Apr 2, 2020

--with-lto might be worth a look as well; in my tests (custom Nix 3.8 build) I saw quite a big improvement (> ~15%) from LTO on some particular project I cared about.

Is the memory consumption for this build still reasonable?


dnadlinger commented Apr 2, 2020

Is the memory consumption for this build still reasonable?

What is "reasonable"? I wouldn't have noticed unless it had been more than a few GB; you'd need to re-measure for an accurate figure (adding the --with-lto configure flag is the only change required).
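
For reference, that one-flag change as a self-contained sketch (the file name and the choice of python38 are just for illustration):

  # lto.nix -- build with: nix-build lto.nix
  with import <nixpkgs> { };
  python38.overrideAttrs (old: {
    configureFlags = (old.configureFlags or [ ]) ++ [ "--with-lto" ];
  })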


Mic92 commented Apr 2, 2020

Is the memory consumption for this build still reasonable?

What is "reasonable"? I wouldn't have noticed unless it had been more than a few GB; you'd need to re-measure for an accurate figure (adding the --with-lto configure flag is the only change required).

I would say if it is still buildable on a machine with 4 GB it should be fine.


gnprice commented Apr 3, 2020

--with-lto might be worth a look as well; in my tests (custom Nix 3.8 build) I saw quite a big improvement (> ~15%) from LTO on some particular project I cared about.

I tried adding --with-lto on top of the changes in this PR. On the pyperformance suite, it was mostly a wash, and made things run a bit slower more often than faster.

Which surprised me! That's what I found the other day on an early draft of this PR, too: a Python built with PGO + LTO runs a bit slower than one with PGO alone, which is why I left LTO out of this PR. But I measured it again with a fresh build just now, and got the same surprising result.

Detailed table:

$ python -m pyperf compare_to py37-nix{.pgo-linkopt.a7*,.pgo-linkopt-lto.0a*}.json --table -G | nl -ba
     1	+------------------------+--------------------------------+------------------------------------+
     2	| Benchmark              | py37-nix.pgo-linkopt.a74fa3abc | py37-nix.pgo-linkopt-lto.0ad69d097 |
     3	+========================+================================+====================================+
     4	| chaos                  | 113 ms                         | 109 ms: 1.04x faster (-3%)         |
     5	+------------------------+--------------------------------+------------------------------------+
     6	| pickle_dict            | 27.2 us                        | 26.3 us: 1.03x faster (-3%)        |
     7	+------------------------+--------------------------------+------------------------------------+
     8	| dulwich_log            | 66.4 ms                        | 64.9 ms: 1.02x faster (-2%)        |
     9	+------------------------+--------------------------------+------------------------------------+
    10	| sympy_str              | 373 ms                         | 368 ms: 1.01x faster (-1%)         |
    11	+------------------------+--------------------------------+------------------------------------+
    12	| sympy_integrate        | 28.8 ms                        | 28.5 ms: 1.01x faster (-1%)        |
    13	+------------------------+--------------------------------+------------------------------------+
    14	| scimark_fft            | 319 ms                         | 317 ms: 1.01x faster (-1%)         |
    15	+------------------------+--------------------------------+------------------------------------+
    16	| go                     | 247 ms                         | 246 ms: 1.01x faster (-1%)         |
    17	+------------------------+--------------------------------+------------------------------------+
    18	| unpickle               | 12.6 us                        | 12.7 us: 1.01x slower (+1%)        |
    19	+------------------------+--------------------------------+------------------------------------+
    20	| pyflate                | 640 ms                         | 647 ms: 1.01x slower (+1%)         |
    21	+------------------------+--------------------------------+------------------------------------+
    22	| xml_etree_process      | 81.4 ms                        | 82.6 ms: 1.01x slower (+1%)        |
    23	+------------------------+--------------------------------+------------------------------------+
    24	| sqlalchemy_imperative  | 33.1 ms                        | 33.5 ms: 1.01x slower (+1%)        |
    25	+------------------------+--------------------------------+------------------------------------+
    26	| deltablue              | 7.23 ms                        | 7.33 ms: 1.01x slower (+1%)        |
    27	+------------------------+--------------------------------+------------------------------------+
    28	| genshi_text            | 29.1 ms                        | 29.5 ms: 1.01x slower (+1%)        |
    29	+------------------------+--------------------------------+------------------------------------+
    30	| spectral_norm          | 124 ms                         | 126 ms: 1.02x slower (+2%)         |
    31	+------------------------+--------------------------------+------------------------------------+
    32	| pickle_pure_python     | 455 us                         | 462 us: 1.02x slower (+2%)         |
    33	+------------------------+--------------------------------+------------------------------------+
    34	| richards               | 72.4 ms                        | 73.7 ms: 1.02x slower (+2%)        |
    35	+------------------------+--------------------------------+------------------------------------+
    36	| regex_dna              | 161 ms                         | 165 ms: 1.02x slower (+2%)         |
    37	+------------------------+--------------------------------+------------------------------------+
    38	| regex_compile          | 180 ms                         | 184 ms: 1.02x slower (+2%)         |
    39	+------------------------+--------------------------------+------------------------------------+
    40	| tornado_http           | 154 ms                         | 157 ms: 1.02x slower (+2%)         |
    41	+------------------------+--------------------------------+------------------------------------+
    42	| genshi_xml             | 60.2 ms                        | 61.6 ms: 1.02x slower (+2%)        |
    43	+------------------------+--------------------------------+------------------------------------+
    44	| sqlalchemy_declarative | 159 ms                         | 163 ms: 1.02x slower (+2%)         |
    45	+------------------------+--------------------------------+------------------------------------+
    46	| logging_format         | 10.1 us                        | 10.3 us: 1.02x slower (+2%)        |
    47	+------------------------+--------------------------------+------------------------------------+
    48	| float                  | 113 ms                         | 116 ms: 1.03x slower (+3%)         |
    49	+------------------------+--------------------------------+------------------------------------+
    50	| hexiom                 | 10.0 ms                        | 10.3 ms: 1.03x slower (+3%)        |
    51	+------------------------+--------------------------------+------------------------------------+
    52	| mako                   | 17.1 ms                        | 17.6 ms: 1.03x slower (+3%)        |
    53	+------------------------+--------------------------------+------------------------------------+
    54	| json_loads             | 23.7 us                        | 24.4 us: 1.03x slower (+3%)        |
    55	+------------------------+--------------------------------+------------------------------------+
    56	| python_startup_no_site | 6.76 ms                        | 6.98 ms: 1.03x slower (+3%)        |
    57	+------------------------+--------------------------------+------------------------------------+
    58	| unpickle_list          | 4.01 us                        | 4.15 us: 1.03x slower (+3%)        |
    59	+------------------------+--------------------------------+------------------------------------+
    60	| pickle                 | 8.77 us                        | 9.09 us: 1.04x slower (+4%)        |
    61	+------------------------+--------------------------------+------------------------------------+
    62	| scimark_monte_carlo    | 101 ms                         | 105 ms: 1.04x slower (+4%)         |
    63	+------------------------+--------------------------------+------------------------------------+
    64	| unpickle_pure_python   | 361 us                         | 376 us: 1.04x slower (+4%)         |
    65	+------------------------+--------------------------------+------------------------------------+
    66	| chameleon              | 9.28 ms                        | 9.70 ms: 1.05x slower (+5%)        |
    67	+------------------------+--------------------------------+------------------------------------+
    68	| json_dumps             | 11.4 ms                        | 11.9 ms: 1.05x slower (+5%)        |
    69	+------------------------+--------------------------------+------------------------------------+
    70	| telco                  | 5.50 ms                        | 5.77 ms: 1.05x slower (+5%)        |
    71	+------------------------+--------------------------------+------------------------------------+
    72	| xml_etree_iterparse    | 96.2 ms                        | 101 ms: 1.05x slower (+5%)         |
    73	+------------------------+--------------------------------+------------------------------------+
    74	| fannkuch               | 479 ms                         | 504 ms: 1.05x slower (+5%)         |
    75	+------------------------+--------------------------------+------------------------------------+
    76	| sqlite_synth           | 2.59 us                        | 2.73 us: 1.05x slower (+5%)        |
    77	+------------------------+--------------------------------+------------------------------------+
    78	| xml_etree_generate     | 99.2 ms                        | 105 ms: 1.06x slower (+6%)         |
    79	+------------------------+--------------------------------+------------------------------------+
    80	| xml_etree_parse        | 127 ms                         | 138 ms: 1.09x slower (+9%)         |
    81	+------------------------+--------------------------------+------------------------------------+
    82	
    83	Not significant (21): 2to3; logging_simple; sympy_expand; sympy_sum; crypto_pyaes; regex_v8; django_template; pickle_list; logging_silent; raytrace; nqueens; regex_effbot; scimark_lu; unpack_sequence; pidigits; scimark_sor; meteor_contest; scimark_sparse_mat_mult; python_startup; pathlib; nbody

So 7/60 benchmarks get faster; 21/60 no change; 32/60 get slower. The median is 1% slower, and the quartiles 3% slower to 0%. The extremes are 9% slower and 4% faster.

OTOH without PGO, starting from a baseline of the expression in nixpkgs master, LTO is quite helpful:

$ python -m pyperf compare_to py37-nix{,.lto}.json --table -G | nl -ba
     1	+-------------------------+----------+-------------------------------+
     2	| Benchmark               | py37-nix | py37-nix.lto                  |
     3	+=========================+==========+===============================+
     4	| pickle_dict             | 39.9 us  | 34.2 us: 1.17x faster (-14%)  |
     5	+-------------------------+----------+-------------------------------+
     6	| pickle_list             | 4.97 us  | 4.38 us: 1.14x faster (-12%)  |
     7	+-------------------------+----------+-------------------------------+
     8	| xml_etree_process       | 111 ms   | 101 ms: 1.10x faster (-9%)    |
     9	+-------------------------+----------+-------------------------------+
    10	| scimark_monte_carlo     | 134 ms   | 123 ms: 1.09x faster (-8%)    |
    11	+-------------------------+----------+-------------------------------+
    12	| xml_etree_iterparse     | 114 ms   | 106 ms: 1.08x faster (-7%)    |
    13	+-------------------------+----------+-------------------------------+
    14	| scimark_fft             | 390 ms   | 362 ms: 1.08x faster (-7%)    |
    15	+-------------------------+----------+-------------------------------+
    16	| xml_etree_generate      | 131 ms   | 122 ms: 1.07x faster (-7%)    |
    17	+-------------------------+----------+-------------------------------+
    18	| mako                    | 23.3 ms  | 21.8 ms: 1.07x faster (-7%)   |
    19	+-------------------------+----------+-------------------------------+
    20	| pyflate                 | 832 ms   | 777 ms: 1.07x faster (-7%)    |
    21	+-------------------------+----------+-------------------------------+
    22	| float                   | 133 ms   | 124 ms: 1.07x faster (-6%)    |
    23	+-------------------------+----------+-------------------------------+
    24	| python_startup_no_site  | 7.75 ms  | 7.25 ms: 1.07x faster (-6%)   |
    25	+-------------------------+----------+-------------------------------+
    26	| sqlite_synth            | 3.33 us  | 3.12 us: 1.07x faster (-6%)   |
    27	+-------------------------+----------+-------------------------------+
    28	| richards                | 101 ms   | 94.3 ms: 1.07x faster (-6%)   |
    29	+-------------------------+----------+-------------------------------+
    30	| json_dumps              | 14.4 ms  | 13.5 ms: 1.07x faster (-6%)   |
    31	+-------------------------+----------+-------------------------------+
    32	| sympy_expand            | 734 ms   | 688 ms: 1.07x faster (-6%)    |
    33	+-------------------------+----------+-------------------------------+
    34	| pickle                  | 13.2 us  | 12.4 us: 1.07x faster (-6%)   |
    35	+-------------------------+----------+-------------------------------+
    36	| unpickle_list           | 4.25 us  | 3.99 us: 1.07x faster (-6%)   |
    37	+-------------------------+----------+-------------------------------+
    38	| go                      | 337 ms   | 317 ms: 1.07x faster (-6%)    |
    39	+-------------------------+----------+-------------------------------+
    40	| nbody                   | 138 ms   | 130 ms: 1.06x faster (-6%)    |
    41	+-------------------------+----------+-------------------------------+
    42	| sympy_integrate         | 35.2 ms  | 33.2 ms: 1.06x faster (-6%)   |
    43	+-------------------------+----------+-------------------------------+
    44	| scimark_lu              | 227 ms   | 215 ms: 1.06x faster (-5%)    |
    45	+-------------------------+----------+-------------------------------+
    46	| scimark_sor             | 243 ms   | 230 ms: 1.06x faster (-5%)    |
    47	+-------------------------+----------+-------------------------------+
    48	| unpickle                | 16.0 us  | 15.2 us: 1.06x faster (-5%)   |
    49	+-------------------------+----------+-------------------------------+
    50	| xml_etree_parse         | 146 ms   | 139 ms: 1.05x faster (-5%)    |
    51	+-------------------------+----------+-------------------------------+
    52	| pidigits                | 173 ms   | 164 ms: 1.05x faster (-5%)    |
    53	+-------------------------+----------+-------------------------------+
    54	| python_startup          | 10.3 ms  | 9.81 ms: 1.05x faster (-5%)   |
    55	+-------------------------+----------+-------------------------------+
    56	| raytrace                | 678 ms   | 645 ms: 1.05x faster (-5%)    |
    57	+-------------------------+----------+-------------------------------+
    58	| sqlalchemy_declarative  | 188 ms   | 179 ms: 1.05x faster (-5%)    |
    59	+-------------------------+----------+-------------------------------+
    60	| deltablue               | 10.1 ms  | 9.65 ms: 1.05x faster (-5%)   |
    61	+-------------------------+----------+-------------------------------+
    62	| regex_compile           | 225 ms   | 215 ms: 1.04x faster (-4%)    |
    63	+-------------------------+----------+-------------------------------+
    64	| nqueens                 | 117 ms   | 112 ms: 1.04x faster (-4%)    |
    65	+-------------------------+----------+-------------------------------+
    66	| unpickle_pure_python    | 480 us   | 460 us: 1.04x faster (-4%)    |
    67	+-------------------------+----------+-------------------------------+
    68	| meteor_contest          | 113 ms   | 109 ms: 1.04x faster (-4%)    |
    69	+-------------------------+----------+-------------------------------+
    70	| unpack_sequence         | 47.8 ns  | 46.0 ns: 1.04x faster (-4%)   |
    71	+-------------------------+----------+-------------------------------+
    72	| 2to3                    | 383 ms   | 369 ms: 1.04x faster (-4%)    |
    73	+-------------------------+----------+-------------------------------+
    74	| logging_format          | 13.7 us  | 13.2 us: 1.04x faster (-4%)   |
    75	+-------------------------+----------+-------------------------------+
    76	| pickle_pure_python      | 600 us   | 579 us: 1.04x faster (-4%)    |
    77	+-------------------------+----------+-------------------------------+
    78	| sympy_sum               | 282 ms   | 272 ms: 1.04x faster (-3%)    |
    79	+-------------------------+----------+-------------------------------+
    80	| crypto_pyaes            | 130 ms   | 126 ms: 1.04x faster (-3%)    |
    81	+-------------------------+----------+-------------------------------+
    82	| logging_silent          | 232 ns   | 224 ns: 1.04x faster (-3%)    |
    83	+-------------------------+----------+-------------------------------+
    84	| genshi_text             | 37.2 ms  | 36.0 ms: 1.03x faster (-3%)   |
    85	+-------------------------+----------+-------------------------------+
    86	| sympy_str               | 457 ms   | 442 ms: 1.03x faster (-3%)    |
    87	+-------------------------+----------+-------------------------------+
    88	| chaos                   | 149 ms   | 144 ms: 1.03x faster (-3%)    |
    89	+-------------------------+----------+-------------------------------+
    90	| fannkuch                | 548 ms   | 531 ms: 1.03x faster (-3%)    |
    91	+-------------------------+----------+-------------------------------+
    92	| regex_effbot            | 2.90 ms  | 2.81 ms: 1.03x faster (-3%)   |
    93	+-------------------------+----------+-------------------------------+
    94	| telco                   | 7.27 ms  | 7.05 ms: 1.03x faster (-3%)   |
    95	+-------------------------+----------+-------------------------------+
    96	| json_loads              | 28.5 us  | 27.6 us: 1.03x faster (-3%)   |
    97	+-------------------------+----------+-------------------------------+
    98	| django_template         | 77.4 ms  | 75.1 ms: 1.03x faster (-3%)   |
    99	+-------------------------+----------+-------------------------------+
   100	| pathlib                 | 23.4 ms  | 22.8 ms: 1.03x faster (-3%)   |
   101	+-------------------------+----------+-------------------------------+
   102	| genshi_xml              | 79.0 ms  | 77.0 ms: 1.03x faster (-2%)   |
   103	+-------------------------+----------+-------------------------------+
   104	| logging_simple          | 12.2 us  | 11.9 us: 1.02x faster (-2%)   |
   105	+-------------------------+----------+-------------------------------+
   106	| dulwich_log             | 83.8 ms  | 81.9 ms: 1.02x faster (-2%)   |
   107	+-------------------------+----------+-------------------------------+
   108	| hexiom                  | 13.0 ms  | 12.7 ms: 1.02x faster (-2%)   |
   109	+-------------------------+----------+-------------------------------+
   110	| tornado_http            | 184 ms   | 180 ms: 1.02x faster (-2%)    |
   111	+-------------------------+----------+-------------------------------+
   112	| regex_v8                | 23.3 ms  | 22.9 ms: 1.02x faster (-2%)   |
   113	+-------------------------+----------+-------------------------------+
   114	| regex_dna               | 165 ms   | 167 ms: 1.01x slower (+1%)    |
   115	+-------------------------+----------+-------------------------------+
   116	| scimark_sparse_mat_mult | 4.52 ms  | 19.7 ms: 4.36x slower (+336%) |
   117	+-------------------------+----------+-------------------------------+
   118	
   119	Not significant (3): chameleon; sqlalchemy_imperative; spectral_norm

Median 4% faster; quartiles 3% to 7% faster; extremes 1% slower to 17% faster (ignoring one benchmark reading >4x slower, where the timing clearly hiccupped).

So if you were comparing --with-lto to the unmodified nixpkgs build, then your result is in the range of what I'm seeing in the benchmark suite.

If you were comparing --with-lto --enable-optimizations to just --enable-optimizations, then that is an interesting contrast! I'd be curious what you get if you take the same two builds where you found one interpreter ran 15% faster than the other on your workload, and try them out on pyperformance. That'd help resolve whether the story is that none of the benchmarks represent your workload, or that some other variable (about the exact builds, our respective machines, etc.) causes you to get different results from me even on the same benchmarks.


dnadlinger commented Apr 3, 2020

@gnprice: Many thanks for the detailed (!) response! Let's just go for --enable-optimizations for the time being, as this is also unambiguously recommended by upstream.

I'd love to post more detailed data, but unfortunately, it will take me a while to access the machine where I ran the benchmarks due to lockdown-related silliness.

@FRidh FRidh added this to Needs review in Staging Apr 3, 2020
@FRidh FRidh self-assigned this Apr 21, 2020
@@ -0,0 +1,30 @@
Call the linker with -O1 -Bsymbolic-functions.

Based on debian/patches/link-opt.diff in the Debian package `python3`.
Contributor:

Can we get a link to salsa and list the changes from the Debian patch?

Contributor Author:

Sure. Here's the patch on salsa (as of latest master):
https://salsa.debian.org/cpython-team/python3/-/blob/02735af38/debian/patches/link-opt.diff

This patch has exactly the same content -- the only change is in the surrounding context.

Similarly, within Debian, the patch's content has been unchanged (the only changes to the patch file have been to update the context and line numbers) since LDCXXSHARED was introduced to the upstream version of this code in Python 3.2 (and 2.7). Here's that patch as it appeared for python3.2 in Debian 7 Wheezy, released in 2013:
https://sources.debian.org/src/python3.2/3.2.3-7/debian/patches/link-opt.diff/

And here it is from the python2.4 in Debian 5 Lenny, released in 2009:
https://sources.debian.org/src/python2.4/2.4.6-1+lenny1/debian/patches/link-opt.dpatch/
That one already has the same content except for the absence of LDCXXSHARED.

I'll add the Salsa link into the patch files, and clarify "based on" to "has the same content".

Contributor Author:

(Oh, and the Debian versions of the patch apply to configure.ac, or on older CPython versions configure.in, because Debian makes a point of regenerating configure from those.)

Member:

Regenerating configure is not a bad idea either; just add the hook.

@gnprice (Contributor Author), May 12, 2020:

Oh neat -- searching a bit, I guess that'd be autoreconfHook? And we could enable it by just adding it to nativeBuildInputs, IIUC. Cool.

Happy to add a commit making that change too. I won't do that tonight, only because testing properly causes rebuilds of GCC etc. that take over an hour, and it's too late in the evening here for another round of that.
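
Concretely, the change under discussion is roughly (a sketch, assuming autoreconfHook composes cleanly with the expression's existing inputs):

  python37.overrideAttrs (old: {
    # regenerate ./configure from configure.ac at build time
    nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [ autoreconfHook ];
  })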

Contributor Author:

If the patches from Debian apply, it is better to fetch them using fetchpatch.

OK, can do.

Any preference on fetching from Salsa vs. sources.debian.org? (The latter being a record of actual releases.)

I think I can find all 4 of the needed versions of the patch (the context keeps changing slightly) on Salsa, and 3 of the 4 on sources -- Debian skipped Python 3.6.

Contributor:

Salsa is better since it is a Git repository -- we can pin it to a commit and it will always be there. Sources might remove older Debian releases after they are no longer supported.

Contributor Author:

Mmm, that's a good consideration. On that score, I think sources is actually more dependable -- it isn't the live repository for active releases; rather, it is meant as a permanent archive for studying long-term history. It has all Debian releases going back to 2.0 "hamm", released in 1998:
https://sources.debian.org/stats/

These patches are probably pretty stable in their URLs on Salsa, too. But they could go away -- the package's maintainer, or a future new maintainer, could move its development to some other repo and remove this one. If it's simply moved within Salsa there'd probably be a redirect, but they could also move it to some other Git hosting service or even something else. Anything the particular maintainer finds convenient for ongoing development would be seen as 100% OK from a Debian perspective.

Contributor:

IIRC, sources only carry the latest revision they have for a given Debian release:

https://sources.debian.org/src/python3.2/

Maybe there is some archive site but I do not see revision -6 in the above link.

Salsa should be stable enough; we can always deal with breakage like we did when they switched from Alioth to Salsa. And Hydra caches the sources, so even if they switch, users will still get it from there.

Contributor Author:

IIRC, sources only carry the latest revision they have for a given Debian release:

Hmm, true. I'd missed that point, thanks. That means nothing will change for any release that's no longer supported, but it can for a release that is (or is still in development).

https://sources.debian.org/src/python3.2/
Maybe there is some archive site but I do not see revision -6 in the above link.

For an even sharper example:
https://sources.debian.org/src/python3.5/

The version available is 3.5.3-1+deb9u1, from Debian 9 "stretch". The "+debXuY" suffix means that was an update after the release was stable, which means the previous version 3.5.3-1 was the one in the original stable release of "stretch". But that original version doesn't appear on sources. If there's another update to that package in "stretch" in its remaining months before EOL, then presumably the +deb9u1 version will disappear too.

There is a comprehensive archive site which has all versions:
http://snapshot.debian.org/package/python3.2/
But that one only provides whole package files, not individual files from within the package, so it's not suitable for fetchpatch.

I'll use Salsa. Thanks for the discussion!

And Hydra caches the sources so even if they switch, users will still get it from there.

Oh, good to know. That's quite reassuring for the impact of temporary outages, too.
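
So the pinned fetch would look roughly like this (the commit hash is the one linked above, the URL is the raw-file form of that blob link, and the sha256 is a placeholder to be filled in):

  patches = [
    (fetchpatch {
      url = "https://salsa.debian.org/cpython-team/python3/-/raw/02735af38/debian/patches/link-opt.diff";
      sha256 = "0000000000000000000000000000000000000000000000000000";
    })
  ];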

Without this flag, the configure script prints a warning at the end,
like this (reformatted):

  If you want a release build with all stable optimizations active
  (PGO, etc), please run ./configure --enable-optimizations

We're doing a build to distribute to people for day-to-day use,
doing things other than developing the Python interpreter.  So
that's certainly a release build -- we're the target audience for
this recommendation.

---

And, trying it out, upstream isn't kidding!  I ran the standard
benchmark suite that the CPython developers use for performance
work, "pyperformance".  Following its usage instructions:
  https://pyperformance.readthedocs.io/usage.html
I ran the whole suite, like so:

  $ nix-shell -p ./result."$variant" --run '
      cd $(mktemp -d); python -m venv venv; . venv/bin/activate
      pip install pyperformance
      pyperformance run -o ~/tmp/result.'"$variant"'.json
    '

and then examined the results with commands like:

  $ python -m pyperf compare_to --table -G \
      ~/tmp/result.{$before,$after}.json

Across all the benchmarks in the suite, the median speedup was 16%.
(Meaning 1.16x faster; 14% less time).

The middle half of them ranged from a 13% to a 22% speedup.

Each of the 60 benchmarks in the suite got faster, by speedups
ranging from 3% to 53%.

---

One reason this isn't just the default to begin with is that, until
recently, it made the build a lot slower.  What it does is turn on
profile-guided optimization, which means first build for profiling,
then run some task to get a profile, then build again using the
profile.  And, short of further customization, the task it would use
would be nearly the full test suite, which includes a lot of
expensive and slow tests, and can easily take half an hour to run.

Happily, in 2019 an upstream developer did the work to carefully
select a more appropriate set of tests to use for the profile:
  python/cpython@4e16a4a31
  https://bugs.python.org/issue36044
This suite takes just 2 minutes to run.  And the resulting final
build is actually slightly faster than with the much longer suite,
at least as measured by those standard "pyperformance" benchmarks.
That work went into the 3.8 release, but the same list works great
if used on older releases too.

So, start passing that --enable-optimizations flag; and backport
that good-for-PGO set of tests, so that we use it on all releases.

gnprice commented May 12, 2020

gnprice requested review from edolstra, Infinisil, jonringer, nbp and Profpatsch as code owners 1 minute ago

Sorry about this bit -- I accidentally rebased the branch onto current master. Undid that, but it seems the code-owners review requests persist.


gnprice commented May 12, 2020

Thanks @FRidh and @jtojnar for the reviews! Pushed a version that makes all the suggested changes from the last round of review. Please take a look.

Tested this new version of this PR on all Python versions -- attributes python39, python38, python37, python36, python35, python27 -- and the build logs show the intended changes are present. I haven't re-run benchmarks, because the changes from the previous version of this PR are only in how the patches are organized and applied, and the resulting build is unchanged.

@ofborg ofborg bot requested a review from FRidh May 12, 2020 08:22
@drewrisinger (Contributor) left a comment:

Diff LGTM. Tested by running nixpkgs-review pr 84072 -p python37, and got the following error (it seems to be some artifact of my local PC, hence the approval):

/build/ccy2ZovP.s: Assembler messages:
/build/ccy2ZovP.s: Fatal error: can't write 62 bytes to section .text of tree-ssa-phiopt.o: 'No space left on device'
/build/ccy2ZovP.s: Fatal error: can't close tree-ssa-phiopt.o: No space left on device
make[3]: *** [Makefile:1116: tree-ssa-phiopt.o] Error 1
make[3]: *** Waiting for unfinished jobs....
../../gcc-9.3.0/gcc/tree-ssa-math-opts.c:3865:1: fatal error: error writing to /build/ccmoyYZ3.s: No space left on device
 3865 | }
      | ^
compilation terminated.
make[3]: *** [Makefile:1116: tree-ssa-math-opts.o] Error 1
../../gcc-9.3.0/gcc/tree-ssa-reassoc.c:6355:1: fatal error: error writing to /build/ccxL3Bnc.s: No space left on device
 6355 | }
      | ^
compilation terminated.
make[3]: *** [Makefile:1116: tree-ssa-reassoc.o] Error 1
rm gcc.pod
make[3]: Leaving directory '/build/build/gcc'
make[2]: *** [Makefile:4830: all-stageprofile-gcc] Error 2
make[2]: Leaving directory '/build/build'
make[1]: *** [Makefile:22762: stageprofile-bubble] Error 2
make[1]: Leaving directory '/build/build'
make: *** [Makefile:23016: profiledbootstrap] Error 2
builder for '/nix/store/z272zdw8hqcii4bl8l07vzd60qf6y87y-gcc-9.3.0.drv' failed with exit code 2


gnprice commented May 13, 2020

/build/ccy2ZovP.s: Fatal error: can't write 62 bytes to section .text of tree-ssa-phiopt.o: 'No space left on device'
/build/ccy2ZovP.s: Fatal error: can't close tree-ssa-phiopt.o: No space left on device
make[3]: *** [Makefile:1116: tree-ssa-phiopt.o] Error 1
make[3]: *** Waiting for unfinished jobs....
../../gcc-9.3.0/gcc/tree-ssa-math-opts.c:3865:1: fatal error: error writing to /build/ccmoyYZ3.s: No space left on device

Thanks!

Yeah, I find that the rebuild of GCC requires something over 5 GB of free disk space. The bulk of that gets freed as soon as that build is done, so it doesn't accumulate; but between that and the stuff that does accumulate, I had to spend some time freeing up space elsewhere on my disk when I started working on this PR.

@drewrisinger (Contributor):

Yeah, I find that the rebuild of GCC requires something over 5 GB of free disk space. The bulk of that gets freed as soon as that build is done, so it doesn't accumulate; but between that and the stuff that does accumulate, I had to spend some time freeing up space elsewhere on my disk when I started working on this PR.

Yeah, I have ~500 GB free, so no idea what's causing this. Maybe a swap partition or something. Oh well. Not really relevant.


andersk commented May 13, 2020

@drewrisinger Perhaps you have boot.tmpOnTmpfs = true? Since tmpfs by default is limited to half of physical RAM, you might try systemd.services.nix-daemon.environment.TMPDIR = "/var/tmp";.
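
As a NixOS configuration snippet, that suggestion would look something like:

  # configuration.nix (sketch): point the nix-daemon's TMPDIR at real
  # disk instead of tmpfs
  { config, pkgs, ... }: {
    systemd.services.nix-daemon.environment.TMPDIR = "/var/tmp";
  }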

@drewrisinger (Contributor):

@drewrisinger Perhaps you have boot.tmpOnTmpfs = true? Since tmpfs by default is limited to half of physical RAM, you might try systemd.services.nix-daemon.environment.TMPDIR = "/var/tmp";.

I was building on Ubuntu 18.04 LTS, not NixOS. That could cause some differences.

In particular this will let us use patches that apply to configure.ac.
I took a close look at how Debian builds the Python interpreter,
because I noticed it ran substantially faster than the one in nixpkgs
and I was curious why.

One thing that I found made a material difference in performance was
this pair of linker flags (passed to the compiler):

    -Wl,-O1 -Wl,-Bsymbolic-functions

In other words, effectively the linker gets passed the flags:

    -O1 -Bsymbolic-functions

Doing the same thing in nixpkgs turns out to make the interpreter
run about 6% faster, which is quite a big win for such an easy
change.  So, let's apply it.

---

I had not known there was a `-O1` flag for the *linker*!
But indeed there is.

These flags are unrelated to "link-time optimization" (LTO), despite
the latter's name.  LTO means doing classic compiler optimizations
on the actual code, at the linking step when it becomes possible to
do them with cross-object-file information.  These two flags, by
contrast, cause the linker to make certain optimizations within the
scope of its job as the linker.

Documentation is here, though sparse:
  https://sourceware.org/binutils/docs-2.31/ld/Options.html

The meaning of -O1 was explained in more detail in this LWN article:
  https://lwn.net/Articles/192624/
Apparently it makes the resulting symbol table use a bigger hash
table, so the load factor is smaller and lookups are faster.  Cool.

As for -Bsymbolic-functions, the documentation indicates that it's a
way of saving lookups through the symbol table entirely.  There can
apparently be situations where it changes the behavior of a program,
specifically if the program relies on linker tricks to provide
customization features:
  https://bugs.launchpad.net/ubuntu/+source/xfe/+bug/644645
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637184#35
But I'm pretty sure CPython doesn't permit that kind of trick: you
don't load a shared object that tries to redefine some symbol found
in the interpreter core.

The stronger reason I'm confident using -Bsymbolic-functions is
safe, though, is empirical.  Both Debian and Ubuntu have been
shipping a Python built this way since forever -- it was introduced
for Python 2.4 and 2.5 in Ubuntu "hardy" and Debian "lenny",
released in 2008 and 2009.  In those 12 years they haven't seen a
need to drop this flag; and I've been unable to locate any reports
of trouble related to it, either on the Web in general or on the
Debian bug tracker.  (There are reports of a handful of other
programs breaking with it, but not Python/CPython.)  So that seems
like about as thorough testing as one could hope for.

---

As for the performance impact: I ran CPython upstream's preferred
benchmark suite, "pyperformance", in the same way as described in
the previous commit.  On top of that commit's change, the results
across the 60 benchmarks in the suite are:

The median is 6% faster.

The middle half (aka interquartile range) is from 4% to 8% faster.

Out of 60 benchmarks, 3 come out slower, by 1-4%.  At the other end,
5 are at least 10% faster, and one is 17% faster.

So, that's quite a material speedup!  I don't know how big the
effect of these flags is for other software; but certainly CPython
tends to do plenty of dynamic linking, as that's how it loads
extension modules, which are ubiquitous in the stdlib as well as
popular third-party libraries.  So perhaps that helps explain why
optimizing the dynamic linker has such an impact.

gnprice commented May 14, 2020

Thanks @FRidh and @jtojnar for the further reviews! Just pushed a version which I believe makes all the suggested changes.

I've tested this version in the same way as the previous version:

Tested this new version of this PR on all Python versions -- attributes python39, python38, python37, python36, python35, python27 -- and the build logs show the intended changes are present.

One of the patches (for 3.7) comes via plain fetchurl instead of fetchpatch, because the latter produced the following error:

$ nix-build . -A python3Minimal 
error: anonymous function at
  /home/greg/n/nix/pkgs/pkgs/build-support/fetchurl/boot.nix:5:1
  called with unexpected argument 'meta', at
  /home/greg/n/nix/pkgs/pkgs/build-support/fetchpatch/default.nix:18:1
(use '--show-trace' to show detailed location information)

From the discussion in this thread about a similar error:
https://discourse.nixos.org/t/how-to-patch-in-an-overlay/3678
I believe what triggers that is that 3.7 is used in the transitive dependencies of fetchpatch itself.


FRidh commented May 14, 2020

Thanks. Right, fetchpatch can't be used in bootstrapping. In that case it is better to include the patch in the tree, because we do not want the checksum to suddenly become incorrect.


gnprice commented May 14, 2020

In that case it is better to include the patch in the tree, because we do not want the checksum to be incorrect suddenly.

Can you describe the scenario in which you're concerned that the checksum might change?

As @jtojnar observed upthread, we're downloading here from a Git repository. It's not asking the server to generate a Git diff between two files (or trees) -- rather it's downloading the blob that's at a specific path in a specific commit's tree. That blob happens to be a diff, which is of interest to us, but it's not something the server (or the Git repo) knows at all; this isn't a case where the server could decide to tweak the diff format.

@doronbehar (Contributor):

Usually I don't bump, but I can't wait to see the effects of this PR 😋.

@FRidh FRidh merged commit a2be64b into NixOS:staging Jun 4, 2020
Staging automation moved this from Needs review to Done Jun 4, 2020

FRidh commented Jun 4, 2020

fetchpatch is causing trouble #89489


gnprice commented Jun 6, 2020

Thanks @FRidh and everyone for the reviews and helping push this through!

I am curious about the trouble. I figure that new thread is the best place to discuss that, so I'll ask there.


FRidh commented Dec 30, 2020

PR #107965 disables the optimizations again, for reproducibility.

@nixos-discourse:

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/why-is-the-nix-compiled-python-slower/18717/13
