
cpython: Use optimizations, for a 25% speedup. #84072

Merged (4 commits into NixOS:staging, Jun 4, 2020)

Conversation


@gnprice gnprice commented Apr 2, 2020

Motivation for this change

This PR adds two kinds of flags that enable different optimizations when building the CPython interpreter. One makes the resulting interpreter run about 16% faster, and the other makes it about 6% faster; the combined speedup is 25%.


One of the flags is ./configure --enable-optimizations. Upstream recommends this by having ./configure print a warning if omitted:

If you want a release build with all stable optimizations active
(PGO, etc), please run ./configure --enable-optimizations

We're doing a build to distribute to people for day-to-day use, doing things other than developing the Python interpreter. So that's certainly a release build -- this means us.

And it turns out they aren't kidding! Running upstream's own preferred benchmarking suite, pyperformance, before and after adding this flag finds a 16% speedup, as the median across the 60 benchmarks in the suite.
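
For concreteness, this half of the change can be pictured as a small nixpkgs overlay (a minimal sketch, not the PR's actual diff, which edits the cpython expression itself; the only names assumed are the stock overrideAttrs and configureFlags):

  # overlay.nix -- illustration only
  self: super: {
    python37 = super.python37.overrideAttrs (old: {
      configureFlags = (old.configureFlags or [ ]) ++ [ "--enable-optimizations" ];
    });
  }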

This one makes the Python build take somewhat longer, because PGO means first building an instrumented version, running that through some task to get a profile, then building again using the profile.

Fortunately in 2019 a good, short profile task was developed: it takes about 2min to run, and produces good results. That appears upstream in 3.8, but it takes just a line to backport the same task (it's just a list of files in the test suite) to each older version, so we do that. That keeps the effect on build time moderate, and well worthwhile for a 16% speedup to users' Python applications.
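
To picture the backport: it amounts to overriding the Makefile's PROFILE_TASK variable with the curated test list. A hedged sketch, inside the cpython derivation -- the handful of test names here are an abridged, illustrative selection, not the real upstream list:

  # makeFlagsArray, because the value contains spaces
  preBuild = ''
    makeFlagsArray+=(
      "PROFILE_TASK=-m test.regrtest test_array test_base64 test_binascii test_json"
    )
  '';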


The two other flags are -Wl,-O1 -Wl,-Bsymbolic-functions, passed to the compiler at each linking step. These cause the linker to get the flags -O1 -Bsymbolic-functions, and those turn on certain optimizations. (Details in commit message.)

I learned about these flags from a close look at how Debian builds the Python interpreter. Both have been used in every Debian and Ubuntu release since 2008; -Wl,-O1 has been there even longer. Adding them on top of --enable-optimizations gives an additional 6% median speedup.
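
The PR applies the flags through CPython's own build configuration (via the Debian patch discussed below). Purely as an illustration of what they do, the same flags could be injected through the Nix cc-wrapper, whose NIX_LDFLAGS variable takes raw linker flags (so no -Wl, prefix):

  # extending the overlay sketch above -- not the PR's actual approach
  self: super: {
    python37 = super.python37.overrideAttrs (old: {
      NIX_LDFLAGS = (old.NIX_LDFLAGS or "") + " -O1 -Bsymbolic-functions";
    });
  }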


Benchmark results for the combined changes: out of the 60 benchmarks in the suite, the median speedup is 25% (meaning 1.25x faster, taking 20% less time). The middle half ranges from 18% to 32% faster. All get at least a little faster; two of them by less than 10%, and two of them by over 50%.

Those results are all for Python 3.7, as that's the default python3. Python 3.9, 3.8, and 3.6 see similar, slightly smaller results, with median speedups of 22%, 22%, and 23% respectively. All results are on my Linux desktop, with an Intel Core i7-8700K (Coffee Lake) CPU.

For numbers on each individual benchmark in the suite, see:
https://gist.github.com/gnprice/e218f23d97e19a9dd1a15fe27bc61b5f

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
      • python39
      • python38
      • python37
      • python36
      • python35
      • python27
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

The ./configure script prints a warning when passed this flag,
starting with 3.7:

  configure: WARNING: unrecognized options: --with-threads

The reason is that there's no longer such a thing as a build
without threads.

Eliminate the warning by passing the flag only on the older releases
that accept it.

Upstream change and discussion:
  python/cpython@a6a4dc816
  https://bugs.python.org/issue31370
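
A minimal sketch of that version-conditional flag, assuming `lib` and `version` are in scope in the cpython expression:

  configureFlags = [
    "--enable-optimizations"
  ] ++ lib.optional (lib.versionOlder version "3.7") "--with-threads";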
@jonringer (Contributor):

@GrahamcOfBorg build python2
@GrahamcOfBorg build python3
@GrahamcOfBorg build python35
@GrahamcOfBorg build python36
@GrahamcOfBorg build python38
@GrahamcOfBorg build python37.pkgs.requests
@GrahamcOfBorg build azure-cli


FRidh commented Apr 2, 2020

How much time does it take to build python with and without?


FRidh commented Apr 2, 2020

@jonringer cpython 3.7 is used for bootstrapping, ofborg won't like that 😛


gnprice commented Apr 2, 2020

How much time does it take to build python with and without?

On my desktop, building python3 with these changes takes 295s (4m55s). On master (at commit 1992768), it takes 101s (1m41s).

That's with significant parallelism available, and so the majority of the additional wall-clock time is for running the profiling task, which runs without parallelism; that step contributed 133s (2m13s).


FRidh commented Apr 2, 2020

Earlier PR #43442.


gnprice commented Apr 2, 2020

Earlier PR #43442.

Ah, yeah -- I saw that one, and should have linked to it here. That PR was to turn on one of these flags, --enable-optimizations.

It looks like there weren't any benchmarks done in that thread; I think the benchmark results turn out to make a strong case that these optimizations are valuable.

The impact on build time was also much steeper in that version, because the upstream work to develop the new short profile task hadn't yet happened then.

@dnadlinger (Contributor):

--with-lto might be worth a look as well; in my tests (custom Nix 3.8 build) I saw quite a big improvement (> ~15%) from LTO on some particular project I cared about.


Mic92 commented Apr 2, 2020

--with-lto might be worth a look as well; in my tests (custom Nix 3.8 build) I saw quite a big improvement (> ~15%) from LTO on some particular project I cared about.

Is the memory consumption for this build still reasonable?


dnadlinger commented Apr 2, 2020

Is the memory consumption for this build still reasonable?

What is "reasonable"? I wouldn't have noticed unless it had been more than a few GB; you'd need to re-measure for an accurate figure (adding the --with-lto configure flag is the only change required).
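
For reference, that one-flag change as a self-contained sketch (the file name and the choice of python38 are just for illustration):

  # lto.nix -- build with: nix-build lto.nix
  with import <nixpkgs> { };
  python38.overrideAttrs (old: {
    configureFlags = (old.configureFlags or [ ]) ++ [ "--with-lto" ];
  })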


Mic92 commented Apr 2, 2020

Is the memory consumption for this build still reasonable?

What is "reasonable"? I wouldn't have noticed unless it had been more than a few GB; you'd need to re-measure for an accurate figure (adding the --with-lto configure flag is the only change required).

I would say if it is still buildable on a machine with 4 GB it should be fine.


gnprice commented Apr 3, 2020

--with-lto might be worth a look as well; in my tests (custom Nix 3.8 build) I saw quite a big improvement (> ~15%) from LTO on some particular project I cared about.

I tried adding --with-lto on top of the changes in this PR. On the pyperformance suite, it was mostly a wash, and made things run a bit slower more often than faster.

Which surprised me! That's what I found the other day on an early draft of this PR, too: a Python built with PGO + LTO runs a bit slower than one with PGO alone, which is why I left LTO out of this PR. But I measured it again with a fresh build just now, and got the same surprising result.

Detailed table:

$ python -m pyperf compare_to py37-nix{.pgo-linkopt.a7*,.pgo-linkopt-lto.0a*}.json --table -G | nl -ba
     1	+------------------------+--------------------------------+------------------------------------+
     2	| Benchmark              | py37-nix.pgo-linkopt.a74fa3abc | py37-nix.pgo-linkopt-lto.0ad69d097 |
     3	+========================+================================+====================================+
     4	| chaos                  | 113 ms                         | 109 ms: 1.04x faster (-3%)         |
     5	+------------------------+--------------------------------+------------------------------------+
     6	| pickle_dict            | 27.2 us                        | 26.3 us: 1.03x faster (-3%)        |
     7	+------------------------+--------------------------------+------------------------------------+
     8	| dulwich_log            | 66.4 ms                        | 64.9 ms: 1.02x faster (-2%)        |
     9	+------------------------+--------------------------------+------------------------------------+
    10	| sympy_str              | 373 ms                         | 368 ms: 1.01x faster (-1%)         |
    11	+------------------------+--------------------------------+------------------------------------+
    12	| sympy_integrate        | 28.8 ms                        | 28.5 ms: 1.01x faster (-1%)        |
    13	+------------------------+--------------------------------+------------------------------------+
    14	| scimark_fft            | 319 ms                         | 317 ms: 1.01x faster (-1%)         |
    15	+------------------------+--------------------------------+------------------------------------+
    16	| go                     | 247 ms                         | 246 ms: 1.01x faster (-1%)         |
    17	+------------------------+--------------------------------+------------------------------------+
    18	| unpickle               | 12.6 us                        | 12.7 us: 1.01x slower (+1%)        |
    19	+------------------------+--------------------------------+------------------------------------+
    20	| pyflate                | 640 ms                         | 647 ms: 1.01x slower (+1%)         |
    21	+------------------------+--------------------------------+------------------------------------+
    22	| xml_etree_process      | 81.4 ms                        | 82.6 ms: 1.01x slower (+1%)        |
    23	+------------------------+--------------------------------+------------------------------------+
    24	| sqlalchemy_imperative  | 33.1 ms                        | 33.5 ms: 1.01x slower (+1%)        |
    25	+------------------------+--------------------------------+------------------------------------+
    26	| deltablue              | 7.23 ms                        | 7.33 ms: 1.01x slower (+1%)        |
    27	+------------------------+--------------------------------+------------------------------------+
    28	| genshi_text            | 29.1 ms                        | 29.5 ms: 1.01x slower (+1%)        |
    29	+------------------------+--------------------------------+------------------------------------+
    30	| spectral_norm          | 124 ms                         | 126 ms: 1.02x slower (+2%)         |
    31	+------------------------+--------------------------------+------------------------------------+
    32	| pickle_pure_python     | 455 us                         | 462 us: 1.02x slower (+2%)         |
    33	+------------------------+--------------------------------+------------------------------------+
    34	| richards               | 72.4 ms                        | 73.7 ms: 1.02x slower (+2%)        |
    35	+------------------------+--------------------------------+------------------------------------+
    36	| regex_dna              | 161 ms                         | 165 ms: 1.02x slower (+2%)         |
    37	+------------------------+--------------------------------+------------------------------------+
    38	| regex_compile          | 180 ms                         | 184 ms: 1.02x slower (+2%)         |
    39	+------------------------+--------------------------------+------------------------------------+
    40	| tornado_http           | 154 ms                         | 157 ms: 1.02x slower (+2%)         |
    41	+------------------------+--------------------------------+------------------------------------+
    42	| genshi_xml             | 60.2 ms                        | 61.6 ms: 1.02x slower (+2%)        |
    43	+------------------------+--------------------------------+------------------------------------+
    44	| sqlalchemy_declarative | 159 ms                         | 163 ms: 1.02x slower (+2%)         |
    45	+------------------------+--------------------------------+------------------------------------+
    46	| logging_format         | 10.1 us                        | 10.3 us: 1.02x slower (+2%)        |
    47	+------------------------+--------------------------------+------------------------------------+
    48	| float                  | 113 ms                         | 116 ms: 1.03x slower (+3%)         |
    49	+------------------------+--------------------------------+------------------------------------+
    50	| hexiom                 | 10.0 ms                        | 10.3 ms: 1.03x slower (+3%)        |
    51	+------------------------+--------------------------------+------------------------------------+
    52	| mako                   | 17.1 ms                        | 17.6 ms: 1.03x slower (+3%)        |
    53	+------------------------+--------------------------------+------------------------------------+
    54	| json_loads             | 23.7 us                        | 24.4 us: 1.03x slower (+3%)        |
    55	+------------------------+--------------------------------+------------------------------------+
    56	| python_startup_no_site | 6.76 ms                        | 6.98 ms: 1.03x slower (+3%)        |
    57	+------------------------+--------------------------------+------------------------------------+
    58	| unpickle_list          | 4.01 us                        | 4.15 us: 1.03x slower (+3%)        |
    59	+------------------------+--------------------------------+------------------------------------+
    60	| pickle                 | 8.77 us                        | 9.09 us: 1.04x slower (+4%)        |
    61	+------------------------+--------------------------------+------------------------------------+
    62	| scimark_monte_carlo    | 101 ms                         | 105 ms: 1.04x slower (+4%)         |
    63	+------------------------+--------------------------------+------------------------------------+
    64	| unpickle_pure_python   | 361 us                         | 376 us: 1.04x slower (+4%)         |
    65	+------------------------+--------------------------------+------------------------------------+
    66	| chameleon              | 9.28 ms                        | 9.70 ms: 1.05x slower (+5%)        |
    67	+------------------------+--------------------------------+------------------------------------+
    68	| json_dumps             | 11.4 ms                        | 11.9 ms: 1.05x slower (+5%)        |
    69	+------------------------+--------------------------------+------------------------------------+
    70	| telco                  | 5.50 ms                        | 5.77 ms: 1.05x slower (+5%)        |
    71	+------------------------+--------------------------------+------------------------------------+
    72	| xml_etree_iterparse    | 96.2 ms                        | 101 ms: 1.05x slower (+5%)         |
    73	+------------------------+--------------------------------+------------------------------------+
    74	| fannkuch               | 479 ms                         | 504 ms: 1.05x slower (+5%)         |
    75	+------------------------+--------------------------------+------------------------------------+
    76	| sqlite_synth           | 2.59 us                        | 2.73 us: 1.05x slower (+5%)        |
    77	+------------------------+--------------------------------+------------------------------------+
    78	| xml_etree_generate     | 99.2 ms                        | 105 ms: 1.06x slower (+6%)         |
    79	+------------------------+--------------------------------+------------------------------------+
    80	| xml_etree_parse        | 127 ms                         | 138 ms: 1.09x slower (+9%)         |
    81	+------------------------+--------------------------------+------------------------------------+
    82	
    83	Not significant (21): 2to3; logging_simple; sympy_expand; sympy_sum; crypto_pyaes; regex_v8; django_template; pickle_list; logging_silent; raytrace; nqueens; regex_effbot; scimark_lu; unpack_sequence; pidigits; scimark_sor; meteor_contest; scimark_sparse_mat_mult; python_startup; pathlib; nbody

So 7/60 benchmarks get faster; 21/60 no change; 32/60 get slower. The median is 1% slower, and the quartiles 3% slower to 0%. The extremes are 9% slower and 4% faster.

OTOH without PGO, starting from a baseline of the expression in nixpkgs master, LTO is quite helpful:

$ python -m pyperf compare_to py37-nix{,.lto}.json --table -G | nl -ba
     1	+-------------------------+----------+-------------------------------+
     2	| Benchmark               | py37-nix | py37-nix.lto                  |
     3	+=========================+==========+===============================+
     4	| pickle_dict             | 39.9 us  | 34.2 us: 1.17x faster (-14%)  |
     5	+-------------------------+----------+-------------------------------+
     6	| pickle_list             | 4.97 us  | 4.38 us: 1.14x faster (-12%)  |
     7	+-------------------------+----------+-------------------------------+
     8	| xml_etree_process       | 111 ms   | 101 ms: 1.10x faster (-9%)    |
     9	+-------------------------+----------+-------------------------------+
    10	| scimark_monte_carlo     | 134 ms   | 123 ms: 1.09x faster (-8%)    |
    11	+-------------------------+----------+-------------------------------+
    12	| xml_etree_iterparse     | 114 ms   | 106 ms: 1.08x faster (-7%)    |
    13	+-------------------------+----------+-------------------------------+
    14	| scimark_fft             | 390 ms   | 362 ms: 1.08x faster (-7%)    |
    15	+-------------------------+----------+-------------------------------+
    16	| xml_etree_generate      | 131 ms   | 122 ms: 1.07x faster (-7%)    |
    17	+-------------------------+----------+-------------------------------+
    18	| mako                    | 23.3 ms  | 21.8 ms: 1.07x faster (-7%)   |
    19	+-------------------------+----------+-------------------------------+
    20	| pyflate                 | 832 ms   | 777 ms: 1.07x faster (-7%)    |
    21	+-------------------------+----------+-------------------------------+
    22	| float                   | 133 ms   | 124 ms: 1.07x faster (-6%)    |
    23	+-------------------------+----------+-------------------------------+
    24	| python_startup_no_site  | 7.75 ms  | 7.25 ms: 1.07x faster (-6%)   |
    25	+-------------------------+----------+-------------------------------+
    26	| sqlite_synth            | 3.33 us  | 3.12 us: 1.07x faster (-6%)   |
    27	+-------------------------+----------+-------------------------------+
    28	| richards                | 101 ms   | 94.3 ms: 1.07x faster (-6%)   |
    29	+-------------------------+----------+-------------------------------+
    30	| json_dumps              | 14.4 ms  | 13.5 ms: 1.07x faster (-6%)   |
    31	+-------------------------+----------+-------------------------------+
    32	| sympy_expand            | 734 ms   | 688 ms: 1.07x faster (-6%)    |
    33	+-------------------------+----------+-------------------------------+
    34	| pickle                  | 13.2 us  | 12.4 us: 1.07x faster (-6%)   |
    35	+-------------------------+----------+-------------------------------+
    36	| unpickle_list           | 4.25 us  | 3.99 us: 1.07x faster (-6%)   |
    37	+-------------------------+----------+-------------------------------+
    38	| go                      | 337 ms   | 317 ms: 1.07x faster (-6%)    |
    39	+-------------------------+----------+-------------------------------+
    40	| nbody                   | 138 ms   | 130 ms: 1.06x faster (-6%)    |
    41	+-------------------------+----------+-------------------------------+
    42	| sympy_integrate         | 35.2 ms  | 33.2 ms: 1.06x faster (-6%)   |
    43	+-------------------------+----------+-------------------------------+
    44	| scimark_lu              | 227 ms   | 215 ms: 1.06x faster (-5%)    |
    45	+-------------------------+----------+-------------------------------+
    46	| scimark_sor             | 243 ms   | 230 ms: 1.06x faster (-5%)    |
    47	+-------------------------+----------+-------------------------------+
    48	| unpickle                | 16.0 us  | 15.2 us: 1.06x faster (-5%)   |
    49	+-------------------------+----------+-------------------------------+
    50	| xml_etree_parse         | 146 ms   | 139 ms: 1.05x faster (-5%)    |
    51	+-------------------------+----------+-------------------------------+
    52	| pidigits                | 173 ms   | 164 ms: 1.05x faster (-5%)    |
    53	+-------------------------+----------+-------------------------------+
    54	| python_startup          | 10.3 ms  | 9.81 ms: 1.05x faster (-5%)   |
    55	+-------------------------+----------+-------------------------------+
    56	| raytrace                | 678 ms   | 645 ms: 1.05x faster (-5%)    |
    57	+-------------------------+----------+-------------------------------+
    58	| sqlalchemy_declarative  | 188 ms   | 179 ms: 1.05x faster (-5%)    |
    59	+-------------------------+----------+-------------------------------+
    60	| deltablue               | 10.1 ms  | 9.65 ms: 1.05x faster (-5%)   |
    61	+-------------------------+----------+-------------------------------+
    62	| regex_compile           | 225 ms   | 215 ms: 1.04x faster (-4%)    |
    63	+-------------------------+----------+-------------------------------+
    64	| nqueens                 | 117 ms   | 112 ms: 1.04x faster (-4%)    |
    65	+-------------------------+----------+-------------------------------+
    66	| unpickle_pure_python    | 480 us   | 460 us: 1.04x faster (-4%)    |
    67	+-------------------------+----------+-------------------------------+
    68	| meteor_contest          | 113 ms   | 109 ms: 1.04x faster (-4%)    |
    69	+-------------------------+----------+-------------------------------+
    70	| unpack_sequence         | 47.8 ns  | 46.0 ns: 1.04x faster (-4%)   |
    71	+-------------------------+----------+-------------------------------+
    72	| 2to3                    | 383 ms   | 369 ms: 1.04x faster (-4%)    |
    73	+-------------------------+----------+-------------------------------+
    74	| logging_format          | 13.7 us  | 13.2 us: 1.04x faster (-4%)   |
    75	+-------------------------+----------+-------------------------------+
    76	| pickle_pure_python      | 600 us   | 579 us: 1.04x faster (-4%)    |
    77	+-------------------------+----------+-------------------------------+
    78	| sympy_sum               | 282 ms   | 272 ms: 1.04x faster (-3%)    |
    79	+-------------------------+----------+-------------------------------+
    80	| crypto_pyaes            | 130 ms   | 126 ms: 1.04x faster (-3%)    |
    81	+-------------------------+----------+-------------------------------+
    82	| logging_silent          | 232 ns   | 224 ns: 1.04x faster (-3%)    |
    83	+-------------------------+----------+-------------------------------+
    84	| genshi_text             | 37.2 ms  | 36.0 ms: 1.03x faster (-3%)   |
    85	+-------------------------+----------+-------------------------------+
    86	| sympy_str               | 457 ms   | 442 ms: 1.03x faster (-3%)    |
    87	+-------------------------+----------+-------------------------------+
    88	| chaos                   | 149 ms   | 144 ms: 1.03x faster (-3%)    |
    89	+-------------------------+----------+-------------------------------+
    90	| fannkuch                | 548 ms   | 531 ms: 1.03x faster (-3%)    |
    91	+-------------------------+----------+-------------------------------+
    92	| regex_effbot            | 2.90 ms  | 2.81 ms: 1.03x faster (-3%)   |
    93	+-------------------------+----------+-------------------------------+
    94	| telco                   | 7.27 ms  | 7.05 ms: 1.03x faster (-3%)   |
    95	+-------------------------+----------+-------------------------------+
    96	| json_loads              | 28.5 us  | 27.6 us: 1.03x faster (-3%)   |
    97	+-------------------------+----------+-------------------------------+
    98	| django_template         | 77.4 ms  | 75.1 ms: 1.03x faster (-3%)   |
    99	+-------------------------+----------+-------------------------------+
   100	| pathlib                 | 23.4 ms  | 22.8 ms: 1.03x faster (-3%)   |
   101	+-------------------------+----------+-------------------------------+
   102	| genshi_xml              | 79.0 ms  | 77.0 ms: 1.03x faster (-2%)   |
   103	+-------------------------+----------+-------------------------------+
   104	| logging_simple          | 12.2 us  | 11.9 us: 1.02x faster (-2%)   |
   105	+-------------------------+----------+-------------------------------+
   106	| dulwich_log             | 83.8 ms  | 81.9 ms: 1.02x faster (-2%)   |
   107	+-------------------------+----------+-------------------------------+
   108	| hexiom                  | 13.0 ms  | 12.7 ms: 1.02x faster (-2%)   |
   109	+-------------------------+----------+-------------------------------+
   110	| tornado_http            | 184 ms   | 180 ms: 1.02x faster (-2%)    |
   111	+-------------------------+----------+-------------------------------+
   112	| regex_v8                | 23.3 ms  | 22.9 ms: 1.02x faster (-2%)   |
   113	+-------------------------+----------+-------------------------------+
   114	| regex_dna               | 165 ms   | 167 ms: 1.01x slower (+1%)    |
   115	+-------------------------+----------+-------------------------------+
   116	| scimark_sparse_mat_mult | 4.52 ms  | 19.7 ms: 4.36x slower (+336%) |
   117	+-------------------------+----------+-------------------------------+
   118	
   119	Not significant (3): chameleon; sqlalchemy_imperative; spectral_norm

Median 4% faster; quartiles 3% to 7% faster; extremes 1% slower to 17% faster (ignoring one benchmark reading >4x slower, where the timing clearly hiccupped).

So if you were comparing --with-lto to the unmodified nixpkgs build, then your result is in the range of what I'm seeing in the benchmark suite.

If you were comparing --with-lto --enable-optimizations to just --enable-optimizations, then that is an interesting contrast! I'd be curious what you get if you take the same two builds where you found one interpreter ran 15% faster than the other on your workload, and try them out on pyperformance. That'd help resolve whether the story is that none of the benchmarks represent your workload, or that some other variable (about the exact builds, our respective machines, etc.) causes you to get different results from me even on the same benchmarks.


dnadlinger commented Apr 3, 2020

@gnprice: Many thanks for the detailed (!) response! Let's just go for --enable-optimizations for the time being, as this is also unambiguously recommended by upstream.

I'd love to post more detailed data, but unfortunately, it will take me a while to access the machine where I ran the benchmarks due to lockdown-related silliness.

@FRidh FRidh added this to Needs review in Staging Apr 3, 2020
@FRidh FRidh self-assigned this Apr 21, 2020
@@ -0,0 +1,30 @@
Call the linker with -O1 -Bsymbolic-functions.

Based on debian/patches/link-opt.diff in the Debian package `python3`.
Contributor:

Can we get a link to salsa and list the changes from the Debian patch?

Contributor Author:

Sure. Here's the patch on salsa (as of latest master):
https://salsa.debian.org/cpython-team/python3/-/blob/02735af38/debian/patches/link-opt.diff

This patch has exactly the same content -- the only change is in the surrounding context.

Similarly, within Debian, the patch's content has been unchanged (the only changes to the patch file have been to update the context and line numbers) since LDCXXSHARED was introduced to the upstream version of this code in Python 3.2 (and 2.7). Here's that patch as it appeared for python3.2 in Debian 7 Wheezy, released in 2013:
https://sources.debian.org/src/python3.2/3.2.3-7/debian/patches/link-opt.diff/

And here it is from the python2.4 in Debian 5 Lenny, released in 2009:
https://sources.debian.org/src/python2.4/2.4.6-1+lenny1/debian/patches/link-opt.dpatch/
That one already has the same content except for the absence of LDCXXSHARED.

I'll add the Salsa link into the patch files, and clarify "based on" to "has the same content".

Contributor Author:

(Oh, and the Debian versions of the patch apply to configure.ac, or on older CPython versions configure.in, because Debian makes a point of regenerating configure from those.)

Member:

Regenerating configure is not a bad idea either; just add the hook.

@gnprice (Contributor Author), May 12, 2020:

Oh neat -- searching a bit, I guess that'd be autoreconfHook? And we could enable it by just adding it to nativeBuildInputs, IIUC. Cool.

Happy to add a commit making that change too. I won't do that tonight, only because testing properly causes rebuilds of GCC etc. that take over an hour, and it's too late in the evening here for another round of that.
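
Concretely, the change under discussion is roughly (a sketch, assuming autoreconfHook composes cleanly with the expression's existing inputs):

  python37.overrideAttrs (old: {
    # regenerate ./configure from configure.ac at build time
    nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [ autoreconfHook ];
  })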

Contributor Author:

If the patches from Debian apply, it is better to fetch them using fetchpatch.

OK, can do.

Any preference on fetching from Salsa vs. sources.debian.org? (The latter being a record of actual releases.)

I think I can find all 4 of the needed versions of the patch (the context keeps changing slightly) on Salsa, and 3 of the 4 on sources -- Debian skipped Python 3.6.

Contributor:

Salsa is better since it is a Git repository -- we can pin it to a commit and it will always be there. Sources might remove older Debian releases after they are no longer supported.

Contributor Author:

Mmm, that's a good consideration. On that score, I think sources is actually more dependable -- it isn't the live repository for active releases; rather, it is meant as a permanent archive for studying long-term history. It has all Debian releases going back to 2.0 "hamm", released in 1998:
https://sources.debian.org/stats/

These patches are probably pretty stable in their URLs on Salsa, too. But they could go away -- the package's maintainer, or a future new maintainer, could move its development to some other repo and remove this one. If it's simply moved within Salsa there'd probably be a redirect, but they could also move it to some other Git hosting service or even something else. Anything the particular maintainer finds convenient for ongoing development would be seen as 100% OK from a Debian perspective.

Contributor:

IIRC, sources only carry the latest revision they have for a given Debian release:

https://sources.debian.org/src/python3.2/

Maybe there is some archive site but I do not see revision -6 in the above link.

Salsa should be stable enough; we can always deal with breakage like we did when they switched from Alioth to Salsa. And Hydra caches the sources, so even if they switch, users will still get it from there.

Contributor Author:

IIRC, sources only carry the latest revision they have for a given Debian release:

Hmm, true. I'd missed that point, thanks. That means nothing will change for any release that's no longer supported, but it can for a release that is (or is still in development).

https://sources.debian.org/src/python3.2/
Maybe there is some archive site but I do not see revision -6 in the above link.

For an even sharper example:
https://sources.debian.org/src/python3.5/

The version available is 3.5.3-1+deb9u1, from Debian 9 "stretch". The "+debXuY" suffix means that was an update after the release was stable, which means the previous version 3.5.3-1 was the one in the original stable release of "stretch". But that original version doesn't appear on sources. If there's another update to that package in "stretch" in its remaining months before EOL, then presumably the +deb9u1 version will disappear too.

There is a comprehensive archive site which has all versions:
http://snapshot.debian.org/package/python3.2/
But that one only provides whole package files, not individual files from within the package, so it's not suitable for fetchpatch.

I'll use Salsa. Thanks for the discussion!

And Hydra caches the sources so even if they switch, users will still get it from there.

Oh, good to know. That's quite reassuring for the impact of temporary outages, too.
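
So the pinned fetch would look roughly like this (the commit hash is the one linked above, the URL is the raw-file form of that blob link, and the sha256 is a placeholder to be filled in):

  patches = [
    (fetchpatch {
      url = "https://salsa.debian.org/cpython-team/python3/-/raw/02735af38/debian/patches/link-opt.diff";
      sha256 = "0000000000000000000000000000000000000000000000000000";
    })
  ];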

Without this flag, the configure script prints a warning at the end,
like this (reformatted):

  If you want a release build with all stable optimizations active
  (PGO, etc), please run ./configure --enable-optimizations

We're doing a build to distribute to people for day-to-day use,
doing things other than developing the Python interpreter.  So
that's certainly a release build -- we're the target audience for
this recommendation.

---

And, trying it out, upstream isn't kidding!  I ran the standard
benchmark suite that the CPython developers use for performance
work, "pyperformance".  Following its usage instructions:
  https://pyperformance.readthedocs.io/usage.html
I ran the whole suite, like so:

  $ nix-shell -p ./result."$variant" --run '
      cd $(mktemp -d); python -m venv venv; . venv/bin/activate
      pip install pyperformance
      pyperformance run -o ~/tmp/result.'"$variant"'.json
    '

and then examined the results with commands like:

  $ python -m pyperf compare_to --table -G \
      ~/tmp/result.{$before,$after}.json

Across all the benchmarks in the suite, the median speedup was 16%.
(Meaning 1.16x faster; 14% less time).

The middle half of them ranged from a 13% to a 22% speedup.

Each of the 60 benchmarks in the suite got faster, by speedups
ranging from 3% to 53%.

---

One reason this isn't just the default to begin with is that, until
recently, it made the build a lot slower.  What it does is turn on
profile-guided optimization, which means first build for profiling,
then run some task to get a profile, then build again using the
profile.  And, short of further customization, the task it would use
would be nearly the full test suite, which includes a lot of
expensive and slow tests, and can easily take half an hour to run.

Happily, in 2019 an upstream developer did the work to carefully
select a more appropriate set of tests to use for the profile:
  python/cpython@4e16a4a31
  https://bugs.python.org/issue36044
This suite takes just 2 minutes to run.  And the resulting final
build is actually slightly faster than with the much longer suite,
at least as measured by those standard "pyperformance" benchmarks.
That work went into the 3.8 release, but the same list works great
if used on older releases too.

So, start passing that --enable-optimizations flag; and backport
that good-for-PGO set of tests, so that we use it on all releases.

gnprice commented May 12, 2020

gnprice requested review from edolstra, Infinisil, jonringer, nbp and Profpatsch as code owners 1 minute ago

Sorry about this bit -- I accidentally rebased the branch onto current master. Undid that, but it seems the code-owners review requests persist.


gnprice commented May 12, 2020

Thanks @FRidh and @jtojnar for the reviews! Pushed a version that makes all the suggested changes from the last round of review. Please take a look.

Tested this new version of this PR on all Python versions -- attributes python39, python38, python37, python36, python35, python27 -- and the build logs show the intended changes are present. I haven't re-run benchmarks, because the changes from the previous version of this PR are only in how the patches are organized and applied, and the resulting build is unchanged.

@ofborg ofborg bot requested a review from FRidh May 12, 2020 08:22
@drewrisinger (Contributor) left a comment:

Diff LGTM. Tested by running nixpkgs-review pr 84072 -p python37, and got the following error (it seems to be some artifact of my local PC, hence the approval):

/build/ccy2ZovP.s: Assembler messages:
/build/ccy2ZovP.s: Fatal error: can't write 62 bytes to section .text of tree-ssa-phiopt.o: 'No space left on device'
/build/ccy2ZovP.s: Fatal error: can't close tree-ssa-phiopt.o: No space left on device
make[3]: *** [Makefile:1116: tree-ssa-phiopt.o] Error 1
make[3]: *** Waiting for unfinished jobs....
../../gcc-9.3.0/gcc/tree-ssa-math-opts.c:3865:1: fatal error: error writing to /build/ccmoyYZ3.s: No space left on device
 3865 | }
      | ^
compilation terminated.
make[3]: *** [Makefile:1116: tree-ssa-math-opts.o] Error 1
../../gcc-9.3.0/gcc/tree-ssa-reassoc.c:6355:1: fatal error: error writing to /build/ccxL3Bnc.s: No space left on device
 6355 | }
      | ^
compilation terminated.
make[3]: *** [Makefile:1116: tree-ssa-reassoc.o] Error 1
rm gcc.pod
make[3]: Leaving directory '/build/build/gcc'
make[2]: *** [Makefile:4830: all-stageprofile-gcc] Error 2
make[2]: Leaving directory '/build/build'
make[1]: *** [Makefile:22762: stageprofile-bubble] Error 2
make[1]: Leaving directory '/build/build'
make: *** [Makefile:23016: profiledbootstrap] Error 2
builder for '/nix/store/z272zdw8hqcii4bl8l07vzd60qf6y87y-gcc-9.3.0.drv' failed with exit code 2


gnprice commented May 13, 2020

/build/ccy2ZovP.s: Fatal error: can't write 62 bytes to section .text of tree-ssa-phiopt.o: 'No space left on device'
/build/ccy2ZovP.s: Fatal error: can't close tree-ssa-phiopt.o: No space left on device
make[3]: *** [Makefile:1116: tree-ssa-phiopt.o] Error 1
make[3]: *** Waiting for unfinished jobs....
../../gcc-9.3.0/gcc/tree-ssa-math-opts.c:3865:1: fatal error: error writing to /build/ccmoyYZ3.s: No space left on device

Thanks!

Yeah, I find that the rebuild of GCC requires something over 5 GB of free disk space. The bulk of that gets freed as soon as that build is done, so it doesn't accumulate; but between that and the stuff that does accumulate, I had to spend some time freeing up space elsewhere on my disk when I started working on this PR.

@drewrisinger (Contributor):

Yeah, I find that the rebuild of GCC requires something over 5 GB of free disk space. The bulk of that gets freed as soon as that build is done, so it doesn't accumulate; but between that and the stuff that does accumulate, I had to spend some time freeing up space elsewhere on my disk when I started working on this PR.

Yeah, I have ~500 GB free, so no idea what's causing this. Maybe a swap partition or something. Oh well. Not really relevant.


andersk commented May 13, 2020

@drewrisinger Perhaps you have boot.tmpOnTmpfs = true? Since tmpfs by default is limited to half of physical RAM, you might try systemd.services.nix-daemon.environment.TMPDIR = "/var/tmp";.
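
As a NixOS configuration snippet, that suggestion would look something like:

  # configuration.nix (sketch): point the nix-daemon's TMPDIR at real
  # disk instead of tmpfs
  { config, pkgs, ... }: {
    systemd.services.nix-daemon.environment.TMPDIR = "/var/tmp";
  }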

@drewrisinger (Contributor):

@drewrisinger Perhaps you have boot.tmpOnTmpfs = true? Since tmpfs by default is limited to half of physical RAM, you might try systemd.services.nix-daemon.environment.TMPDIR = "/var/tmp";.

I was building on Ubuntu 18.04 LTS, not NixOS. That could cause some differences.

In particular this will let us use patches that apply to configure.ac.
I took a close look at how Debian builds the Python interpreter,
because I noticed it ran substantially faster than the one in nixpkgs
and I was curious why.

One thing that I found made a material difference in performance was
this pair of linker flags (passed to the compiler):

    -Wl,-O1 -Wl,-Bsymbolic-functions

In other words, effectively the linker gets passed the flags:

    -O1 -Bsymbolic-functions

Doing the same thing in nixpkgs turns out to make the interpreter
run about 6% faster, which is quite a big win for such an easy
change.  So, let's apply it.

---

I had not known there was a `-O1` flag for the *linker*!
But indeed there is.

These flags are unrelated to "link-time optimization" (LTO), despite
the latter's name.  LTO means doing classic compiler optimizations
on the actual code, at the linking step when it becomes possible to
do them with cross-object-file information.  These two flags, by
contrast, cause the linker to make certain optimizations within the
scope of its job as the linker.

Documentation is here, though sparse:
  https://sourceware.org/binutils/docs-2.31/ld/Options.html

The meaning of -O1 was explained in more detail in this LWN article:
  https://lwn.net/Articles/192624/
Apparently it makes the resulting symbol table use a bigger hash
table, so the load factor is smaller and lookups are faster.  Cool.

As for -Bsymbolic-functions, the documentation indicates that it's a
way of saving lookups through the symbol table entirely.  There can
apparently be situations where it changes the behavior of a program,
specifically if the program relies on linker tricks to provide
customization features:
  https://bugs.launchpad.net/ubuntu/+source/xfe/+bug/644645
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637184#35
But I'm pretty sure CPython doesn't permit that kind of trick: you
don't load a shared object that tries to redefine some symbol found
in the interpreter core.

The stronger reason I'm confident using -Bsymbolic-functions is
safe, though, is empirical.  Both Debian and Ubuntu have been
shipping a Python built this way since forever -- it was introduced
for Python 2.4 and 2.5 in Ubuntu "hardy" and Debian "lenny",
released in 2008 and 2009.  In those 12 years they haven't seen a
need to drop this flag; and I've been unable to locate any reports
of trouble related to it, either on the Web in general or on the
Debian bug tracker.  (There are reports of a handful of other
programs breaking with it, but not Python/CPython.)  So that seems
like about as thorough testing as one could hope for.

---

As for the performance impact: I ran CPython upstream's preferred
benchmark suite, "pyperformance", in the same way as described in
the previous commit.  On top of that commit's change, the results
across the 60 benchmarks in the suite are:

The median is 6% faster.

The middle half (aka interquartile range) is from 4% to 8% faster.

Out of 60 benchmarks, 3 come out slower, by 1-4%.  At the other end,
5 are at least 10% faster, and one is 17% faster.

So, that's quite a material speedup!  I don't know how big the
effect of these flags is for other software; but certainly CPython
tends to do plenty of dynamic linking, as that's how it loads
extension modules, which are ubiquitous in the stdlib as well as
popular third-party libraries.  So perhaps that helps explain why
optimizing the dynamic linker has such an impact.

gnprice commented May 14, 2020

Thanks @FRidh and @jtojnar for the further reviews! Just pushed a version which I believe makes all the suggested changes.

I've tested this version in the same way as the previous version:

Tested this new version of this PR on all Python versions -- attributes python39, python38, python37, python36, python35, python27 -- and the build logs show the intended changes are present.

One of the patches (for 3.7) comes via plain fetchurl instead of fetchpatch, because the latter produced the following error:

$ nix-build . -A python3Minimal 
error: anonymous function at
  /home/greg/n/nix/pkgs/pkgs/build-support/fetchurl/boot.nix:5:1
  called with unexpected argument 'meta', at
  /home/greg/n/nix/pkgs/pkgs/build-support/fetchpatch/default.nix:18:1
(use '--show-trace' to show detailed location information)

From the discussion in this thread about a similar error:
https://discourse.nixos.org/t/how-to-patch-in-an-overlay/3678
I believe what triggers that is that 3.7 is used in the transitive dependencies of fetchpatch itself.


FRidh commented May 14, 2020

Thanks. Right, fetchpatch can't be used in bootstrapping. In that case it is better to include the patch in the tree, because we do not want the checksum to suddenly become incorrect.


gnprice commented May 14, 2020

In that case it is better to include the patch in the tree, because we do not want the checksum to be incorrect suddenly.

Can you describe the scenario in which you're concerned that the checksum might change?

As @jtojnar observed upthread, we're downloading here from a Git repository. It's not asking the server to generate a Git diff between two files (or trees) -- rather it's downloading the blob that's at a specific path in a specific commit's tree. That blob happens to be a diff, which is of interest to us, but it's not something the server (or the Git repo) knows at all; this isn't a case where the server could decide to tweak the diff format.

@doronbehar (Contributor):

Usually I don't bump, but I can't wait to see the effects of this PR 😋.

@FRidh FRidh merged commit a2be64b into NixOS:staging Jun 4, 2020
Staging automation moved this from Needs review to Done Jun 4, 2020

FRidh commented Jun 4, 2020

fetchpatch is causing trouble #89489


gnprice commented Jun 6, 2020

Thanks @FRidh and everyone for the reviews and helping push this through!

I am curious about the trouble. I figure that new thread is the best place to discuss that, so I'll ask there.


FRidh commented Dec 30, 2020

PR #107965 disables the optimizations again, for reproducibility.

@nixos-discourse:

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/why-is-the-nix-compiled-python-slower/18717/13
