Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FMA in polynomial evaluation #3023

Merged
merged 22 commits into from
Jun 8, 2021
Merged

Conversation

eggrobin
Copy link
Member

@eggrobin eggrobin commented Jun 2, 2021

On Kaby Lake, this makes Horner evaluation about 2× faster and Estrin 1.5× faster at high degrees.
The dispatching to the FMA vs. no-FMA implementation is done at the same time as the dispatching on degree, so that, when evaluating polynomials of unknown degree, we immediately jump to the correct evaluation code and not to a branch on FMA availability.

Prior to this change, the call to the (virtual) operator() could not be inlined (see #3018), so that polynomial evaluations even of known low degree were never inline. Now EvaluateWith[out]FMA is the call that cannot be inlined.
We could add an operator() on PolynomialInMonomialBasis that calls EvaluateWith[out]FMA non-virtually so that evaluations of polynomials of small degree may be inlined at the compiler’s discretion; I will investigate that later.
Benchmark results below.

Note that the FORCE_INLINE really want to be on the definitions, otherwise the compiler fails to inline some things, and more importantly then fails at common subexpression elimination for unpcklpd (see #1731 (comment)), making the new implementation with FMA no faster than the old version without.
Before:

.\Release\x64\benchmarks.nofma.exe --benchmark_filter='BM_EvaluatePolynomial.*' --benchmark_repetitions=100 --benchmark_report_aggregates_only=true
2021-06-02T23:37:31+02:00
Running C:\Users\robin\Projects\mockingbirdnest\Principia\Release\x64\benchmarks.nofma.exe
Run on (8 X 2904 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                            Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/4_mean                3499 ns         3484 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/4_median              3455 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/4_stddev               171 ns          661 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/8_stddev              1147 ns         7196 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/12_mean               6734 ns         6641 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/12_median             6467 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/12_stddev             1160 ns          926 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_mean               7741 ns         7812 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_median             7616 ns         7812 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_stddev              682 ns         7852 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/4_mean              3648 ns         3750 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/4_median            3421 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/4_stddev             658 ns         6707 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/8_mean              4797 ns         4734 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/8_median            4649 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/8_stddev             617 ns          683 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/12_mean             6810 ns         6516 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/12_median           6330 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/12_stddev           1715 ns          917 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/16_median           7952 ns         7812 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/4_mean          5453 ns         5438 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/4_median        5384 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/8_mean          9093 ns         9375 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/8_median        8122 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/8_stddev        2936 ns        37294 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/12_mean        11856 ns        12031 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/12_median      11559 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/12_stddev       1513 ns         6609 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/16_mean        14433 ns        14063 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/16_median      14283 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/16_stddev        921 ns         4711 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4_mean          5512 ns         5484 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4_median        5440 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4_stddev         273 ns          785 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8_mean          8744 ns         8750 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8_median        8559 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8_stddev        1091 ns         7795 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12_mean        11524 ns        11562 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12_median      11241 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12_stddev        934 ns         6888 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16_mean        14724 ns        14688 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16_median      14457 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16_stddev       1352 ns         4341 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/4_mean                3510 ns         3500 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/4_median              3472 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/4_stddev               148 ns          671 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/8_mean                5688 ns         5656 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/8_median              5604 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/8_stddev               366 ns          794 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/12_mean               9017 ns         8844 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/12_median             8928 ns         9375 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/12_stddev              285 ns          894 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/16_mean              13354 ns        13438 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/16_median            13086 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/16_stddev             1106 ns         5884 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/4_mean              3742 ns         3719 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/4_median            3688 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/4_stddev             262 ns          794 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/8_mean              6097 ns         5938 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/8_median            5917 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/8_stddev             802 ns         7622 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/12_mean             9510 ns         9531 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/12_median           9407 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/12_stddev            806 ns         7659 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/16_mean            13928 ns        13906 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/16_median          13838 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/16_stddev            842 ns         4914 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/4_mean          5229 ns         5312 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/4_median        5041 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/4_stddev         673 ns         7439 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/8_mean         11035 ns        11094 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/8_median       10255 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/8_stddev        1877 ns         7126 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/12_mean        17063 ns        17188 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/12_median      16755 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/12_stddev       1397 ns         4711 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/16_mean        25351 ns        25156 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/16_median      25017 ns        31250 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/16_stddev       1183 ns         7659 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4_mean          5271 ns         5266 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4_median        5171 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4_stddev         262 ns          758 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8_mean         10525 ns        10625 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8_median       10214 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8_stddev        1316 ns         7325 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12_mean        17091 ns        17031 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12_median      16701 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12_stddev       1369 ns         4494 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16_mean        25442 ns        25313 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16_median      25208 ns        31250 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16_stddev       1372 ns         7622 ns          100

After:

.\Release\x64\benchmarks.exe --benchmark_filter='BM_EvaluatePolynomial.*' --benchmark_repetitions=100 --benchmark_report_aggregates_only=true
2021-06-03T00:38:41+02:00
Running C:\Users\robin\Projects\mockingbirdnest\Principia\Release\x64\benchmarks.exe
Run on (8 X 2904 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                            Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/4_mean                3128 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/4_median              3110 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/4_stddev               142 ns          628 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/8_stddev               545 ns         7051 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/12_mean               5201 ns         5203 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/12_median             5176 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/12_stddev              150 ns          738 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_mean               6487 ns         6406 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_median             6174 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_stddev              702 ns         7724 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/4_mean              3528 ns         3531 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/4_median            3518 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/4_stddev            99.0 ns          689 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/8_mean              4598 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/8_median            4385 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/8_stddev             551 ns         7196 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/12_mean             5522 ns         5531 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/12_median           5525 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/12_stddev            183 ns          783 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/16_mean             6717 ns         6719 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/16_median           6492 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<EstrinEvaluator>/16_stddev            569 ns         7775 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/4_mean          4258 ns         4250 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/4_median        4234 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/4_stddev         167 ns          705 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/8_mean          6283 ns         6281 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/8_median        6250 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/8_stddev         223 ns          543 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/12_mean         7942 ns         7953 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/12_median       7926 ns         7812 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/12_stddev        166 ns          501 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/16_mean         9636 ns         9688 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/16_median       9488 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<EstrinEvaluator>/16_stddev        685 ns         7622 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4_mean          4253 ns         4250 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4_median        4246 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4_stddev         112 ns          705 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8_mean          6259 ns         6266 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8_median        6227 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8_stddev         159 ns          156 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12_mean         7682 ns         7812 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12_median       7483 ns         7812 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12_stddev        602 ns         7852 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16_mean         9706 ns         9672 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16_median       9613 ns         9375 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16_stddev        516 ns          726 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/4_mean                3162 ns         3172 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/4_median              3157 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/4_stddev              97.0 ns          413 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/8_mean                3982 ns         3984 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/8_median              3929 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/8_stddev               256 ns          781 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/12_mean               5490 ns         5500 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/12_median             5459 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/12_stddev              168 ns          785 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/16_mean               7644 ns         7656 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/16_median             7649 ns         7812 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<HornerEvaluator>/16_stddev              154 ns          471 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/4_mean              3118 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/4_median            3096 ns         3125 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/4_stddev             171 ns          314 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/8_mean              4282 ns         4297 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/8_median            4247 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/8_stddev             148 ns          715 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/12_mean             5998 ns         6000 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/12_median           6010 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/12_stddev            123 ns          617 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/16_mean             8183 ns         8188 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/16_median           8171 ns         7812 ns          100
BM_EvaluatePolynomialInMonomialBasisQuantity<HornerEvaluator>/16_stddev            172 ns          671 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/4_mean          3922 ns         3938 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/4_median        3915 ns         4688 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/4_stddev         125 ns          785 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/8_mean          6215 ns         6156 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/8_median        6176 ns         6250 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/8_stddev         330 ns          536 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/12_mean         9248 ns         9375 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/12_median       9069 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/12_stddev        753 ns         7693 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/16_mean        11670 ns        11719 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/16_median      11711 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisVectorDouble<HornerEvaluator>/16_stddev        672 ns         6800 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4_mean          3900 ns         3906 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4_median        3887 ns         3906 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4_stddev         114 ns          785 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8_mean          6151 ns         6094 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8_median        5858 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8_stddev         590 ns         7659 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12_mean         9385 ns         9375 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12_median       9192 ns        15625 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12_stddev       1065 ns         7693 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16_mean        11660 ns        11641 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16_median      11655 ns        10938 ns          100
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16_stddev        203 ns          842 ns          100

Argument const& argument) const = 0;
Value operator()(Argument const& argument) const;
Derivative<Value, Argument> EvaluateDerivative(
Argument const& argument) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite unhappy about this change. Polynomial was supposed to be a pure interface used for heterogeneous storage in containers. Having the FMA dispatching happen here looks like a code smell to me: conceptually we should be able to express polynomials in any base, and it's not clear that the FMA optimization is beneficial in all cases.

I would leave this class alone and do the FMA dispatching in PolynomialInMonomialBasis (in which case the with/without FMA functions don't need to be virtual, of course).

@@ -27,8 +27,11 @@ struct EstrinEvaluator {
degree,
internal_polynomial_evaluators::EstrinEvaluator>::Coefficients;

template<bool fma>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put the fma bit in the main template parameter part of EstrinEvaluator (same for Horner). No point in multiplying the levels of templating, it's not like the syntax is beautiful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose is to hide this implementation detail from the signature of the evaluator class, which is part of the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point. That would be ugly.

auto const b =
InternalEstrinEvaluator<Value, Argument, degree, low, m - 1, fma>::
Evaluate(coefficients, argument, argument_squares);
if constexpr (fma) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be tempting to pass a pointer-to-function instead of the disgraceful fma bit, and have a helper function to compute a * b + c in the non-FMA case. Unfortunately, that's probably hard because writing the parameter types for that function would be hard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and naming the « multiply-and-add, maybe fused but maybe not » operation seems weird—the FMA calls are already ugly but at least here they parallel a nice a * b + c.

@pleroy
Copy link
Member

pleroy commented Jun 3, 2021

Unrelated to your change, but the benchmarks have some weirdness:

BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/4_stddev               142 ns          628 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/8_stddev               545 ns         7051 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/12_mean               5201 ns         5203 ns          100

What happened to 8_mean and 8_median?

BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_mean               6487 ns         6406 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_median             6174 ns        0.000 ns          100
BM_EvaluatePolynomialInMonomialBasisDouble<EstrinEvaluator>/16_stddev              702 ns         7724 ns          100

Median CPU time of 0? Really?

@eggrobin
Copy link
Member Author

eggrobin commented Jun 3, 2021

The CPU time counter from the OS is noisy anyway, since we are not doing I/O etc. we are better looking at time (or rtdsc)

PolynomialInMonomialBasis<Value_, Argument_, degree_, Evaluator>::operator()(
Argument const& argument) const {
return UseHardwareFMA
? Evaluator<Value, Difference<Argument>, degree_>::Evaluate<
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give a name to Evaluator<...> and use it in both branch to reduce the visual noise. (Several places.)

@pleroy
Copy link
Member

pleroy commented Jun 6, 2021

retest this please

@pleroy pleroy added the LGTM label Jun 6, 2021
@eggrobin eggrobin closed this Jun 7, 2021
@eggrobin eggrobin reopened this Jun 7, 2021
template<typename Value, typename Argument, int degree, bool fma>
FORCE_INLINE(inline) Derivative<Value, Argument, degree>
InternalHornerEvaluator<Value, Argument, degree, fma, degree>::
EvaluateDerivative(Coefficients const& coefficients,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if the old formatting would work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope.

@eggrobin eggrobin merged commit 11ae570 into mockingbirdnest:master Jun 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants