Implement some operations of R3Element using intrinsics #1722

pleroy · 2018-02-15T21:27:09Z

According to IACA, the throughput of these operations goes from 3.00 to 2.00 cycles on Haswell, and this is reflected in a gain of up to 40% on the polynomial benchmarks. As expected, the effect is more pronounced for Estrin evaluation.

Before:

------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time           CPU Iterations
------------------------------------------------------------------------------------------------------------
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4        5502 ns       5616 ns     100000
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8       10087 ns      10013 ns      74786
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12      13518 ns      13455 ns      49857
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16      19715 ns      19608 ns      37393
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4        5124 ns       5148 ns     100000
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8       10750 ns      10708 ns      64102
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12      16643 ns      16062 ns      40792
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16      24517 ns      24475 ns      28045

After:

------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time           CPU Iterations
------------------------------------------------------------------------------------------------------------
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4        4041 ns       4068 ns     172583
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8        7448 ns       7509 ns     112179
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12       9890 ns      10013 ns      74786
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16      13873 ns      13907 ns      56089
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4        4393 ns       4381 ns     160255
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8        9637 ns       9804 ns      74786
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12      15384 ns      15332 ns      49857
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16      22898 ns      22424 ns      29914
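
For readers unfamiliar with the two evaluators named in the benchmarks, here is a minimal sketch of both schemes for a degree-3 polynomial with plain double coefficients. The function names Horner3 and Estrin3 are hypothetical and this is not the Principia implementation, which works on R3Element coefficients. It only illustrates why Estrin evaluation might profit more from cheaper multiplications: its two halves do not depend on each other, whereas each Horner step waits for the previous one.

```cpp
#include <array>

// Horner: ((a3 * x + a2) * x + a1) * x + a0.
// Every multiplication depends on the result of the previous step.
inline double Horner3(std::array<double, 4> const& a, double const x) {
  return ((a[3] * x + a[2]) * x + a[1]) * x + a[0];
}

// Estrin: (a0 + a1 * x) + x² * (a2 + a3 * x).
// The two parenthesized halves and x² can be computed independently.
inline double Estrin3(std::array<double, 4> const& a, double const x) {
  double const x2 = x * x;
  return (a[0] + a[1] * x) + x2 * (a[2] + a[3] * x);
}
```

Both functions compute the same polynomial; only the dependency structure differs.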

ts826848 · 2018-02-15T23:38:41Z

Clang on macOS can't find <intrin.h>. Including <x86intrin.h> seems to do the trick, though.

eggrobin · 2018-02-16T00:07:09Z

We should probably just use <emmintrin.h>, see https://stackoverflow.com/questions/11228855/header-files-for-x86-simd-intrinsics.

pleroy · 2018-02-17T10:14:14Z

There is no x86intrin.h on MSVC. I am going for nmmintrin.h (n as in Nehalem) which hopefully exists everywhere. This gives us all of SSE and none of AVX, which is what we want at this point.
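
A minimal sketch of what this include provides, assuming a reasonably recent x86-64 compiler: <nmmintrin.h> pulls in the SSE headers up to SSE4.2, so the SSE2 type __m128d and its operations are available. MultiplyPairs is a hypothetical helper for illustration, not something from this PR, and it only uses SSE2 intrinsics.

```cpp
#include <nmmintrin.h>  // SSE up to SSE4.2 (Nehalem); no AVX.

// Hypothetical helper: multiplies two pairs of doubles lane by lane.
// `a`, `b` and `out` must each point to at least two doubles; they need not
// be 16-byte aligned because the unaligned load/store intrinsics are used.
inline void MultiplyPairs(double const* const a,
                          double const* const b,
                          double* const out) {
  __m128d const va = _mm_loadu_pd(a);
  __m128d const vb = _mm_loadu_pd(b);
  _mm_storeu_pd(out, _mm_mul_pd(va, vb));
}
```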

eggrobin · 2018-02-17T11:13:27Z

geometry/r3_element_body.hpp

@@ -33,6 +39,10 @@ R3Element<Scalar>::R3Element(Scalar const& x,
                             Scalar const& y,
                             Scalar const& z) : x(x), y(y), z(z) {}


In the body of the constructor:

static_assert(std::is_standard_layout<R3Element>::value, "blah");

This should ensure that the union is safe, because of the rules pertaining to standard-layout unions (or at least safer; __m128d is heavily implementation-defined, so it could in theory fail to be layout-compatible with doubles).
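
The kind of compile-time check suggested here could be sketched as follows. This is an illustration under assumptions, not the PR's code: the names Coordinates, Registers, R3Sketch, MakeSketch and Z are hypothetical, the union members are named rather than anonymous, and a padding coordinate t is assumed so that z shares a register with something. The extra sizeof and offsetof assertions go slightly beyond the suggestion; they catch the case where __m128d is not laid out like two doubles.

```cpp
#include <emmintrin.h>

#include <cstddef>
#include <type_traits>

// Hypothetical view of the element as four doubles (t is padding).
struct Coordinates {
  double x;
  double y;
  double z;
  double t;
};

// Hypothetical view of the same storage as two SSE2 registers.
struct Registers {
  __m128d xy;
  __m128d zt;
};

union R3Sketch {
  Coordinates coordinates;
  Registers registers;
};

static_assert(std::is_standard_layout<R3Sketch>::value,
              "R3Sketch must be standard-layout");
static_assert(sizeof(Coordinates) == sizeof(Registers),
              "The two views must have the same size");
static_assert(offsetof(Coordinates, z) == offsetof(Registers, zt),
              "z must sit in the low lane of zt");

// Builds the element through the register view.
inline R3Sketch MakeSketch(double const x, double const y, double const z) {
  R3Sketch result;
  result.registers.xy = _mm_set_pd(y, x);  // _mm_set_pd takes (high, low).
  result.registers.zt = _mm_set_sd(z);     // Low lane z, high lane 0.
  return result;
}

// Reads z from the register view, without touching the other union member.
inline double Z(R3Sketch const& element) {
  return _mm_cvtsd_f64(element.registers.zt);
}
```

The assertions do not make reading the inactive union member well-defined in standard C++; they only document the layout that the compiler-specific behaviour relies on.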

eggrobin · 2018-02-17T11:28:21Z

geometry/r3_element_body.hpp

+#if PRINCIPIA_USE_SSE2_INTRINSICS
+  __m128d const left_128d = ToM128D(left);
+  return R3Element<Scalar>(_mm_mul_pd(left_128d, right.xy),
+                           _mm_mul_sd(left_128d, right.zt));


It seems nicer to do

_mm_mul_sd(right.zt, left_128d)

for the second one, so that the result has right.t; in particular, if we wanted to ensure that t=0 this would preserve the invariant.

It may make sense to use _mm_set_sd(left) (and a suitable wrapper in the Quantity case) as the operand for the scalar multiplication, so that we are not waiting for the unpack generated from _mm_set1_pd.

pleroy · 2018-02-17

Moved the constant argument to the right, although it doesn't seem to have much effect. Not changing to _mm_set_sd as it burns an extra register.
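
The operand-order point in this thread can be sketched as follows. _mm_mul_sd multiplies the low lanes and copies the high lane from its first operand, so the order decides what ends up in t. The helper names are hypothetical, and _mm_set1_pd stands in for ToM128D, whose definition is not shown in this conversation.

```cpp
#include <emmintrin.h>

// Scalar first: the high lane of the result is `left`, so t is clobbered.
inline __m128d ScaleScalarFirst(double const left, __m128d const zt) {
  __m128d const left_128d = _mm_set1_pd(left);  // (left, left)
  return _mm_mul_sd(left_128d, zt);
}

// Vector first: the high lane of the result is zt's high lane, so t is kept.
inline __m128d ScaleVectorFirst(double const left, __m128d const zt) {
  __m128d const left_128d = _mm_set1_pd(left);  // (left, left)
  return _mm_mul_sd(zt, left_128d);
}

// Extracts the high lane of a register as a double.
inline double HighLane(__m128d const v) {
  return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}
```

With zt holding z = 3 and t = 0 and left = 2, both variants produce 6 in the low lane, but only the second keeps t = 0 in the high lane.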

eggrobin · 2018-02-17T11:29:15Z

geometry/r3_element_body.hpp

+  __m128d const left_128d = ToM128D(left);
+  return R3Element<Product<Quantity<LDimension>, RScalar>>(
+      _mm_mul_pd(left_128d, right.xy),
+      _mm_mul_sd(left_128d, right.zt));


Same as above.

pleroy added 15 commits January 14, 2018 14:32

Trying intrinsics.

14ae96c

Better performance with unions.

e304267

Merge branch 'master' into Intrinsics

1c6f90d

Rewrote many things with intrinsics, unsure if == and Dot make sense.

01d57a5

Merge branch 'master' into Intrinsics

d728075

Include path for IACA.

4900933

Optimizations that break everything.

8b91736

Comparisons that work.

6273611

Dot product.

2aaeef2

Norm2

07c4cd3

Test tolerances and cleanup.

e7593c4

Cleanup.

dfb94fc

Avoid load1_pd.

8eb55ee

Simpler Norm² and cleanup

34f5634

Quantities conversions.

b241dfa

pleroy added 3 commits February 16, 2018 20:04

IACA for polynomials.

2bbefcc

Merge branch 'master' into Intrinsics

e00b009

Include cleanup and IACA.

efe94c0

Lint.

aa2ec37

eggrobin approved these changes Feb 17, 2018

eggrobin added the LGTM label Feb 17, 2018

pleroy added 3 commits February 17, 2018 13:26

After egg's review.

4395ae7

0

bce1815

IACA

cffde33

pleroy merged commit 66f8f20 into mockingbirdnest:master Feb 17, 2018
