Skip to content

Implement some operations of R3Element using intrinsics #1722

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Feb 17, 2018

Conversation

pleroy
Copy link
Member

@pleroy pleroy commented Feb 15, 2018

According to IACA the throughput of these operations goes from 3.00 to 2.00 cycles on Haswell, and this is reflected by a gain of up to 40% on the benchmarks of the polynomials. As expected, the effect is more pronounced for Estrin evaluation.

Before:

------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time           CPU Iterations
------------------------------------------------------------------------------------------------------------
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4        5502 ns       5616 ns     100000
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8       10087 ns      10013 ns      74786
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12      13518 ns      13455 ns      49857
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16      19715 ns      19608 ns      37393
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4        5124 ns       5148 ns     100000
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8       10750 ns      10708 ns      64102
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12      16643 ns      16062 ns      40792
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16      24517 ns      24475 ns      28045

After:

------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time           CPU Iterations
------------------------------------------------------------------------------------------------------------
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/4        4041 ns       4068 ns     172583
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/8        7448 ns       7509 ns     112179
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/12       9890 ns      10013 ns      74786
BM_EvaluatePolynomialInMonomialBasisDisplacement<EstrinEvaluator>/16      13873 ns      13907 ns      56089
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/4        4393 ns       4381 ns     160255
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/8        9637 ns       9804 ns      74786
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/12      15384 ns      15332 ns      49857
BM_EvaluatePolynomialInMonomialBasisDisplacement<HornerEvaluator>/16      22898 ns      22424 ns      29914

@ts826848
Copy link
Contributor

Clang on macOS can't find <intrin.h>. Including <x86intrin.h> seems to do the trick, though.

@eggrobin
Copy link
Member

We should probably just use <emmintrin.h>, see https://stackoverflow.com/questions/11228855/header-files-for-x86-simd-intrinsics.

@pleroy
Copy link
Member Author

pleroy commented Feb 17, 2018

There is no x86intrin.h on MSVC. I am going for nmmintrin.h (n as in Nehalem) which hopefully exists everywhere. This gives us all of SSE and none of AVX, which is what we want at this point.

@@ -33,6 +39,10 @@ R3Element<Scalar>::R3Element(Scalar const& x,
Scalar const& y,
Scalar const& z) : x(x), y(y), z(z) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the body of the constructor:

static_assert(std::is_standard_layout<R3Element>::value, "blah");

This should ensure that the union is safe, because of the rules pertaining to standard-layout unions (or at least safer; __m128d is heavily implementation-defined, so it could in theory fail to be layout-compatible with doubles).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#if PRINCIPIA_USE_SSE2_INTRINSICS
__m128d const left_128d = ToM128D(left);
return R3Element<Scalar>(_mm_mul_pd(left_128d, right.xy),
_mm_mul_sd(left_128d, right.zt));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems nicer to do

_mm_mul_sd(right.zt, left_128d)

for the second one, so that the result has right.t; in particular, if we wanted to ensure that t=0 this would preserve the invariant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may make sense to use _mm_set_sd(left) (and a suitable wrapper in the Quantity case) as the operand for the scalar multiplication, so that we are not waiting for the unpack generated from _mm_set1_pd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the constant argument to the right, although it doesn't seem to have much effect. Not changing to _mm_set_sd as it burns an extra register.

__m128d const left_128d = ToM128D(left);
return R3Element<Product<Quantity<LDimension>, RScalar>>(
_mm_mul_pd(left_128d, right.xy),
_mm_mul_sd(left_128d, right.zt));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@eggrobin eggrobin added the LGTM label Feb 17, 2018
@pleroy pleroy merged commit 66f8f20 into mockingbirdnest:master Feb 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants