-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FMA in polynomial evaluation #3023
Conversation
…iler does not inline that it fails at common subexpression elimination and we are in unpcklpd hell
numerics/polynomial_evaluators.hpp
Outdated
@@ -27,8 +27,11 @@ struct EstrinEvaluator { | |||
degree, | |||
internal_polynomial_evaluators::EstrinEvaluator>::Coefficients; | |||
|
|||
template<bool fma> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put the fma
bit in the main template parameter part of EstrinEvaluator
(same for Horner). No point in multiplying the levels of templating, it's not like the syntax is beautiful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose is to hide this implementation detail from the signature of the evaluator class, which is part of the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good point. That would be ugly.
auto const b = | ||
InternalEstrinEvaluator<Value, Argument, degree, low, m - 1, fma>:: | ||
Evaluate(coefficients, argument, argument_squares); | ||
if constexpr (fma) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be tempting to pass a pointer-to-function instead of the disgraceful fma
bit, and have a helper function to compute a * b + c
in the non-FMA case. Unfortunately, that's probably hard because writing the parameter types for that function would be hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and naming the « multiply-and-add, maybe fused but maybe not » operation seems weird—the FMA calls are already ugly but at least here they parallel a nice a * b + c.
Unrelated to your change, but the benchmarks have some weirdness:
What happened to
Median CPU time of 0? Really? |
The CPU time counter from the OS is noisy anyway, since we are not doing I/O etc. we are better looking at time (or rtdsc) |
numerics/polynomial_body.hpp
Outdated
PolynomialInMonomialBasis<Value_, Argument_, degree_, Evaluator>::operator()( | ||
Argument const& argument) const { | ||
return UseHardwareFMA | ||
? Evaluator<Value, Difference<Argument>, degree_>::Evaluate< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give a name to Evaluator<...>
and use it in both branch to reduce the visual noise. (Several places.)
retest this please |
template<typename Value, typename Argument, int degree, bool fma> | ||
FORCE_INLINE(inline) Derivative<Value, Argument, degree> | ||
InternalHornerEvaluator<Value, Argument, degree, fma, degree>:: | ||
EvaluateDerivative(Coefficients const& coefficients, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See if the old formatting would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope.
On Kaby Lake, this makes Horner evaluation about 2× faster and Estrin 1.5× faster at high degrees.
The dispatching to the FMA vs. no-FMA implementation is done at the same time as the dispatching on degree, so that, when evaluating polynomials of unknown degree, we immediately jump to the correct evaluation code and not to a branch on FMA availability.
Prior to this change, the call to the (virtual)
operator()
could not be inlined (see #3018), so that polynomial evaluations even of known low degree were never inline. NowEvaluateWith[out]FMA
is the call that cannot be inlined.We could add an
operator()
onPolynomialInMonomialBasis
that callsEvaluateWith[out]FMA
non-virtually so that evaluations of polynomials of small degree may be inlined at the compiler’s discretion; I will investigate that later.Benchmark results below.
Note that the
FORCE_INLINE
really want to be on the definitions, otherwise the compiler fails to inline some things, and more importantly then fails at common subexpression elimination forunpcklpd
(see #1731 (comment)), making the new implementation with FMA no faster than the old version without.Before:
After: