New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suspend speedup #3137
Suspend speedup #3137
Conversation
…ot spawn a thread)
…r::suspend/resume Suspending/resuming directly without spawning a new thread is significantly faster than using the callback versions and waiting for the callbacks to signal completion.
Nice! Thanks! |
src/runtime_impl.cpp
Outdated
int runtime_impl::suspend() | ||
{ | ||
std::uint32_t initial_num_localities = get_initial_num_localities(); | ||
if (initial_num_localities > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale of this limitation? Wouldn't it be perfectly ok to suspend the runtime on one of the nodes at a time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply that I haven't tested it in those situations, and I did not think it would be useful. But you're right, it should be okay in those cases as well. Do you see a use case for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Providing a migration path for distributed MPI+X applications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not have to be part of this PR, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'd like to keep it separate in that case so that it can go in with appropriate tests/examples showing how it would be used in that case (and to see if there are any corner cases to be aware of).
hpx::util::detail::yield_while([rt]() | ||
{ | ||
return rt->get_state() < hpx::state_running; | ||
}, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tells me that a) yield_while
shouldn't be in detail namespace and b) that its second argument should have a default value (or it has to be exposed through a different API).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. Would it be okay if I move only yield_while
into hpx/util/yield_while.hpp
and leave yield_k
where it is? I see yield_k
as being a lower level helper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Could be done in a separate PR, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, good. I'll do this in a separate PR as soon as this is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
For the record, this is what the timings look like. This is with the The openmp benchmark is a There is most likely some room to improve further but this is already good enough for blocks of work of at least 0.1 s. |
@msimberg very nice results! Would you mind adding the benchmark itself to the |
Don't mind at all, will add them soon. |
This speeds up #3129.
Proposed Changes
suspend_direct
andresume_direct
tothread_pool_base
which don't spawn a new thread for suspending/resuming but instead callsuspend/resume_internal
directly.suspend_direct
andresume_direct
inthreadmanager::suspend/resume
to speed up suspend/resume.suspend/resume_processing_unit_internal
which wait for the threads to finish suspending/resuming.I will post some more detailed benchmarks later, but rough performance numbers on dual-socket 18-core Xeon machine are (depends of course on the number of threads used):
#pragma parallel for
region: 100 µs