
Suspend speedup #3137

Merged
merged 9 commits into from Feb 7, 2018

Conversation

msimberg (Contributor) commented Feb 2, 2018

This speeds up #3129.

Proposed Changes

  • Adds two member functions, suspend_direct and resume_direct, to thread_pool_base, which call suspend/resume_internal directly instead of spawning a new thread for suspending/resuming.
  • Uses suspend_direct and resume_direct in threadmanager::suspend/resume to speed up suspend/resume.
  • First notifies all threads that they should suspend/resume, and only then makes the blocking calls to suspend/resume_processing_unit_internal, which wait for the threads to finish suspending/resuming.

I will post some more detailed benchmarks later, but rough performance numbers on a dual-socket 18-core Xeon machine (depending, of course, on the number of threads used) are:

  • HPX start/stop: 50 ms
  • HPX suspend/resume: 500 µs
  • OpenMP: entering and exiting a #pragma omp parallel for region: 100 µs

@hkaiser hkaiser added this to the 1.1.0 milestone Feb 2, 2018
hkaiser (Member) commented Feb 2, 2018

Nice! Thanks!

    int runtime_impl::suspend()
    {
        std::uint32_t initial_num_localities = get_initial_num_localities();
        if (initial_num_localities > 1)
hkaiser (Member):
What is the rationale of this limitation? Wouldn't it be perfectly ok to suspend the runtime on one of the nodes at a time?

msimberg (Contributor, author):

Simply that I haven't tested it in those situations, and I did not think it would be useful. But you're right, it should be okay in those cases as well. Do you see a use case for it?

hkaiser (Member):

Providing a migration path for distributed MPI+X applications.

hkaiser (Member):

This does not have to be part of this PR, however.

msimberg (Contributor, author):

Ok, I'd like to keep it separate then, so that it can go in with appropriate tests/examples showing how it would be used (and to see if there are any corner cases to be aware of).

    hpx::util::detail::yield_while([rt]()
    {
        return rt->get_state() < hpx::state_running;
    }, "");
hkaiser (Member):
This tells me that a) yield_while shouldn't be in the detail namespace, and b) its second argument should have a default value (or it has to be exposed through a different API).

msimberg (Contributor, author):

Fair point. Would it be okay if I move only yield_while into hpx/util/yield_while.hpp and leave yield_k where it is? I see yield_k as being a lower level helper function.

hkaiser (Member):

Sounds good to me. Could be done in a separate PR, though.

msimberg (Contributor, author):

Ok, good. I'll do this in a separate PR as soon as this is merged.

hkaiser (Member) left a review comment:

LGTM, thanks!

@msimberg msimberg merged commit 4fb4a3d into STEllAR-GROUP:master Feb 7, 2018
msimberg (Contributor, author) commented Feb 15, 2018

For the record, this is what the timings look like. This is with the balanced thread binding, so one can see bumps at 19 and 37 threads, where execution spills over to the second NUMA domain and into hyperthreading, respectively.

The OpenMP benchmark is a #pragma omp parallel for region doing a simple addition to keep the work from being optimized away. For fairness, HPX spawns a number of empty tasks equal to the number of threads before suspending. Plain suspend/resume without any tasks running is at least a factor of two faster than shown in the graph.

[Figure: runtime-suspension]

There is most likely some room to improve further, but this is already good enough for blocks of work of at least 0.1 s.

Linear y-axis:
[Figure: runtime-suspension-linear]

Linear y-axis without HPX start/stop:
[Figure: runtime-suspension-linear-no-start-stop]

hkaiser (Member) commented Feb 15, 2018

@msimberg very nice results! Would you mind adding the benchmark itself to the tests/performance/local folder in the repo?

msimberg (Contributor, author):
Don't mind at all, will add them soon.
