New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempting to avoid data races in async_traversal while evaluating dataflow() #3001
Conversation
Ouch these typos..., thanks for correcting it. @hkaiser I think your change doesn't make any difference to the current revision because As I remind there was a check that should prevent that the finalizer of dataflow was called more than once: c852439#diff-040de9e61eef02088de58b48a421a1e5L514 . |
@Naios unfortunately this patch does not fix the problem (it makes it appear less often, though - but this could be coincidental). My current theory is that the container the iterators in https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/util/detail/pack_traversal_async_impl.hpp#L232 refer to goes out of scope too early. It might be noteworthy that I move the container into dataflow() which might be a use case we have not tested thoroughly enough. I will investigate further. |
Also, I think setting the flag before actually attaching a continuation to the current future changes the behavior as under certain circumstances the continuation might be executed during |
I think in principle, setting the |
The |
Don't the completion handlers concurrently access the variable? |
No the only state which is shared across the suspensions are the template <typename T, typename Hierarchy>
void async_continue(T&& value, Hierarchy&& hierarchy)
{
// Create a self reference
boost::intrusive_ptr<async_traversal_frame> self(this);
// Create a callable object which resumes the current
// traversal when it's called.
auto resumable = make_resume_traversal_callable(
std::move(self), std::forward<Hierarchy>(hierarchy));
// Invoke the visitor with the current value and the
// callable object to resume the control flow.
util::invoke(visitor(), async_traverse_detach_tag{},
std::forward<T>(value), std::move(resumable));
} where the In addition to that since we are resolving the futures one by one after each other we wouldn't encounter any threading issues at all I think. |
Whatever causes the problem, the current code clearly exposes data races in some way. For me the iterators to the shared data start to point into nowhere if dataflow is used when running on more than one thread. |
A minimal testcase would help. Did you try using `std::atomic_flag` instead
of `bool` for `detached_`?
|
I don't have a minimal use-case at the moment. Also, I have not tried using a |
@Naios I think I understand what is going on now. I however have a hard time changing the code to fix it, maybe you can help. The issue is that the current code creates a |
- flyby: fixed possible use after move
@hkaiser I'm not sure what is going on there, are you open for a call tomorrow? Theoretically every future should be traversed only once, thus the traversal order is stricly ordered. |
@Naios the latest commit fixes the issue I was seeing. Please verify that I have not broken anything. |
- fixed another potential use after move problem - fixed memory leak (dataflow was leaking its shared state) - simplified function operator implementation for resume_state_callable
d0fb5b7
to
f1285f3
Compare
return typename types::frame_pointer_type(ptr); | ||
// Create an intrusive_ptr from the heap object, don't increase | ||
// reference count (it's already 'one'). | ||
return typename types::frame_pointer_type(ptr, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch here 👍
@@ -514,8 +553,7 @@ namespace util { | |||
template <typename Frame, typename State> | |||
void resume_traversal_callable<Frame, State>::operator()() | |||
{ | |||
auto hierarchy = | |||
util::tuple_cat(util::make_tuple(frame_), std::move(state_)); | |||
auto hierarchy = util::tuple_cat(util::make_tuple(frame_), state_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every resume_traversal_callable
instance should only be called once, so the move should be valid here.
Maybe we could guard this through an assertion.
Which issue did occur with this move?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Frankly, I was not sure whether this may have caused my issues. I will revert this change and add an assertion making sure this doesn't cause any problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, the same might be true for this change: https://github.com/STEllAR-GROUP/hpx/pull/3001/files#diff-a33368bac14c4ebae471f68e8cb9da76R381
for (/**/; !range.is_finished(); ++range) | ||
{ | ||
async_traverse_one(range); | ||
if (is_detached()) // test before increment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm wondering here about, is that the iterator isn't modified inside async_traverse_one
.
When !range.is_finished()
it should be perfectly valid to increment the iterator.
In the worst case we finish the range but in this case it should never yield an iterator dereferencation error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the iterators are not changed inside async_traverse_one
. However this function might move away the vector those iterators refer to (see the std::move(args_)
here:
visitor(), async_traverse_complete_tag{}, std::move(args_)); |
@@ -275,14 +300,15 @@ namespace util { | |||
{ | |||
Frame frame_; | |||
tuple<Hierarchy...> hierarchy_; | |||
bool detached_; | |||
std::atomic<bool>& detached_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The usage of the reference is a great simplification 👍
- From my point of understanding making the
detached_
variable atomic doesn't make any difference here since as described earlier the variable isn't shared across threads because we create a local variable for every resumption. So this can be ellided for sure. My assumption is that I forgot to propagate the value of the variable back somewhere. Probably theatomic
can be converted back to a normal bool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will verify whether converting to a plain bool
is sufficient.
explicit static_async_range(Target* target) : target_(target) {} | ||
|
||
static_async_range(static_async_range const& rhs) = default; | ||
static_async_range(static_async_range && rhs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably the custom move constructors and operators aren't needed (and probably it is left from debugging).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are not really needed, indeed. OTOH, explicitly setting the moved-from object to be invalid might be a good thing in general.
@hkaiser I was able to reproduce the Issue, see my comments above. Especially see:
|
@hkaiser I added assertions to ensure that the traversal is finished only once and it seems that the final handler is called at least twice which is probably the main reason of the iterator dereferencation error: |
I think I addressed this concern above.
This might explain things. Will you investigate this? |
@Naios I think the |
- demoting detached flag to plain boolean.
@Naios please see my latest patch adding an atomic |
We could wrap the |
It mainly prevents from calling the finalize code more than once. |
@Naios are you ok with this change now? |
@hkaiser I'm not sure whether this is the correct solution. Yesterday I spent more time on this and at least for me it seems like the memory of the traversal frame got corrupted. Or the pointer is freed too early. |
@Naios Here is what I know. The main problem is that under certain circumstances, finalize might be called more than once. I have not been able to think of a scenario how this could happen, but apparently - in concurrent scenarios only - there is a chance of this occurring. Now, the first invocation of finalize moves the vector into the function invoked by dataflow. This invalidates the iterators into said vector - which is flagged as an error by the iterator checking facilities. This PR proposes to fix this by changing three things:
|
There is an issue somewhere else, maybe we could just merge this workaround, since it was present in the original code too. Did you test whether the |
Ok, I take this as you accepting this PR ;-)
This is an assumption on my end as I have no other explanation for the things going on. |
@Naios please verify that this does not break anything