
Attempting to avoid data races in async_traversal while evaluating dataflow() #3001

Merged
merged 6 commits into master from fixing_dataflow on Nov 23, 2017

Conversation

hkaiser
Member

@hkaiser hkaiser commented Nov 11, 2017

@Naios please verify that this does not break anything

@Naios
Contributor

Naios commented Nov 12, 2017

Ouch, these typos... thanks for correcting them. @hkaiser I think your change doesn't make any difference to the current revision: detach() just sets the bool detached_ member inside the async_traversal_point class rather than altering the control flow directly, and thus the order of the two calls doesn't matter.

As I recall, there was a check that was supposed to prevent the finalizer of dataflow from being called more than once: c852439#diff-040de9e61eef02088de58b48a421a1e5L514 .
During a Skype conversation we went through this section and concluded that it could be dropped, because we thought there was no possibility of the continuation of the future being executed more than once. Maybe this has something to do with our issue here, but I'm not sure.

@hkaiser
Member Author

hkaiser commented Nov 12, 2017

@Naios unfortunately this patch does not fix the problem (it makes it appear less often, though, but that could be coincidental). My current theory is that the container which the iterators in https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/util/detail/pack_traversal_async_impl.hpp#L232 refer to goes out of scope too early. It might be noteworthy that I move the container into dataflow(), which might be a use case we have not tested thoroughly enough. I will investigate further.

@hkaiser
Member Author

hkaiser commented Nov 12, 2017

Also, I think setting the flag before actually attaching a continuation to the current future changes the behavior: under certain circumstances the continuation might be executed during async_continue, in which case the next async traversal is started even before the detached flag can be reset.
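To illustrate the ordering hazard (a minimal, self-contained sketch with hypothetical names, not the actual HPX code): if attaching a continuation can run it inline because the future is already ready, the flag has to be written before the attach call, or the inline execution observes a stale value.

    #include <functional>
    #include <iostream>
    #include <utility>

    // Simulates a future whose continuation may run inline if the
    // future is already ready at attach time.
    struct fake_future
    {
        bool ready = true;
        std::function<void()> continuation;

        void attach_continuation(std::function<void()> f)
        {
            if (ready)
                f();    // executed before attach_continuation returns!
            else
                continuation = std::move(f);
        }
    };

    int main()
    {
        bool detached = false;
        fake_future fut;

        detached = true;    // must happen first: the inline execution
                            // below already observes the final value
        fut.attach_continuation([&] {
            std::cout << "resumed, detached=" << detached << "\n";
        });
        return 0;
    }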

@sithhell
Member

I think that, in principle, setting the detached flag before executing async_continue is correct. A race might still occur, though, since detached_ is not synchronized in any way. It should either be protected by a lock or changed to a std::atomic_flag (or similar).
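A minimal sketch of that suggestion (hypothetical names, not the actual HPX code), using std::atomic_flag so that exactly one caller observes the transition:

    #include <atomic>

    struct async_traversal_point
    {
        std::atomic_flag detached_ = ATOMIC_FLAG_INIT;

        // test_and_set returns the previous value, so exactly one
        // caller (the first) gets 'true' back from detach(), even
        // under concurrent invocation.
        bool detach()
        {
            return !detached_.test_and_set(std::memory_order_acq_rel);
        }
    };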

@Naios
Contributor

Naios commented Nov 15, 2017

The detached variable isn't shared across threads, because we create a new variable for every resumed traversal.

@sithhell
Member

Don't the completion handlers concurrently access the variable?

@Naios
Contributor

Naios commented Nov 15, 2017

No, the only state shared across the suspensions is the async_traversal_frame, which is stored on the heap, and the iterators of the current traversal hierarchy.
See the arguments of frame_->async_continue(*current, std::move(state)); and its implementation:

            template <typename T, typename Hierarchy>
            void async_continue(T&& value, Hierarchy&& hierarchy)
            {
                // Create a self reference
                boost::intrusive_ptr<async_traversal_frame> self(this);

                // Create a callable object which resumes the current
                // traversal when it's called.
                auto resumable = make_resume_traversal_callable(
                    std::move(self), std::forward<Hierarchy>(hierarchy));

                // Invoke the visitor with the current value and the
                // callable object to resume the control flow.
                util::invoke(visitor(), async_traverse_detach_tag{},
                    std::forward<T>(value), std::move(resumable));
            }

where the make_resume_traversal_callable transfers the current state to the next resumption.

In addition, since we resolve the futures one after another, I think we shouldn't encounter any threading issues at all.

@hkaiser
Member Author

hkaiser commented Nov 15, 2017

Whatever causes the problem, the current code clearly exposes data races in some way. For me, the iterators into the shared data start to point to nowhere if dataflow is used while running on more than one thread.

@sithhell
Member

sithhell commented Nov 15, 2017 via email

@hkaiser
Member Author

hkaiser commented Nov 15, 2017

I don't have a minimal use-case at the moment. Also, I have not tried using a std::atomic_flag.

@hkaiser
Member Author

hkaiser commented Nov 18, 2017

@Naios I think I understand what is going on now. However, I'm having a hard time changing the code to fix it; maybe you can help.

The issue is that the current code creates a detached flag for each iteration level, attempting to propagate the status of the child back up on exit. This causes subtle data races when futures become ready out of order. For this reason, the old dataflow code had exactly one flag to track the 'ready' state. I believe the problems should go away if the new code is changed to use just one Boolean to track readiness (instead of one per iteration level).
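A sketch of that idea (hypothetical, heavily simplified from the actual code): every nested traversal level references one flag owned by the resumption itself, instead of owning its own copy.

    // One flag per resumption, shared by reference across all levels.
    struct async_traversal_point
    {
        bool& detached_;

        explicit async_traversal_point(bool& detached)
          : detached_(detached)
        {
        }

        // Marking one level as detached is immediately visible to all
        // enclosing levels of the same resumption.
        void detach()
        {
            detached_ = true;
        }
    };

    void resume_traversal()
    {
        bool detached = false;                 // the single flag
        async_traversal_point point(detached);
        // ... traverse; nested points are constructed with the same
        // reference, so no per-level copies can get out of sync ...
    }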

@Naios
Contributor

Naios commented Nov 18, 2017

@hkaiser I'm not sure what is going on there; are you open to a call tomorrow?
I'm trying to get my phylanx installation ready in the meantime.

Theoretically, every future should be traversed only once, so the traversal is strictly ordered.
The detached bool variable is only used to mark the current execution context as abandoned.
When the last future is traversed, the final handler is called; thus I think a global detached variable would be a workaround for an issue we currently don't understand.

@hkaiser
Member Author

hkaiser commented Nov 18, 2017

@Naios the latest commit fixes the issue I was seeing. Please verify that I have not broken anything.

- fixed another potential use after move problem
- fixed memory leak (dataflow was leaking its shared state)
- simplified function operator implementation for resume_state_callable
    -    return typename types::frame_pointer_type(ptr);
    +    // Create an intrusive_ptr from the heap object, don't increase
    +    // reference count (it's already 'one').
    +    return typename types::frame_pointer_type(ptr, false);
Contributor

Good catch here 👍
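For context (a self-contained sketch, not the actual HPX frame type): boost::intrusive_ptr's constructor takes a second add_ref parameter that defaults to true. Adopting an object that already starts with a reference count of one must pass false, otherwise the extra count is never released and the object leaks.

    #include <boost/intrusive_ptr.hpp>

    struct frame
    {
        int count = 1;    // created with one reference already held
    };

    inline void intrusive_ptr_add_ref(frame* p) { ++p->count; }
    inline void intrusive_ptr_release(frame* p)
    {
        if (--p->count == 0)
            delete p;
    }

    int main()
    {
        frame* raw = new frame;                       // count == 1
        boost::intrusive_ptr<frame> ptr(raw, false);  // adopt: count stays 1
        return 0;  // ptr's destructor drops the count to 0 and deletes
    }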

    @@ -514,8 +553,7 @@ namespace util {
         template <typename Frame, typename State>
         void resume_traversal_callable<Frame, State>::operator()()
         {
    -        auto hierarchy =
    -            util::tuple_cat(util::make_tuple(frame_), std::move(state_));
    +        auto hierarchy = util::tuple_cat(util::make_tuple(frame_), state_);
Contributor

Every resume_traversal_callable instance should only be called once, so the move should be valid here.
Maybe we could guard this with an assertion.

Which issue occurred with this move?

Member Author

Frankly, I was not sure whether this might have caused my issues. I will revert this change and add an assertion to make sure it doesn't cause any problems.

Member Author

    for (/**/; !range.is_finished(); ++range)
    {
        async_traverse_one(range);
        if (is_detached()) // test before increment
Contributor

What I'm wondering about here is that the iterator isn't modified inside async_traverse_one.
When !range.is_finished(), it should be perfectly valid to increment the iterator.
In the worst case we finish the range, but that should never yield an iterator dereferencing error.

Member Author

Right, the iterators are not changed inside async_traverse_one. However, this function might move away the vector those iterators refer to (see the std::move(args_) here:

    visitor(), async_traverse_complete_tag{}, std::move(args_));

). This was actually the main problem I stumbled over.
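A minimal, self-contained demonstration of that hazard (not the HPX code): iterators obtained from a vector must not be used with it once its contents have been moved away; checked-iterator builds (e.g. MSVC's debug iterators) flag exactly this.

    #include <utility>
    #include <vector>

    int main()
    {
        std::vector<int> args{1, 2, 3};
        auto it = args.begin();        // iterator into args

        std::vector<int> sink = std::move(args);  // storage now in sink

        // Continuing to use 'it' as an iterator of 'args' is invalid;
        // checked-iterator builds report this as an error.
        // ++it;  // would be flagged if executed against 'args'
        (void) sink;
        (void) it;
        return 0;
    }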

    @@ -275,14 +300,15 @@ namespace util {
         {
             Frame frame_;
             tuple<Hierarchy...> hierarchy_;
    -        bool detached_;
    +        std::atomic<bool>& detached_;
Contributor

  • The usage of the reference is a great simplification 👍
  • From my understanding, making the detached_ variable atomic doesn't make any difference here since, as described earlier, the variable isn't shared across threads: we create a local variable for every resumption. So the atomic can safely be elided. My assumption is that I forgot to propagate the value of the variable back somewhere. The atomic can probably be converted back to a normal bool.

Member Author

Ok, I will verify whether converting to a plain bool is sufficient.

    explicit static_async_range(Target* target) : target_(target) {}

    static_async_range(static_async_range const& rhs) = default;
    static_async_range(static_async_range && rhs)
Contributor

The custom move constructors and operators probably aren't needed (they are probably left over from debugging).

Member Author

Those are not really needed, indeed. OTOH, explicitly setting the moved-from object to be invalid might be a good thing in general.
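A sketch of that pattern (hypothetical, reduced member set): a hand-written move constructor that leaves the moved-from range visibly invalid instead of relying on the defaulted member-wise move.

    template <typename Target>
    struct static_async_range
    {
        Target* target_;

        explicit static_async_range(Target* target) : target_(target) {}

        static_async_range(static_async_range const& rhs) = default;

        static_async_range(static_async_range&& rhs) noexcept
          : target_(rhs.target_)
        {
            rhs.target_ = nullptr;  // moved-from object is clearly invalid
        }
    };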

@Naios
Contributor

Naios commented Nov 19, 2017

@hkaiser I was able to reproduce the issue, see my comments above.
I would like to put a bit more time into investigating this.

Especially see:

What I'm wondering about here is that the iterator isn't modified inside async_traverse_one.
When !range.is_finished(), it should be perfectly valid to increment the iterator.
In the worst case we finish the range, but that should never yield an iterator dereferencing error.

@Naios
Contributor

Naios commented Nov 19, 2017

@hkaiser I added assertions to ensure that the traversal is finished only once, and it seems that the final handler is called at least twice, which is probably the main cause of the iterator dereferencing error:

[screenshot: debugger session showing the triggered assertion]

@hkaiser
Member Author

hkaiser commented Nov 19, 2017

Especially see:

What I'm wondering about here is that the iterator isn't modified inside async_traverse_one. When !range.is_finished(), it should be perfectly valid to increment the iterator. In the worst case we finish the range, but that should never yield an iterator dereferencing error.

I think I addressed this concern above.

@hkaiser I added assertions to ensure that the traversal is finished only once, and it seems that the final handler is called at least twice, which is probably the main cause of the iterator dereferencing error

This might explain things. Will you investigate this?

@hkaiser
Member Author

hkaiser commented Nov 19, 2017

@Naios I think the finished_ flag is the one that needs to be atomic. We've had similar issues with dataflow, where we had to protect finalize from being called more than once using such a flag.
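A minimal sketch of such a guard (hypothetical names, not the actual HPX code): std::atomic<bool>::exchange returns the previous value, so only the caller that flips the flag from false to true runs the finalizer.

    #include <atomic>
    #include <utility>

    struct async_traversal_frame
    {
        std::atomic<bool> finished_{false};

        template <typename F>
        void finalize_once(F&& finalize)
        {
            // Exactly one caller observes 'false' here, even if the
            // finish step is reached concurrently from several
            // continuations; all others skip the finalizer.
            if (!finished_.exchange(true, std::memory_order_acq_rel))
            {
                std::forward<F>(finalize)();
            }
        }
    };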

- demoting detached flag to plain boolean.
@hkaiser
Member Author

hkaiser commented Nov 19, 2017

@Naios please see my latest patch adding an atomic finished flag and demoting the detached flag to a plain Boolean.

@Naios
Contributor

Naios commented Nov 19, 2017

We could wrap the finished_ flag in #ifndef NDEBUG, if it is only used by an assertion.

@hkaiser
Member Author

hkaiser commented Nov 19, 2017

We could wrap the finished_ flag in #ifndef NDEBUG, if it is only used by an assertion.

It mainly prevents the finalize code from being called more than once.

@hkaiser
Member Author

hkaiser commented Nov 21, 2017

@Naios are you ok with this change now?

@Naios
Contributor

Naios commented Nov 21, 2017

@hkaiser I'm not sure whether this is the correct solution. Yesterday I spent more time on this, and at least for me it looks like the memory of the traversal frame gets corrupted, or the pointer is freed too early.
I added a breakpoint to the finish method; the first time the method is called, the debugger shows strange values and the assertion is triggered as well.
Maybe it would be helpful to inspect this with ASan and UBSan.

@hkaiser
Member Author

hkaiser commented Nov 21, 2017

@Naios Here is what I know. The main problem is that under certain circumstances, finalize might be called more than once. I have not been able to think of a scenario in which this could happen, but apparently - in concurrent scenarios only - there is a chance of it occurring. Now, the first invocation of finalize moves the vector into the function invoked by dataflow. This invalidates the iterators into said vector, which is flagged as an error by the iterator-checking facilities.

This PR proposes to fix this by changing three things:

  • Introduce an atomic finalized flag that prevents the finalize functionality from being executed more than once. From inspecting the code, I'm convinced that it is safe to let all subsequent invocations of finalize simply skip calling into the dataflow code. Those invocations can only come from the 'tail end' of an asynchronous recursive execution of the iteration code.
  • Slightly change the dynamic range iteration such that the iterators themselves are no longer touched once a particular asynchronous recursive execution of the iteration code has been detached. Those iterators could have become invalid because of finalize being called on a separate thread (see point 1).
  • Make the detached flag 'global' to a particular asynchronous recursive execution of the iteration code. That should guarantee that any execution is reliably detached once one of the futures was not ready to begin with. This flag does not need to be atomic, as it sits on the stack of the execution and can't be accessed by more than one thread concurrently. Also, this flag is now set to true 'before' a continuation is attached to the current future, which prevents races caused by the continuation being executed out of order.

@Naios
Contributor

Naios commented Nov 21, 2017

I have not been able to think of a scenario in which this could happen

The issue is somewhere else; maybe we could just merge this workaround, since it was present in the original code too.

Did you test whether the async_complete handler is executed more than once by setting breakpoints?
Because in my scenario the frame was corrupted from the first invocation.

@hkaiser
Member Author

hkaiser commented Nov 21, 2017

The issue is somewhere else; maybe we could just merge this workaround, since it was present in the original code too.

Ok, I take this as you accepting this PR ;-)

Did you test whether the async_complete handler is executed more than once by setting breakpoints? Because in my scenario the frame was corrupted from the first invocation.

This is an assumption on my end as I have no other explanation for the things going on.

@hkaiser hkaiser merged commit bad0a9d into master Nov 23, 2017
@hkaiser hkaiser deleted the fixing_dataflow branch November 23, 2017 15:11