-
-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes to local variables not visible to other threads even when protected by a lock #3891
Comments
After thinking about this for a few more minutes we realized that there's an even simpler way to illustrate the problem: lock = Mutex.new
counter = 0
threads = Array.new(2) do
Thread.start do
10_000.times do
lock.lock
begin
counter += 1
ensure
lock.unlock
end
end
end
end
threads.map(&:join)
puts("#{JRUBY_VERSION}: #{counter}") $ rvm jruby-1.7.25,jruby-9.0.5.0 do ruby jr9k.rb
1.7.25: 20000
9.0.5.0: 12835 Different numbers are printed by JRuby 9K for every run. |
You are correct, and this is likely due to an optimization we do in 9k that we'll have to back off. The optimization I'm speaking of is that we will only publish closed-over variables near calls, to avoid the cost of going to/from variables stored in an object on the heap. This works ok for a single thread but if multiple threads are there they may not see changes other threads have made. To test this, I ran your script with -X-C, which only runs in our interpreter (which does not do this optimization normally). It runs just fine. |
I am running out the door .. but, lock.lock and lock.unlock are both calls and so this should have worked as expected. I suspect something is up with the ensure. Try without the ensure and see if it works. I can investigate later in the night. |
Removing |
Here's the main part of the IR for the closure, with begin/ensure, queue access, and "local_remaining" removed:
The binding load of |
Yes, this is a real bug ... 'counter' is not reloaded inside the locked region because "blah.lock" doesn't have a closure passed into it, and the analysis assumes that the call cannot modify any scope vars .. which is true (except for evals but we handle that specially already). But, lock is special here .. anyway, needs further investigation and discussion. |
Anyway, simplest solution for now is to turn off the AddLocalVar* pass and make sure JIT adds the loads/stores for local var access. I can take a look if that doesn't work or if something breaks. |
When I removed AddLocalVar, I got a "nil cannot be coerce to X" type warning, so I imagine some value was not getting loaded properly anymore. I did not dig any deeper than that. What do you mean "JIT adds the loads/stores for local var acess"? |
Here is a script that I believe demonstrates the effect deterministically: counter = 0
q1 = Queue.new
q2 = Queue.new
t1 = Thread.start do
counter += 1
q1 << nil
q2.pop
counter += 1
end
t2 = Thread.start do
q1.pop
counter += 1
q2 << nil
end
t1.join
t2.join
puts("#{JRUBY_VERSION}: #{counter}") The counter variable is incremented three times, in two different threads. The two queues are used to ensure execution of the threads are interleaved in the (un)desired order. The should produce 3 every time, but (at least seemingly) deterministically produces 2 in JRuby 9k. When passing In addition to serving as a deterministic example, it also shows that |
Yup, we figured out y'day that this is a bigger bug. We have a couple different fix strategies that will handle this. |
This was a good experiment, but we're not properly ensuring the heap variables are being loaded live when needed, causing examples like that in jruby#3891 to fail to propagate changes across threads. By implementing LocalVariable load/store logic in JIT and turning off the "Add" pass, we basically revert heap vars to always being read/written immediately, as in JRuby 1.7.25. It may be possible to improve the pass so that it localizes the loads and stores better and ensures we don't miss updates we should see, but this commit will test whether the "nuclear option" passes all our suites.
@grddev If you'd like, you can test out the This may not be the eventual fix, but I pushed the PR to test and discuss. |
The problem seems to be that the analyses think that the binding for the blocks passed to the threads haven't escaped, but it is not true since counter is present in the parent scope and given that the blocks reference a variable in the parent scope, the binding for the block's scopes have escaped (i.e. can be modified outside the current execution context). You can see that all the threading runs properly if you introduce a boo{} in the thread bodies where boo is simply def boo; end So, the fix might be as simple as fixing the scope.bindingHasEscaped() method. |
Separately, once that fix is done, it is worth evaluating perf. with / without this pass since the JIT can now handle the pass being turned off. |
This was a good experiment, but we're not properly ensuring the heap variables are being loaded live when needed, causing examples like that in jruby#3891 to fail to propagate changes across threads. By implementing LocalVariable load/store logic in JIT and turning off the "Add" pass, we basically revert heap vars to always being read/written immediately, as in JRuby 1.7.25. It may be possible to improve the pass so that it localizes the loads and stores better and ensures we don't miss updates we should see, but this commit will test whether the "nuclear option" passes all our suites.
So, I started looking at the code to fix it and realized that my earlier hunch about 'bindingHasEscaped' was wrong because 'bindingHasEscaped' has a very specific semantics which is that the current dyn-scope has escaped. Anyway, a little bit more code reading shows me that jruby/core/src/main/java/org/jruby/ir/dataflow/analyses/LoadLocalVarPlacementNode.java Lines 79 to 88 in eae1493
I am looking for "!scope.definesLocalVariable(v)" and the use/def vars are set up in jruby/core/src/main/java/org/jruby/ir/IRScope.java Lines 939 to 959 in eae1493
|
Sorry about the wild-goose chase .. in the end, the fix was very simple .. I was trying to diagnose the problem when multi-tasking heavily instead of context switching jruby properly and looking at the code. |
Anyway, one good thing about me sending everyone on a wild-goose chase is that we we got #3898 and related fixes which now gives us another perf. knob. So it is worth merging that PR and see how performance differs with and without the pass enabled. |
Disable the AddLocalVarLoadStore pass to fix #3891.
I have merged the logic to JIT properly without ALVLS, so we can now evaluate both modes and see whether we want to turn it back on. |
I don't think we ever evaluated whether to reinstate ALVLS, did we? @subbuss so you think ALVLS would no longer exhibit this problem? I think evaluation will mostly involve a study of how many instructions (IR and JVM) are produced with and without ALVLS on some typical pieces of code. |
Ya, I don't think it will ... Yes, evaluate code size + perf. I suspect perf. changes will mostly be noise since most of the benefit comes from dynscope removal which was originally part of this ALVLS pass but which I split out of this since I could deduce that without this analysis. |
In JRuby 1.7 (all versions we have tested, including 1.7.25), the following script runs forever. The distilled test-case decrements a variable from two threads, while the main thread waits for all decrements have been applied. The decrements are then checked to ensure that each decrement call has seen a unique value of the counter.
The example is obviously nonsensical, but distilled from a much larger program where the counter was decremented in multiple threads, and then when the count reached zero, the code would continue. Tracking the intermediate values was added in order to facilitate debugging.
In JRuby 9k (tried 9.0.5.0 and 9.1.0.0-SNAPSHOT (2.3.0) 2016-05-03 9ee8643) the script terminates quickly with duplicates in the returned counts.
The output contains a
.
for each attempt, and contains a list of each observed value of the counter. As you can see in the output above, the values 199, 186, etc. are repeated (an no value below 42 is present).Somehow, the updates to the value of
remaining
is not visible to other threads as expected. Notice thatremaining
is only read (and copied intolocal_remaining
) inside the critical section.Changing the script slightly removes the problem. For example:
remaining
to[200]
and changing all references to read and write the first element of the array.lock.lock
andlock.unlock
withlock.synchronize
.remaining -= 1
inproc { ... }.call
.remaining
to an instance variable@remaining
.To eliminate any issues with
Queue
andMutex
we also reproduced the same issues withjava.util.concurrent.locks.ReentrantLock
andjava.util.concurrent.ArrayBlockingQueue
.Our conclusion is that updates to local variables are not guaranteed to be visible to other threads, even if updates are protected with critical sections.
The text was updated successfully, but these errors were encountered: