Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework the Regexp cache to be a simple WeakHashMap #2188

Closed
wants to merge 2 commits into from

Conversation

cheald
Copy link
Contributor

@cheald cheald commented Nov 13, 2014

This means that cache keys are shorter-lived than the SoftReferences used before, but it will
allow for the rapid expiration of memory-hungry one-shot regexes as well.

Implemented per discussion with @headius in IRC.

Testing with the following script demonstrates the pathological case in master, and the much improved handling with the WeakRefCache:

require 'benchmark'
Benchmark.bm do |x|
  x.report("regexp creation") do
    count = 0
    2000.times do |i|
      str = "test#{i}" * 5
      Regexp.new(str)
      i.times do |j|
        Regexp.new(str)
        Regexp.new("test#{i * j}" * 5)
      end
      print "."
    end
  end
end

I think the issue here is that when there are a large number of keys in the caches, the overhead of the actual ConcurrentHashMap currently used to back the cache becomes prohibitive and starves the JRuby process. This results in something like this (and JRuby either stalls or crashes once it gets into this boundary behavior):

However, with the WeakHashMap cache, behavior is more like this (and the test finishes in ~9 sec):

Fixes #2078

…ache

keys are shorter-lived than the SoftReferences used before, but it will
allow for the rapid expiration of memory-hungry one-shot regexes as well.

Fixes jruby#2078
@cheald
Copy link
Contributor Author

cheald commented Nov 13, 2014

Better view at 6k iterations:

This implementation manages to avoid clogging up the GC, even when a huge number of regexes are passed through the cache.

@kares
Copy link
Member

kares commented Nov 13, 2014

this is really najs ... as far as I understand cases such as : def hello(name); name =~ /world/ end won't no longer cause the regexp to hit the cache since it's not reachable after the method execution?

any considerations for using ConcurrentHashMap (just asking as I was looking at this my self previously) ?

@cheald
Copy link
Contributor Author

cheald commented Nov 13, 2014

It may still hit the cache; WeakRefs are collected during GC once no more strong refs to an object exist. If you keep a Regexp in scope, or you're doing a lot of regexp work in a loop that happens between GC cycles, it'll continue to hit the cache.

ConcurrentHashMap was used in the previous implementation, and it seems to have been a cause of the problems.

I think the ideal implementation would be some kind of LRU cache + SoftReferences to eat the cache if memory needs dictated, but sun.misc.LRUCache is O(n) when it moves an element, which was deemed unusable for these purposes.

@cheald
Copy link
Contributor Author

cheald commented Nov 13, 2014

Here's a quick little experiment which shows the caching behavior:

Basically, the RubyRegexp holding a Joni::Regex will have to both go out of scope and cease to exist before the Joni::Regex can be removed from the cache. This isn't quite as good as an LRU cache, but if you're regularly referencing a regexp, it's going to hang around for a bit even after execution leaves Ruby scope.

> class Foo; def m; Regexp.new("foobar").tap {|r| p [r.object_id, r.to_java.pattern] }; nil; end; end
 => :m
> Foo.new.m
[2338, #<Java::OrgJoni::Regex:0x2de92661>]
 => nil
> p ObjectSpace._id2ref(2338).to_s; nil
"(?-mix:foobar)"
 => nil

# Put some pressure on the VM so it executes GC
> 1000.times { "foo" * 1_000_000 }
 => 1000
> Java::JavaLang::System.gc()
 => nil

# See if the object still exists
> p ObjectSpace._id2ref(2338).to_s; nil
""
 => nil

# Run #m again - we have a new regex ID
> Foo.new.m
[2340, #<Java::OrgJoni::Regex:0x5b138286>]
 => nil

# The Regex is no longer in scope, but it isn't been GC'd, so creating another regex with the same
# pattern gets us the same underlying regex
> Foo.new.m
[2342, #<Java::OrgJoni::Regex:0x5b138286>]
 => nil

# Just invoking a GC isn't enough to make it go away
> Java::JavaLang::System.gc()
 => nil
> Foo.new.m
[2344, #<Java::OrgJoni::Regex:0x5b138286>]
 => nil

@kares
Copy link
Member

kares commented Nov 13, 2014

great, thanks for the sample ... I meant ConcurrentHashMap but with weak keys of course not the previous.

@cheald
Copy link
Contributor Author

cheald commented Nov 13, 2014

I initially tried that (though with SoftReferences, rather than WeakReferences), but SoftReference(foo) != SoftReference(foo), so you can't ever retrieve object references after sticking them into the cache.

@headius
Copy link
Member

headius commented Nov 14, 2014

Impl looks good, and it's definitely a great reduction in Regex retention.

I did have a question though...is it ok that there's no cleanup of the WeakReference keys you're inserting? What happens when only the ByteList gets dereferenced and collected before the Regex? We'd have a map with a bunch of WeakReference keys in buckets that don't make sense anymore. All that would be required is creating a bunch of Regexp from dynamic strings and saving them. The dynamic strings would dereference, so the bytelists would dereference, so we'd have a bunch of broken keys.

It's a bit late, so I'm not going to guarantee my logic is right. If it is, I think we need to persist the hashcode in the WeakReference and set up a ReferenceQueue to clean them up when they die.

I thought we had a double weak map in the codebase somewhere, but I can't find it now.

@headius
Copy link
Member

headius commented Nov 14, 2014

Wait, I think I see it now...the bytelists are in the keys, which are WeakReferences managed by the WeakHashMap, so they do get cleaned up. That's fine then for the ByteLists. And if the Regex becomes dereferenced...I think we still need some cleanup, because it just returns null now and leaves the possibly-referenced ByteList pointing at an empty value. Not as big a deal as broken keys, but do we still want some cleanup?

@cheald
Copy link
Contributor Author

cheald commented Nov 14, 2014

A reference to the ByteList is stored in the Regex (via setUserData), so the Regex can't be collected until the ByteList goes away anyhow, can it?

@headius
Copy link
Member

headius commented Nov 23, 2014

Looking at it with a clearer head today...

The Regex references the ByteList, so the Regexp can go away if the ByteList is still referenced. That would leave an empty WeakReference in the map that's not cleaned up until the ByteList key is removed or replaced with a live value.

It's not a big deal to cleanup, so I'll go ahead and merge and then add the additional scrubbing.

@headius
Copy link
Member

headius commented Nov 23, 2014

On second thought...

I realized today that was just need a simple map with weak values. If you are repeatedly requesting the same/matching ByteList be compiled to Regex, then the entry will stay alive and we'll return it. If you're not, and the Regex is not used anywhere else, it will remain in the cache.

The only potential for leaking is if the Regex reference is held elsewhere and you never return to the cache, which isn't really a leak as much as it is an unused cache entry. That "leak" is limited by the number of Regex entries you're holding outside the cache, which are almost certainly a lot larger than the entries themselves.

I'm going to close this and do the simple weak map version.

@headius headius closed this Nov 23, 2014
@headius headius added this to the Invalid or Duplicate milestone Nov 23, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Very large retained heap size for org.jruby.RubyRegexp$RegexpCache in JRuby Rails App
3 participants