-
-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework the Regexp cache to be a simple WeakHashMap #2188
Conversation
…ache keys are shorter-lived than the SoftReferences used before, but it will allow for the rapid expiration of memory-hungry one-shot regexes as well. Fixes jruby#2078
this is really najs ... as far as I understand cases such as : any considerations for using ConcurrentHashMap (just asking as I was looking at this my self previously) ? |
It may still hit the cache; WeakRefs are collected during GC once no more strong refs to an object exist. If you keep a Regexp in scope, or you're doing a lot of regexp work in a loop that happens between GC cycles, it'll continue to hit the cache. ConcurrentHashMap was used in the previous implementation, and it seems to have been a cause of the problems. I think the ideal implementation would be some kind of LRU cache + SoftReferences to eat the cache if memory needs dictated, but sun.misc.LRUCache is O(n) when it moves an element, which was deemed unusable for these purposes. |
Here's a quick little experiment which shows the caching behavior: Basically, the > class Foo; def m; Regexp.new("foobar").tap {|r| p [r.object_id, r.to_java.pattern] }; nil; end; end
=> :m
> Foo.new.m
[2338, #<Java::OrgJoni::Regex:0x2de92661>]
=> nil
> p ObjectSpace._id2ref(2338).to_s; nil
"(?-mix:foobar)"
=> nil
# Put some pressure on the VM so it executes GC
> 1000.times { "foo" * 1_000_000 }
=> 1000
> Java::JavaLang::System.gc()
=> nil
# See if the object still exists
> p ObjectSpace._id2ref(2338).to_s; nil
""
=> nil
# Run #m again - we have a new regex ID
> Foo.new.m
[2340, #<Java::OrgJoni::Regex:0x5b138286>]
=> nil
# The Regex is no longer in scope, but it isn't been GC'd, so creating another regex with the same
# pattern gets us the same underlying regex
> Foo.new.m
[2342, #<Java::OrgJoni::Regex:0x5b138286>]
=> nil
# Just invoking a GC isn't enough to make it go away
> Java::JavaLang::System.gc()
=> nil
> Foo.new.m
[2344, #<Java::OrgJoni::Regex:0x5b138286>]
=> nil |
great, thanks for the sample ... I meant ConcurrentHashMap but with weak keys of course not the previous. |
I initially tried that (though with SoftReferences, rather than WeakReferences), but |
Impl looks good, and it's definitely a great reduction in Regex retention. I did have a question though...is it ok that there's no cleanup of the WeakReference keys you're inserting? What happens when only the ByteList gets dereferenced and collected before the Regex? We'd have a map with a bunch of WeakReference keys in buckets that don't make sense anymore. All that would be required is creating a bunch of Regexp from dynamic strings and saving them. The dynamic strings would dereference, so the bytelists would dereference, so we'd have a bunch of broken keys. It's a bit late, so I'm not going to guarantee my logic is right. If it is, I think we need to persist the hashcode in the WeakReference and set up a ReferenceQueue to clean them up when they die. I thought we had a double weak map in the codebase somewhere, but I can't find it now. |
Wait, I think I see it now...the bytelists are in the keys, which are WeakReferences managed by the WeakHashMap, so they do get cleaned up. That's fine then for the ByteLists. And if the Regex becomes dereferenced...I think we still need some cleanup, because it just returns null now and leaves the possibly-referenced ByteList pointing at an empty value. Not as big a deal as broken keys, but do we still want some cleanup? |
A reference to the ByteList is stored in the Regex (via setUserData), so the Regex can't be collected until the ByteList goes away anyhow, can it? |
Looking at it with a clearer head today... The Regex references the ByteList, so the Regexp can go away if the ByteList is still referenced. That would leave an empty WeakReference in the map that's not cleaned up until the ByteList key is removed or replaced with a live value. It's not a big deal to cleanup, so I'll go ahead and merge and then add the additional scrubbing. |
On second thought... I realized today that was just need a simple map with weak values. If you are repeatedly requesting the same/matching ByteList be compiled to Regex, then the entry will stay alive and we'll return it. If you're not, and the Regex is not used anywhere else, it will remain in the cache. The only potential for leaking is if the Regex reference is held elsewhere and you never return to the cache, which isn't really a leak as much as it is an unused cache entry. That "leak" is limited by the number of Regex entries you're holding outside the cache, which are almost certainly a lot larger than the entries themselves. I'm going to close this and do the simple weak map version. |
This means that cache keys are shorter-lived than the SoftReferences used before, but it will
allow for the rapid expiration of memory-hungry one-shot regexes as well.
Implemented per discussion with @headius in IRC.
Testing with the following script demonstrates the pathological case in master, and the much improved handling with the WeakRefCache:
I think the issue here is that when there are a large number of keys in the caches, the overhead of the actual ConcurrentHashMap currently used to back the cache becomes prohibitive and starves the JRuby process. This results in something like this (and JRuby either stalls or crashes once it gets into this boundary behavior):
However, with the WeakHashMap cache, behavior is more like this (and the test finishes in ~9 sec):
Fixes #2078