Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for #load_iseq-like callback #5206

Open
MaxLap opened this issue Jun 4, 2018 · 36 comments
Open

Add support for #load_iseq-like callback #5206

MaxLap opened this issue Jun 4, 2018 · 36 comments

Comments

@MaxLap
Copy link

MaxLap commented Jun 4, 2018

Hello, I'm one of the dev of deep-cover and I am hitting a bit of a challenge to fully support JRuby.

Little context, we instrument Ruby code in order to add trackers (setters in a global variable) to detect what has been executed and what hasn't. This allows us to do branch coverage and "node" coverage, which basically deals with the coverage at the AST node granularity.

We are wondering if a method similar to #load_iseq of Ruby 2.3+ could be added to JRuby.

We don't need anything related to compiling the code. All we need is a method that gets called when JRuby is about to execute Ruby code, either for a #require, a #load, or when triggered by an autoload. That method would receive the absolute path to the file, and return the the actual Ruby code to execute.

Maybe there is even already a way of doing this with some monkey patch?

@headius
Copy link
Member

headius commented Jun 4, 2018

There might be a way to hack something like this in today, but it would be tricky. In any case, I think it's a great idea. I was just talking with @apotonick about making it possible to hook into code loading/parsing/compiling so the AST or IR could be tweaked along the way. The two ideas seem very similar.

@MaxLap
Copy link
Author

MaxLap commented Jun 4, 2018

Great!

I'd be interested in the hack for how to do it today. This would allow us to support older versions, and once this new feature you are talking about is released, we can detect which one to use.

@headius
Copy link
Member

headius commented Jun 8, 2018

Hey I'm back home and back to proper work again. So load_iseq is basically a callback the CRuby runtime invokes with the path to the file, and it is expected to return a CRuby InstructionSequence that will then be used for that file rather than loading, parsing, and compiling it again.

After poking around a bit, I'm afraid I can see no way of doing this with current JRuby. Once the file to load has been found, it is immediately passed off to the parser and compiler, and there's no downcalls into Ruby during that process.

However, this should be fairly simple to support. Instead of iseq, we would have IR. Instead of a method you overwrite (wow, super gross) I would prefer to have a way to register file handlers. They'd return either false or the compiled IR for the script. If false, the next in the chain would be invoked, and so on until no handlers are left and the default parse+compile executes.

@enebo Thoughts on this? We already have mechanisms in place for saving and loading IR to a binary. All we really need to do is open up the loading and saving mechanisms to Ruby and add this callback before we load a file (if callback has been added; no-op and fall back to normal handling otherwise).

We could also retool the persistence mechanism to operate from Ruby using this callback. It would be the same persistence logic, but it would run in Ruby and be observable from Ruby, so loading and IR persistence could be hooked into and observed by tools and tweaked by customizers.

We might also consider doing this as a Java callback, which could obviously still be used from Ruby. This would make it more standard for us to take a filename and load it in a number of ways, like using a source cache or AST cache instead of an IR cache, precompiling it all the way to JVM bytecode immediately, searching in classpath for an equivalent precompiled script, etc. All of these would just be hooks into the loading mechanism, and more standard than the messy hodge-podge of steps we have right now.

@enebo
Copy link
Member

enebo commented Jun 8, 2018

@headius @MaxLap what is the most basic description of what is needed? Source which is loaded and location + perhaps actual source in case of eval? I really don't see any value to returning the IR itself. You don't actually use iseq YARV instr data itself do you?

I think if you want source the solution is to go way earlier than IR/ISeq and do what @headius suggests and make an event/callback on when parser is invoked. parser of non-method/blocks will not be exact moment it is executed so that maybe is an issue but evals and all other execution scopes will be at same point it executes.

Another point would be when AST is converted to our IR (second solution). A callback there would still generate all non-block/methods when that code is about to execute and we can return filename/linenumber (although source internally is no longer available). For methods we are lazy so when IR is built for them it means we plan on executing at that point. Blocks are weirder in we generate IR at same time the non-block scopes are made into IR so you need to know whether you are in a method or not. You will not know when the block actually executes though (halting problem).

@MaxLap
Copy link
Author

MaxLap commented Jun 8, 2018

The most basic description of what i need:

When anyone executes ruby code from a file, so #require, #load and when triggered by autoload (which should use #require anyway), i want to be able to get the code and modify it, and then have it executed as if it was the code from the file.

So no matter how you want to do it, as long as I can get the original code, modify it, and then have JRuby execute it as if it came fron the originaly requested file (same stack trace), I'm happy. It can be that i must compile somehow to an intermediate representation first like #load_iseq, or i just return source code, both work for me. I can receive a file path, or the source itself directly, both work for me.

All i want it to modify ruby code that is backed by a file before it gets executed,so that i can then generate coverage report.

I use #load_iseq in MRI because it's the only way to hook the tramsform code from require and load before it gets executed without replacing the entire require mecanism. (and i do replace it for older versions, which is a real pain and is slow and require qui a few hacks to deal with autoload). The "can return compiled code" just means i call compile before returning.

Thanks

@enebo
Copy link
Member

enebo commented Jun 8, 2018

@MaxLap thanks for that description. I think based on this we want a callback from our loadservice which gives you the source before our parser starts parsing. We probably take the result of that callback as the actual source to parse. We could support giving back AST or IR but those we like to be able to change over time so supporting them as a public API is not desirable.

This callback we just need to define. Perhaps JRuby::Parser::load_filter { |filename, line_number, src| return mod_src }? Thinking about generic usage of this adding line_number probably would allow someone to not change source but properly use this to record parsing events. The return asymmetry bugs me (f(file, line, src) -> src) a little bit though. @headius what do you think?

Now that I wrote that out I am wondering if there are any other novel uses of having a method like this. Although this issue does not seem concerned with evals I could see this as a useful way of capturing all eval'd source so people could view it somehow later. The only possible problem I can think of is maybe during kernel loading some aspect of this won't work (since the system is not fully bootstrapped yet) so we may only want this on after that point?

Oh jetlag you make me ramble.

@MaxLap
Copy link
Author

MaxLap commented Jun 8, 2018

I forgot to mention in my description that i need the file path too.

Is the line number useful? Maybe if you decide to support this for #eval, but otherwise, it's always going to be 1, no?

If you do decide to support #eval, the callback would need to know that its an eval as for my use case, I don't want to deal with them.

For eval, it seems the meaning of file_path (and line number) would be different. In the require case, those are the file being executed and in the eval case, they are the location of the callsite? (can't be anything else as the code could be a string from another file)

To me, those 2 points mean that if you ever decide to have a similar callback for eval, it should be a different one.

And I think this means the callback for require most likely doesn't need the line number. Unless there is something i don't see or misunderstood.

@marcandre
Copy link

I'm sorry I missed an opportunity to sit down with you @enebo at RubyKaigi to discuss this.

JRuby::Parser::load_filter { |filename, src, line_number| return mod_src }

Sounds pretty good. I'd put src as first argument though. As @MaxLap said, it's necessary to be able to distinguish eval from require/load, so maybe an extra argument method or similar? Or line_number could be :required / :loaded in such a case? (Maybe the jetlag is also speaking ;-) )

We could support giving back AST or IR

If you want, but best use a different hook as this wouldn't be composable, i.e. only one such filter could be specified.

I'm thinking of making a small gem that would have the same interface for JRuby, MRI and TruffleRuby that would allow to compose such filters. In JRuby it would be "natural", while in MRI the caveat would be that it would use load_iseq and that such calls aren't composable.

I'd like to get @eregon of TruffleRuby in the discussion too, as we'll have the same request for it :-)

@enebo
Copy link
Member

enebo commented Jun 8, 2018

Revised idea:

JRuby::Parser::{load_filter, require_filter, eval_filter}

load and require will accept block which is |src, filename|. eval will also take linenumber and binding. The return value will be the code to be parsed unless nil is returned and then it continues to use original source. Seems fairly simple for all three implementations to support but I am not sure what type it should be on (maybe RubyVM::Parser). TruffleRuby uses a lot more Ruby source for core methods so perhaps lifecycle of when this would be available may be important?

Lastly default implementation of load_filter and require_filter can be same method aliased.

@MaxLap
Copy link
Author

MaxLap commented Jun 8, 2018

Sounds perfect for our needs.

Lifecycle wise, the normal use-case is that in the test_helper / spec_helper, we get required and then we add the load_filter. I guess at that point everything, (or almost?) including this new feature, will be available?

@headius
Copy link
Member

headius commented Jun 11, 2018

First a quick question:

All i want it to modify ruby code that is backed by a file before it gets executed,so that i can then generate coverage report.

Is there something the coverage module is not doing for you? We do not support the newer branch/etc coverage options, but it doesn't sound like you need that.

Returning to the feature in question...

So the main use case proposed here is to be able to intercept the load of Ruby files and eval source and in some way manipulate those sources on their way to being parsed and compiled. Or after parsed or after compiled but before bound or something.

The arguments passed in should be everything necessary to properly read, parse, and compile the sources. I believe the file+src for normal loads/requires and file+line+binding+src would be sufficient. The method name surrounding the eval is not provided there but I believe it can be inferred from the binding. Is more direct access needed?

Then there's the actual return value, or what form the source is in. This is one area that's not clear to me. If this API is expected to be somewhat portable, I can only see a few possible options: raw source, lexer/parser events for that source as from ripper, or some third offline parsed result as from the parser gem.

This last one would be unusual to provide without making the related library available, and perhaps less useful for intercepting sources since no implementation directly consumes that format...we'd have to reverse the modified parsed output back into source.

Then there's ripper events, but they would only be useful for manipulating the parsed result if you can basically reproduce those events in some new order or composition and feed them back into the lex/parse logic. That functionality does not exist right now.

I think we end up with raw sources, so all you're really intercepting is the source after it has been resolved (in the case of load/require) and allowing it to be modified en route to the parser. Consumers of the API can then use some other parsing API (parser or ripper) or source-manipulation API to produce new source.

@enebo You had some concerns about the asymmetry of (file, src) -> src, but the caller already knows the filename and unless we're talking about being able to redirect load/require lookups to another file (directly, without loading that other file manually) I don't think it's necessary to do anything else. The filter functions translate from a given file with given source (and a starting line number and binidng in the eval case) into some other source to process for that file. Seems ok?

@MaxLap
Copy link
Author

MaxLap commented Jun 11, 2018

The short answers: deep-cover is more powerful than the Coverage module.

In a more detailed explanation of what deep-cover is:

It has branch coverage and is compatible with MRI 2.1 and up. We also go deeper than branch coverage, detecting coverage per node from the AST. Doing this allows us to detect when things were interrupted by an exception or a throw. So if you do hello.world, but there is no world method (or the hello method throws), we actually detect and show that world wasn't executed. (Yes, this feels somewhat more edge-case, but it is a generalization of branch coverage when you think of an exception/throw as a special form of branching).

Our branch coverage also considers short-circuiting operators, so || and &&, which are not part of the branch coverage introduced in MRI 2.5.

While a little bit raw, we also detect default method arguments that were never used. We have ideas for more things to cover once the foundation is more robust.

deep-cover has a mode where it replaces the Coverage module. It behaves the same with compatible output, but the coverage is stricter.

If deep-cover can work well with JRuby, then there would be no need for you guys to develop your own Branch coverage like MRI did. It's a bit of a shame that we were too late. If we started work earlier, it's possible that MRI wouldn't have spent time doing basically the same thing as us.


Coming back to the subject:

(file, src) -> src is basically what our core does, so that perfect for us. (And we use the parser gem as you suggest)

I don't think there was anything else I had to comment on. Your reasoning made sense.

@headius headius added this to the JRuby 9.2.1.0 milestone Jun 11, 2018
@enebo
Copy link
Member

enebo commented Jun 11, 2018

@headius my complaint about symmetry I am no longer bothered by it. It looked worse in the variation where I was also passing linenumber. In that case it feels like you should be able to tweak linenumber but filename I don't think you really could since loadservice already is adding an entry for it and so forth.

Ok good. I think our version should start as a module JRuby::Parser and if MRI and TR get behind this then we can extend whatever module is the standard one later once we come to agreement.

@headius
Copy link
Member

headius commented Jun 12, 2018

I'm having trouble with JRuby::Parser as the namespace. This is certainly related to the parser, but the hooks are explicitly fired before parsing. It's more like a load/require/eval hook.

I'm thinking it's a good time to introduce JRuby::VM as a module for APIs internal to JRuby. It matches up better with RubyVM in CRuby, and it is general enough to hold these hook methods.

We also still have JRuby::Util. Perhaps it's better to use that namespace rather than introduce a new one?

@headius
Copy link
Member

headius commented Jun 12, 2018

Ok, work in progress but I hit a snag. What encoding should the source string be in? Because the parser will not have run yet, we do not know if there's a magic comment. We could process just that much of the file, but I wonder whether we should expect the consumers of this API to do that processing and deal only in raw bytes.

If we are looking for equivalence, CRuby has already parsed the file and so it keeps all encodings of all strings in the iseq. That would indicate that we should provide a properly-encoded string to the callbacks. What then if they modify the string to have a different magic encoding comment?

The hooks are not difficult to add to LoadService. I'll push a prototype that just uses raw (BINARY) bytes for now.

@headius
Copy link
Member

headius commented Jun 12, 2018

Ok I've pushed an initial prototype. There's some limitations at this point.

  • We initiate the read of the source in many places, many different ways. The change in LibrarySearcher.ResourceLibrary seemed to be the only one needed to hook a require from the filesystem. The changes in LoadServiceResource I believe apply only to loads? It's unclear to me...this code has been refactored many times since I dug into it.
  • The hook must return a string. Due to a limitation in our Java integration subsystem, the RubyString-returning interface can not handle a RubyNil (nil) result and will just blindly cast the hook's result to RubyString.
  • The logic is a bit cumbersome, partly due to the above two items, and partly because there's no clean sequence to hook during loads and requires.
  • eval is not yet hooked.

@headius
Copy link
Member

headius commented Jun 12, 2018

Example output loading Rake:

$ jruby -I core/src/main/ruby/ -e 'JRuby::VM.add_load_hook {|file, src| puts "#{File.basename(file)} = #{src[0..19].inspect}"; src}; require "rake"'
rake.rb = "#--\n# Copyright 2003"
version.rb = "module Rake\n  module"
fileutils.rb = "# frozen_string_lite"
singleton.rb = "# frozen_string_lite"
optparse.rb = "# frozen_string_lite"
ostruct.rb = "# frozen_string_lite"
module.rb = "\n# TODO: remove in R"
string.rb = "require 'rake/ext/co"
core.rb = "class Module\n  # Che"
time.rb = "#--\n# Extensions to "
early_time.rb = "module Rake\n\n  # Ear"
late_time.rb = "module Rake\n  # Late"
win32.rb = "\nmodule Rake\n  requi"
alt_system.rb = "#\n# Copyright (c) 20"
linked_list.rb = "module Rake\n\n  # Pol"
cpu_counter.rb = "module Rake\n\n  # Bas"
scope.rb = "module Rake\n  class "
task_argument_error.rb = "module Rake\n\n  # Err"
rule_recursion_overflow_error.rb = "\nmodule Rake\n\n  # Er"
rake_module.rb = "require 'rake/applic"
application.rb = "require 'shellwords'"
shellwords.rb = "# frozen-string-lite"
task_manager.rb = "module Rake\n\n  # The"
file_list.rb = "require 'rake/clonea"
cloneable.rb = "module Rake\n  ##\n  #"
file_utils_ext.rb = "require 'rake/file_u"
file_utils.rb = "require 'rbconfig'\nr"
pathmap.rb = "# TODO: Remove in Ra"
thread_pool.rb = "require 'thread'\nreq"
set.rb = "# frozen_string_lite"
promise.rb = "module Rake\n\n  # A P"
thread_history_display.rb = "require 'rake/privat"
private_reader.rb = "module Rake\n\n  # Inc"
trace_output.rb = "module Rake\n  module"
pseudo_status.rb = "module Rake\n\n  ##\n  "
task_arguments.rb = "module Rake\n\n  ##\n  "
invocation_chain.rb = "module Rake\n\n  # Inv"
task.rb = "require 'rake/invoca"
invocation_exception_mixin.rb = "module Rake\n  module"
file_task.rb = "require 'rake/task.r"
file_creation_task.rb = "require 'rake/file_t"
multi_task.rb = "module Rake\n\n  # Sam"
dsl_definition.rb = "# Rake DSL functions"
default_loader.rb = "module Rake\n\n  # Def"
name_space.rb = "##\n# The NameSpace c"
backtrace.rb = "module Rake\n  module"

@headius
Copy link
Member

headius commented Jun 12, 2018

Oh, I see we probably want the full path to the file also.

@headius
Copy link
Member

headius commented Jun 12, 2018

Improved the path returned.

$ jruby -e 'JRuby::VM.add_load_hook {|file, src| puts "#{file} = #{src[0..19].inspect}"; src}; require "rake"'
/Users/headius/projects/jruby/lib/ruby/stdlib/rake.rb = "#--\n# Copyright 2003"
/Users/headius/projects/jruby/lib/ruby/stdlib/rake/version.rb = "module Rake\n  module"
/Users/headius/projects/jruby/lib/ruby/stdlib/fileutils.rb = "# frozen_string_lite"
/Users/headius/projects/jruby/lib/ruby/stdlib/singleton.rb = "# frozen_string_lite"
/Users/headius/projects/jruby/lib/ruby/stdlib/optparse.rb = "# frozen_string_lite"
/Users/headius/projects/jruby/lib/ruby/stdlib/ostruct.rb = "# frozen_string_lite"

Note that many of JRuby's own resources will be presented as URLs, such as uri:classloader:/jruby/set.rb in this case.

headius added a commit that referenced this issue Jun 12, 2018
@eregon
Copy link
Member

eregon commented Jun 12, 2018

Re namespace, I think RubyVM is supposed to be defined only on CRuby, so I guess we should use a different namespace.

@enebo
Copy link
Member

enebo commented Jun 12, 2018

@headius I don't care too much about using Parser but originally I had thought that load/require/eval are all either posthooks of loadservice or prehooks of parsing. Picking VM just broadens the API. If we do standardize with other Impls we cannot 'extend JRuby::VM' into the standardized name. OTOH we can just duplicate this stuff into whatever name is chosen or extend from whatever is chosen back to JRuby::VM.

@enebo
Copy link
Member

enebo commented Jun 12, 2018

@headius in retrospect does require hook vs load hook even make sense? I see you added load_hook but I guess a require internally causes a load.

@headius
Copy link
Member

headius commented Jun 12, 2018

we cannot 'extend JRuby::VM' into the standardized name

I don't understand this. How is JRuby::VM any different from JRuby::Parser in this regard?

In any case I figured if this feature (source to source pre-translation) were ever approved for standard Ruby, we'd just call our method from the actual method.

does require hook vs load hook even make sense

As far as actually loading and parsing the code there's no difference, so maybe not? Load and require differ only in how they search for the file (load requires file extension) and whether they update LOADED_FEATURES (load does not). Neither of these have much relevance for source-to-source-translation before parsing.

load/require/eval are all either posthooks of loadservice or prehooks of parsing

As pre-hooks it does make some sense to be JRuby::Parser. I'm struggling with naming of the method though. preparse? parser_hook or hook_parser? I lean toward longer names but that's not the Ruby way; add_parser_filter or add_source_filter? JRuby::Parser.preprocess?

@enebo
Copy link
Member

enebo commented Jun 12, 2018

JRuby::VM is only different in that I can see us adding more methods unrelated to this particular feature as it is pretty generic namewise. With that said, of course we can just call into this method with a thin wrapper method. I just found if we bounded the callbacks into a single namespace it would be more aesthetically pleasing to integrate into a standard later. No big deal though.

We also want to potentially designate between eval and file(require/load). Not sure on good name either. I would probably choose a slightly longer name as this is more internal for Ruby (and we have precedence for things like abort_on_exception).

@headius
Copy link
Member

headius commented Jun 12, 2018

if we bounded the callbacks into a single namespace it would be more aesthetically pleasing to integrate

I'd be reluctant to mix in a JRuby::Whatever module directly into an official user-facing API anyway. In any case I doubt the official method names (if they ever happen) would match ours.

add_load_hook and add_eval_hook?

@enebo
Copy link
Member

enebo commented Jun 12, 2018

Yeah sounds good

@eregon
Copy link
Member

eregon commented Jun 13, 2018

When thinking about names for this feature, I immediately think it "transforms the Ruby source", right before it's parsed.
"load hook" sounds more general and might be a used to e.g. just log load/require paths as they happen, but not use or touch the source at all.

@headius
Copy link
Member

headius commented Jun 13, 2018

@eregon Yeah I had that concern too...like a load hook would be used to redirect to a different file or change the lookup process.

I'm thinking along the lines of add_source_transform.

@enebo
Copy link
Member

enebo commented Jun 13, 2018

transform_source(src, file, is_eval)

@headius
Copy link
Member

headius commented Jun 26, 2018

@enebo Ok so the remaining problem I have is all the places where we do parsing of source. The tweaks I have in my commit above feel hacky and too spread out, and I only covered the few cases that I saw during my experiment. Maybe you have some thoughts here? Maybe we need some refactoring to funnel these disparate paths through fewer places?

@MaxLap
Copy link
Author

MaxLap commented Nov 22, 2018

Time flies! Could we have an update? Any idea when you may have the time to work on this?

@enebo enebo modified the milestones: JRuby 9.2.5.0, JRuby 9.2.6.0 Dec 6, 2018
@headius
Copy link
Member

headius commented Dec 18, 2018

@MaxLap Hey there, we've been kinda heads-down just trying to stabilize 9.2 the past couple months. I haven't forgotten this but we have not had time to look into it. Maybe we could chat about you trying to prototype something with us?

@headius headius modified the milestones: JRuby 9.2.6.0, JRuby 9.3.0.0 Dec 18, 2018
@eregon
Copy link
Member

eregon commented Dec 27, 2018

@MaxLap This feature somewhat relates to https://bugs.ruby-lang.org/issues/15287#note-10
I think it would be useful to create a feature request on https://bugs.ruby-lang.org/ to have a standard way to do this.

load_iseq, or rather the actual method now named RubyVM::InstructionSequence.load_from_binary seems a fairly indirect way to achieve this (notably, interacting with bytecode when there should be no need), and I'm not sure it's even documented how it works and whether it will stay compatible with the current behavior.
https://bugs.ruby-lang.org/issues/11788#note-3 mentions:

This interface is not matured and is not extensible.
So that we don't guarantee the future compatibility of this method.
Basically, you shouldn't use this method.

@ioquatix
Copy link
Contributor

ioquatix commented Apr 8, 2019

I've personally used load_iseq in ScoutApm (I implemented the automatic instrumentation by code rewriting). I think it's a fairly crappy interface and I'd like to see something much better standardised. The InstructionSequence class can be completely opaque - just need access to source code for rewriting. Just FYI.

@eregon
Copy link
Member

eregon commented Apr 8, 2019

The InstructionSequence class can be completely opaque - just need access to source code for rewriting. Just FYI.

If we are rewriting Ruby code as a String to Ruby code as a String, then InstructionSequence wouldn't be a good name for it.
Actually, RubyVM::InstructionSequence is currently the worst possible namespace to add a portable feature, I'm proposing the ExperimentalFeatures namespace in https://bugs.ruby-lang.org/issues/15752

@headius
Copy link
Member

headius commented Apr 7, 2021

Not much has changed here. CRuby continues to publish CRuby-specific features under RubyVM, which people then use and turn into de-facto standard features. If anyone more familiar with this particular feature can provide an update, it would be very helpful to me. As it stands this will likely get punted to 9.4 or closed altogether.

@headius headius modified the milestones: JRuby 9.3.0.0, JRuby 9.4.0.0 Jul 1, 2021
@headius headius modified the milestones: JRuby 9.4.0.0, JRuby 9.5.0.0 Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants