back.pysim performance improvements #348

sjolsen · 2020-04-07T08:07:35Z

I'm not sure how high a priority the Python simulation backend is or whether it's worth making the structural changes to hdl.ast to optimize hashing, but I figured I would share what I've figured out so far in case it's useful. Please let me know what you think.

The changes here fall roughly into the following categories:

Fixing what I believe to be the intent behind curr/next comparisons, and avoiding doing work where (I think) it's known there is no work to be done
Optimizing the hashing of Value/Signal/Statement key objects, since this is done in a few very frequently traversed code paths
Caching compilation results (from hdl.ast to what I believe is Python bytecode)
Micro-optimizations. I think the only one that made it through was the VCD hash table lookup, and it's pretty marginal at this point

I used the following simulation to make performance measurements: sim_perf_test.py. I have perf data for each commit in the chain, both with and without VCD output enabled. I can see about uploading them somewhere if there's interest, but in the meantime here's a summary:

With no VCD output:

Commit	Description	Time elapsed (s)	Reduction (incr.)	Reduction (cum.)
`bb1bbcc`	hdl.mem: fix source location of ReadPort.en.	116.8	--	--
`52e30bc`	back.pysim: Fix curr=next optimizations	90.77	22.29%	22.29%
`6902c99`	hdl.ast: Intern value/statement keys off the hot path	80.74	11.05%	30.87%
`642e477`	back.pysim: Cache compilation results	32.17	60.16%	72.46%
`d1fe3c6`	back.pysim: Reuse clock simulation commands	25.29	21.39%	78.35%
`6540568`	back.pysim: Eliminate duplicate dict lookup in VCD update	24.88	1.62%	78.70%

With VCD output enabled:

Commit	Description	Time elapsed (s)	Reduction (incr.)	Reduction (cum.)
`bb1bbcc`	hdl.mem: fix source location of ReadPort.en.	152.6	--	--
`52e30bc`	back.pysim: Fix curr=next optimizations	127.4	16.51%	16.51%
`6902c99`	hdl.ast: Intern value/statement keys off the hot path	118	7.38%	22.67%
`642e477`	back.pysim: Cache compilation results	66.94	43.27%	56.13%
`d1fe3c6`	back.pysim: Reuse clock simulation commands	57.16	14.61%	62.54%
`6540568`	back.pysim: Eliminate duplicate dict lookup in VCD update	53.25	6.84%	65.10%

I did run the nMigen unit tests, but I haven't figured out yet how to set up sby on Windows so I don't know whether those tests are broken by my changes.

whitequark · 2020-04-07T08:17:59Z

Fixing what I believe to be the intent behind curr/next comparisons

Curr/next comparisons implement the two-phase simulation algorithm that is deterministic in presence of asynchronous code. I looked at your fix and it will lead to incorrect simulation results. I can explain the algorithm if necessary.

Optimizing the hashing of Value/Signal/Statement key objects, since this is done in a few very frequently traversed code paths

I looked at your changes and while the intent seems reasonable, the patch seems fairly invasive at first glance. I'll take another look later.

Caching compilation results (from hdl.ast to what I believe is Python bytecode)

Extremely good change, one I was going to make for some time but never got around to.

Micro-optimizations

Also looks all reasonable.

To summarize, 642e477, d1fe3c6 and 6540568 are mergeable as-is, 6902c99 needs more discussion, and 52e30bc is unsound.

codecov · 2020-04-07T08:19:33Z

Codecov Report

Merging #348 into master will increase coverage by 0.47%.
The diff coverage is 84.47%.

@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   82.69%   83.17%   +0.47%     
==========================================
  Files          35       35              
  Lines        5919     5991      +72     
  Branches     1201     1179      -22     
==========================================
+ Hits         4895     4983      +88     
+ Misses        863      859       -4     
+ Partials      161      149      -12

Impacted Files	Coverage Δ
nmigen/hdl/ast.py	`90.97% <82.87%> (+2.44%)`	⬆️
nmigen/back/pysim.py	`91.16% <100.00%> (+0.10%)`	⬆️
nmigen/hdl/rec.py	`98.73% <100.00%> (+0.03%)`	⬆️
nmigen/build/run.py	`31.25% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e40dc0...f38654f. Read the comment docs.

whitequark · 2020-04-07T08:21:08Z

nmigen/hdl/ast.py

@@ -124,6 +124,14 @@ def cast(obj):
            return Const(obj.value, Shape.cast(type(obj)))
        raise TypeError("Object {!r} cannot be converted to an nMigen value".format(obj))

+    @staticmethod
+    def castable(obj):


A patch dedicated to optimization should not introduce a new public API.

sjolsen · 2020-04-07T08:39:36Z

Thanks!

Curr/next comparisons implement the two-phase simulation algorithm that is deterministic in presence of asynchronous code. I looked at your fix and it will lead to incorrect simulation results. I can explain the algorithm if necessary.

I'm not sure I understand, but it's easy enough to revert. Would you prefer I modify the history to keep it linear or layer revert commits etc. on top of what's already here?

I looked at your changes and while the intent seems reasonable, the patch seems fairly invasive at first glance. I'll take another look later.

Yeah, I wasn't entirely sure how to make this less invasive. A big part of the performance gain was from eliminating glue logic from what was previously ValueKey/SignalKey.

To summarize, 642e477, d1fe3c6 and 6540568 are mergeable as-is, 6902c99 needs more discussion, and 52e30bc is unsound.

As it is, 642e477 depends on 6902c99 -- that was the motivation for the changes to ValueKey. I actually started with repr(value) as the key, but there was still a fair bit of overhead there.

A patch dedicated to optimization should not introduce a new public API.

Should that specific method be named _castable? Or should it go somewhere else?

whitequark · 2020-04-07T08:51:55Z

I'm not sure I understand, but it's easy enough to revert.

The basic idea is that:

eval only looks at curr and only modifies next;
commit is the only one who can assign from next to curr;
the eval/commit loop continues while there are any changes.

The reason it's arranged like that is because rules (1) and (2) ensure that no matter in which order eval and commit are run (i.e. no matter how any particular netlist is translated, how it's arranged into modules, etc), each full eval/commit cycle will have the exact same result. Rule (3), by iterating to fixed point, ensures that all updates in a netlist fully propagate before advancing time--for example, consider a case where you have two domains with asynchronous reset chained, and you reset the first one. This will require two delta cycles (eval/commit calls).

No part of this scheme is an "optimization"; it's solely used for correctness.

Would you prefer I modify the history to keep it linear or layer revert commits etc. on top of what's already here?

I strongly prefer keeping the history linear.

Yeah, I wasn't entirely sure how to make this less invasive. A big part of the performance gain was from eliminating glue logic from what was previously ValueKey/SignalKey.

Should that specific method be named _castable? Or should it go somewhere else?

I'm not sure yet. I'll need to understand your proposed changes in detail to see if there is a less invasive way to do the same thing, or if there's really no other choice.

sjolsen · 2020-04-07T09:18:17Z

I reverted the unsound curr/next change and rearranged the interning API to conserve the existing ValueKey and SignalKey classes. I think all the new abstractions should now be hidden.

whitequark · 2020-04-07T09:29:40Z

Thanks. I've cherry-picked the second two commits. I expect it will take a while to convince myself that the proposed interning changes are the way forward. I appreciate that you rearranged them to have no impact to the public API, but the bulk of my concern has to do with the increased internal complexity as well as soundness in face of mutability of AST nodes: I am hesitant to introduce any sort of persistent hashing while properties of most AST nodes remain writable.

Have you considered adding fast paths to the simulator, for the case where bench code operates on simple signals alone? This removes the need for caching bytecode, is an easy to understand change with minimal impact, and should yield most of the performance benefit.

sjolsen · 2020-04-07T10:04:44Z

I expect it will take a while to convince myself that the proposed interning changes are the way forward

Understood. I'm not necessarily expecting to get this specific design upstreamed, it's just where I ended up after probing for performance improvements.

the bulk of my concern has to do with the increased internal complexity as well as soundness in face of mutability of AST nodes: I am hesitant to introduce any sort of persistent hashing while properties of most AST nodes remain writable.

That makes sense. Do you anticipate the AST becoming immutable?

Have you considered adding fast paths to the simulator, for the case where bench code operates on simple signals alone? This removes the need for caching bytecode, is an easy to understand change with minimal impact, and should yield most of the performance benefit.

I'm not sure I follow. Are you thinking along the lines of falling back to a simpler scheduler if only the default sync and comb domains are used? Or did you have something else in mind?

sjolsen · 2020-04-07T11:42:25Z

The basic idea is that [...]

Thank you for explaining this, BTW. The eval/commit/loop approach makes perfect sense to me. I think I see at least one way the unsoundness of 52e30bc could manifest: if one invocation sets next != curr, and a second invocation sets it back to curr, it will be left with the wrong value for next. That's important because a single fragment could set next repeatedly.

I retraced my steps and realized that the call to pending.clear was the only part of that commit that actually mattered for performance. The rest of the changes were a result of trying to micro-optimize _SignalState.commit. If I'm not (still) very much mistaken, calling pending.clear should be correct since at that point in _SimulatorState.commit, every pending update from the current eval/commit loop has been effected, and the next eval step will set pending if necessary. What do you think?

whitequark · 2020-04-07T11:45:27Z

Do you anticipate the AST becoming immutable?

This is a bit of a complex issue. At the very beginning, I designed nMigen so that it never mutates ASTs (and it still doesn't). When it needs to change an AST, it emits a new one with the necessary elements changed (the FragmentTransformer stuff). I knew this could be a performance issue when writing it, but I didn't realize how just how slow it would be.

This slowness could be fixed in two ways: a simpler but more flawed one--which is to copy the entire AST once, eliminating any sharing, and then mutating it; and a more complex but principled one--which is to not change ASTs in first place (with or without mutation), and instead treat ClockSignal, ResetSignal, domain names, etc similar to variable bindings. That is, instead of having DomainRenamer alter the actual AST, it would alter a binding table carried along with it, with this table specifying which domain becomes which after every transformation is applied.

The principled approach potentially has very significant benefits. It would dramatically improve performance on all designs but especially large generic ones. It would eliminate multiple complicated corner cases from the internal machinery. It would bring the internals of nMigen closer to well understood and well behaved compiler development principles, namely using variable bindings for late bound names (there's nothing wrong in principle with term rewriting it's currently doing, but this term rewriting is very ad-hoc and hard to analyze). It's also a lot of work.

There is a separate issue, which is the user mutability of ASTs, whether interior or exterior. For most expression nodes (everything except Signal, ArrayProxy, and UserValue), this is clearly undesirable. For ArrayProxy and UserValue, it's just interior mutability, and there's already code to cope with it, which should be largely sufficient. For Signal, the story is a bit complex--I believe it's never been "officially" supported, but you can mutate Signals and the results are currently correct, provided you never use them in expressions before you stop mutating them. This is used internally in nMigen in the FSM generator code, so it seems like a valuable capability and we might want to enforce that invariant in a future immutable AST.

To summarize, I think the AST should be immutable, but we're not quite there yet, and there is at least one corner case that currently presents an API with a soundness issue that would likely need to be handled explicitly.

whitequark · 2020-04-07T11:47:41Z

That's important because a single fragment could set next repeatedly.

Yup, and there are many cases where this happens--for example each time you assert reset.

If I'm not (still) very much mistaken, calling pending.clear should be correct since at that point in _SimulatorState.commit, every pending update from the current eval/commit loop has been effected, and the next eval step will set pending if necessary. What do you think?

Could you show a patch that does what you want? I'm not quite sure exactly what change you're referring to, and it seems like preparing a patch isn't a lot of effort here.

whitequark · 2020-04-07T12:07:16Z

back.pysim: Clear pending updates after they are effected

Oh... did I just completely forget to remove elements from self.pending anywhere at all? Well that would make things a lot slower than they should be!!

whitequark · 2020-04-07T12:08:00Z

I'm curious, could you measure the perf impact? I'm really glad you noticed this, it's a really silly mistake with severe consequences.

sjolsen · 2020-04-08T00:08:30Z

I also cherry-picked it onto master:

* 555c23c8 back.pysim: Clear pending updates after they are effected
| * d891a794 back.pysim: Clear pending updates after they are effected
| * daa616f0 back.pysim: Eliminate duplicate dict lookup in VCD update
| * 1b60dbda ...
| * d698e6f4 ...
| * 3f466e89 ...
|/
* bb1bbcc5 hdl.mem: fix source location of ReadPort.en.

Here's what I got with VCD output disabled:

Commit	Description	Time elapsed (s)	Reduction (incr.)	Reduction (cum.)
`bb1bbcc`	hdl.mem: fix source location of ReadPort.en.	152.6	--	--
...	...	...	...	...
`daa616f`	back.pysim: Eliminate duplicate dict lookup in VCD update	47.24	...	69.04%
`d891a79`	back.pysim: Clear pending updates after they are effected	25.05	46.97%	83.58%

Commit	Description	Time elapsed (s)	Reduction (incr.)	Reduction (cum.)
`bb1bbcc`	hdl.mem: fix source location of ReadPort.en.	152.6	--	--
555c23c8	back.pysim: Clear pending updates after they are effected	87.41	42.72%	42.72%

whitequark · 2020-04-08T14:11:25Z

I've cherry-picked commit d891a79, thank you!

I'm not sure I follow. Are you thinking along the lines of falling back to a simpler scheduler if only the default sync and comb domains are used? Or did you have something else in mind?

Something else. You can notice that most simulator commands are along the lines of yield sig or yield sig.eq(123). Which means that by special-casing just these two patterns we can dramatically speed up simulation without any of the interning complexity.

I think @jordens proposed (in #228) an improvement to the simulator interface that would make all commands except for these two impossible to express. I'm not yet fully convinced that this is the way to go, and we'll have to keep the existing code for Migen compatibility, but you can see that this special case is less narrow than it looks at first.

sjolsen · 2020-04-09T00:55:18Z

I integrated the cherry-picked changes to clean up the PR in case you decide to do something with it. One thing that occurred to me that I haven't bothered to change is that the compilation cache should probably be scoped to the simulator rather than the coroutine process.

Something else. You can notice that most simulator commands are along the lines of yield sig or yield sig.eq(123). Which means that by special-casing just these two patterns we can dramatically speed up simulation without any of the interning complexity.

That makes a lot of sense. Speaking of the simulator interface, something I've thought of requesting is a way to wait asynchronously for a signal change from a coroutine process. It sounds from #228 like you already have some idea of how you'd like to do that? FWIW I found Value hashing useful for hacking together (emphasis on "hack") an implementation using _FragmentCompiler.

whitequark · 2020-04-09T03:52:02Z

It sounds from #228 like you already have some idea of how you'd like to do that?

Internally, this is easy to do. The external API needs to be designed.

whitequark · 2020-04-15T08:31:55Z

As per discussion above, I am closing this PR; I would encourage you (or anyone else) to submit another PR with a less invasive approach I suggested above. If it turns out that (contrary to my expectations) the simpler approach does not bring similar performance benefits, then I'm going to reevaluate the approach in this PR. Thanks for doing this work!

whitequark reviewed Apr 7, 2020

View reviewed changes

sjolsen force-pushed the perf branch from 6540568 to daa616f Compare April 7, 2020 09:16

sjolsen added 2 commits April 8, 2020 17:29

hdl.ast: Intern value/statement keys off the hot path

Unverified

This user has not yet uploaded their public signing key.

GPG key ID: 5BB1DDBD39060507

Learn about vigilant mode

c6a735a

back.pysim: Cache compilation results

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23
Expired

Verified
Learn about vigilant mode

f38654f

sjolsen force-pushed the perf branch from d891a79 to f38654f Compare April 9, 2020 00:32

whitequark closed this Apr 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

back.pysim performance improvements #348

back.pysim performance improvements #348

sjolsen commented Apr 7, 2020

whitequark commented Apr 7, 2020 •

edited

Loading

codecov bot commented Apr 7, 2020 •

edited

Loading

whitequark Apr 7, 2020

sjolsen commented Apr 7, 2020

whitequark commented Apr 7, 2020 •

edited

Loading

sjolsen commented Apr 7, 2020

whitequark commented Apr 7, 2020

sjolsen commented Apr 7, 2020

sjolsen commented Apr 7, 2020

whitequark commented Apr 7, 2020

whitequark commented Apr 7, 2020

whitequark commented Apr 7, 2020

whitequark commented Apr 7, 2020

sjolsen commented Apr 8, 2020

whitequark commented Apr 8, 2020 •

edited

Loading

sjolsen commented Apr 9, 2020

whitequark commented Apr 9, 2020

whitequark commented Apr 15, 2020

back.pysim performance improvements #348

back.pysim performance improvements #348

Conversation

sjolsen commented Apr 7, 2020

whitequark commented Apr 7, 2020 • edited Loading

codecov bot commented Apr 7, 2020 • edited Loading

Codecov Report

whitequark Apr 7, 2020

Choose a reason for hiding this comment

sjolsen commented Apr 7, 2020

whitequark commented Apr 7, 2020 • edited Loading

sjolsen commented Apr 7, 2020

whitequark commented Apr 7, 2020

sjolsen commented Apr 7, 2020

sjolsen commented Apr 7, 2020

whitequark commented Apr 7, 2020

whitequark commented Apr 7, 2020

whitequark commented Apr 7, 2020

whitequark commented Apr 7, 2020

sjolsen commented Apr 8, 2020

whitequark commented Apr 8, 2020 • edited Loading

sjolsen commented Apr 9, 2020

whitequark commented Apr 9, 2020

whitequark commented Apr 15, 2020

whitequark commented Apr 7, 2020 •

edited

Loading

codecov bot commented Apr 7, 2020 •

edited

Loading

whitequark commented Apr 7, 2020 •

edited

Loading

whitequark commented Apr 8, 2020 •

edited

Loading