Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Principia is slow on macOS (possibly due to Unity's allocator) #2899

Closed
rnlahaye opened this issue Feb 24, 2021 · 8 comments
Closed

Principia is slow on macOS (possibly due to Unity's allocator) #2899

rnlahaye opened this issue Feb 24, 2021 · 8 comments

Comments

@rnlahaye
Copy link
Contributor

TLDR: Unity seems to have a custom allocator that uses mutexes. Mutexes are slow on macOS. Consequently, on macOS the vast majority of Principia's time seems to be spent on memory management.

After getting back into KSP after a hiatus, I found the game to have poor performance. Running a trace led me to discover a bug in Vessel::RepeatedlyFlowPrognostication but even after fixing it (#2898), the poor performance persisted. Sometimes the game would even pause completely for several seconds. Further traces revealed the problem was due to four threads concurrently evaluating OrbitAnalyser::AnalyseOrbit. Profiling an AnalyseOrbit benchmark revealed nothing amiss.
However, I was able to get a flame graph (attached) of a running game which revealed the problem:
Screen Shot 2021-02-23 at 6 05 12 PM
The regions highlighted in magenta are mutexes used by functions in UnityPlayer.dylib. Based on the placement of the UnityPlayer.dylib functions in the call stack (and the fact that one of them is operator new), I suspect they are a custom allocator used by Unity. Unfortunately, the performance of the stock mutex on macOS is known to be bad (#1955). Thus, terrible performance. The reason this was not apparent in benchmarks is probably because the benchmarks are not run from Unity and hence use the system allocator. This isn't restricted to AnalyzeOrbit either; the rest of Principia was also affected. I estimate that over 80% of Principia's CPU cost on my machine is spent managing these mutexes!

Something similar was diagnosed and fixed in #1955. Unfortunately, this time the slow mutexes are not Principia's but Unity's mutexes. Replacing the misbehaving mutexes with absl::Mutex is not really an option here.

I will try to fix this by forcing Principia to use the system allocator even when run from Unity. Stay tuned.

Possibly related to #2247.

@pleroy
Copy link
Member

pleroy commented Feb 24, 2021

This is a very interesting finding, and it would explain why people keep complaining about performance on macOS (we don't play on that platform).

What is not clear to me is if the Great Mutex of Death™ is in the Unity custom allocator or deeper, in the system malloc. You seem to have written a benchmark for AnalyzeOrbit, did you assert that it's working fine with multiple threads (and no Unity in sight)? Surely "Unity is stupid" is the most parsimonious explanation, but macOS has been known to make some bad decisions too.

@rnlahaye
Copy link
Contributor Author

Update: I modified the AnalyseOrbit benchmark to use multiple threads. It remained well-behaved.

@rnlahaye
Copy link
Contributor Author

I was able to replace the allocators in VesselSet and DiscreteTrajectoryTraits::Timeline with the macOS system allocator (using the allocator template parameter). I verified that the "new" allocator was being called and that it was working efficiently as intended. Here is a new version of the previous flame graph:
improved_flamegraph
The large tower in the right side of the old graph has completely vanished, and the smaller towers on the left and in the center have shrunk significantly.
Here is the same graph with the new alloc code highlighted:
new_alloc
It is much less expensive than the old code, as desired.

The next step is of course to replace the allocator in all the other places. I will experiment with this tomorrow.

@pleroy
Copy link
Member

pleroy commented Feb 25, 2021

Very nice progress, congratulations!

The next step is of course to replace the allocator in all the other places. I will experiment with this tomorrow.

Note that it may or may not be necessary to change all the containers: my guess is that a small number of data structures (mostly ContinuousTrajectory and DiscreteTrajectory) would give you 90% of the benefits. It's hard to believe that, for instance, the allocation of Manœuvres in the FlightPlan would cost anything (after all, they are just a reflection of UI actions). On the other hand, the introduction of a custom allocator is probably not that intrusive, so it might make sense to be systematic. I'll trust your judgement.

@rnlahaye
Copy link
Contributor Author

I tried overloading global operator new and operator delete. This did not work; KSP crashed trying to load the main menu when my allocator tried to free some object allocated by Unity. I went ahead and overloaded the allocator template parameter in more places. It's messy, but it works. Almost there.
almost

@rnlahaye
Copy link
Contributor Author

I was able to replace std::allocator with custom allocator without modifying the existing source by first defining some aliases in namespace principia::std in a header file, and then including that header file via compiler argument. This produces a similar flame graph to one from the previous comment but with much less mess.

Unfortunately, unique_ptrs are not affected by this approach as they do not use allocators (they only contain a deleter template parameter). I don't see any way to safely fix the unique_ptrs without changing std::unique_ptr to some non-std unique_ptr in various places—even if I somehow override the default deleter, there is no guarantee the pointer passed into the unique_ptr was constructed using the correct allocator. Luckily I can get away with doing this only in a few places, as @pleroy notes.

@rnlahaye
Copy link
Contributor Author

I've gotten to a state that I'm happy with (I no longer see Unity code being called by Principia code in traces). The only necessary unique_ptr replacements were those pertaining to ContinuousTrajectory and DiscreteTrajectory, as predicted. I'll clean up my changes and send them for review.

The game is much more playable now :)

@pleroy
Copy link
Member

pleroy commented Jan 29, 2022

I am going to close this bug because we agreed that #2901 and #2906 were problematic, so there is probably not much more that can be done regarding the Unity allocator.

@pleroy pleroy closed this as completed Jan 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants