Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Building on MacOS fails with "bus error: 10" #9386

Closed
TrueBrain opened this issue Jun 19, 2021 · 17 comments · Fixed by #9391 or #9421
Closed

Building on MacOS fails with "bus error: 10" #9386

TrueBrain opened this issue Jun 19, 2021 · 17 comments · Fixed by #9391 or #9421

Comments

@TrueBrain
Copy link
Member

TrueBrain commented Jun 19, 2021

Version of OpenTTD

Every commit after bf500c3

Expected result

Being able to compile on MacOS with RelWithDebInfo.

Actual result

Compiling on MacOS fails with "Bus error: 10" when using RelWithDebInfo.

Steps to reproduce

Build on MacOS. Or see our nightly .. it fails every night.

@TrueBrain
Copy link
Member Author

TrueBrain commented Jun 19, 2021

I did some initial investigation into this, and there are a few things to observe:

Bisecting shows the above commit is the first commit after which it breaks. This doesn't mean it is the direct cause, but I will get back to that in a bit.
"Bus error: 10" on MacOS means, from what I understand: cannot allocate memory. It happens during linking of clang, and more specifically, I strongly suspect it happens during LTO.

Why during LTO? Well, when I compile on GCC with RelWithDebInfo I get a warning about var-tracking-assignments, where it exceeded the threshold to track that many variables. And this started to happen on the same commit.

So I would reason the two events are related. We might already have been pushing settings.cpp to the limits of compilers, as it contains the full settings table and the file itself is already large too. But since the above mentioned commit, things seem to have reached a tipping point. GCC started to throw a warning that it really doesn't fancy tracking so many variables .. and MacOS seems to just crash, most likely because it ran out of memory.

I have not looked into solutions; I just did some initial triage on this problem. I am 99.99% sure that commit is causing the issue in a direct sense, but it might already have been an issue for years, just not showing up yet as it didn't hit the thresholds :)

@James103
Copy link
Contributor

James103 commented Jun 19, 2021

I notice that the MacOS builds failing also results in the releases for Windows and Linux builds being cancelled for that specific nightly. In other words, currently (as of the time of writing), no-one can test the newest features of OpenTTD without compiling the source code themselves.

This makes this a high-priority if not critical issue from the perspective of players and testers who want the latest features as soon as possible, regardless of the OS they play on.

@TrueBrain
Copy link
Member Author

As experiment I removed ~200 settings from our settings.ini. With this, the game indeed compiles again on MacOS. So I started to minimize how many I had to remove before it started to crash again .. seems like the number is >18 and < 50.

So yeah, that kinda confirms we hit some kind of upper-limit here :D

@James103
Copy link
Contributor

How is JGRPP able to still build on macOS given that JGRPP already adds dozens of new settings on top of the base game settings? Is it that @JGRennison has more memory on his MacOS compiling machines or is there a different trick used by JGRPP which lets the setting count scale indefinitely without things breaking?

@TrueBrain
Copy link
Member Author

How is JGRPP able to still build on macOS given that JGRPP already adds dozens of new settings on top of the base game settings? Is it that @JGRennison has more memory on his MacOS compiling machines or is there a different trick used by JGRPP which lets the setting count scale indefinitely without things breaking?

Nothing can scale indefinitely, so that is not true for sure ;) But I am not saying that the amount of settings is the issue. Just saying that since that commit there are too many variables to track. And that we are now just over the limit clang can handle. And I can proof that by removing ~50 settings, and it works again. This can have been dormant for years, or it can be that the above mentioned commit tripled the amount of variables a compiler checks. I dunno atm :) Reducing the settings is a means to an end, nothing else.

@glx22
Copy link
Contributor

glx22 commented Jun 19, 2021

When I tested some stuff it was very weird, I could build 8dd846b (latest know buildable nightly) using release workflow, but same commit plus a little CI workflow modification (just switching build mode to RelWithDebInfo and disabling other OS) fails.

@JGRennison
Copy link
Contributor

How is JGRPP able to still build on macOS given that JGRPP already adds dozens of new settings on top of the base game settings? Is it that @JGRennison has more memory on his MacOS compiling machines or is there a different trick used by JGRPP which lets the setting count scale indefinitely without things breaking?

At present bf500c3 and many other recent commits from master are not in my branch. I have not done anything specifically on this issue.

@James103
Copy link
Contributor

Would it be possible to temporarily drop support for MacOS (on the front end) until this issue can be solved, so that at least Windows and Linux users can continue to enjoy the newest features via the nightly releases?

@glx22
Copy link
Contributor

glx22 commented Jun 19, 2021

It should be okay to allow failure, and still upload working bundles, for AWS, but I don't know for steam.

@TrueBrain
Copy link
Member Author

It is not wise to skip MacOS just because we broke it; we could always reverse the patch that broke it. But even there, I think we are better off trying to figure out what causes the problem and how to solve it.

I think that fixing the GCC issue also fixes the clang issue, so there are more platforms to test solutions on, as I strongly suspect they both point to the same issue. But I can verify that claim tomorrow with some more triage :) Sadly, the GCC issue only shows up with LTO, which takes a very long time :P So debug-sessions will be slow :)

@TrueBrain
Copy link
Member Author

See #9389, but basically on a hunch I made a change that seems to work. I still have 0 clues what causes this problem, and I can only guess why this change fixes the issue. So any insights would be appreciated, but this might be a stop-gap solution to get the builds going again.

@TrueBrain
Copy link
Member Author

Bit out of my league here, but this I find interesting:

https://godbolt.org/z/1hao5fafT

Mind the delete operator. If you remove it, you will see that at the bottom in the assembly GCC adds a bunch of "delete" and "unwind" statements. These instructions how-ever are never jumped to (mind the jmp just before the block and no jump into it anymore). In other words: it seems to generate dead-code.

When using -O2 it still optimizes that code, and make sit even worse, with a lot of jump-labels and I don't know what more.

So I think in the core of things, this is the problem we run into. The compilers fail to see that this memory will never be destructed normally, and in result start to do all kind of weird shit :P

The godbolt adds a simple delete operator that is delete, which removes the statements. And if I apply that on our code-base, the problem indeed goes away, even without changing name to a std::string_view. Also explains a bit why that works, as with a std::string_view there are no variable to delete, and no delete operator is generated (well, an empty one, but who is counting).

So the way we create the tables seems to be much more the issue than the name field itself.

A few commits before this commit mention above, we changed the settings system to use subclasses to indicate type. This also meant we switched to new ... to fill the array. And this seems to have caused the delete operator to be added.

As I mentioned earlier, this really feels like we kept adding stuff up, and the above mentioned commit was just the tipping point.

Although the solution in godbolt works also in our code-base, it generates a lot of warnings about a virtual dtor .. so not sure if we can apply this to our code-base, but at least I hope this analysis helps someone to say: owh, but we should just do .... :D

TrueBrain added a commit to TrueBrain/OpenTTD that referenced this issue Jun 20, 2021
…nfused

SettingDesc is only used in static lists, so never destructed
anyway (well, on close of course, but that the OS will clean up
for us nicely). And having a dtor, which the compiler sees is
never used btw, causes compilers to start optimizing that, and ..
get lost somewhere.

GCC complaints with:
  note: variable tracking size limit exceeded with -fvar-tracking-assignments, retrying without

clang on MacOS just crashes (with bus error 10, read: out of memory).

This commit produces tons of warnings, as it is not really meant
to be accepted upstream. More to show that the problem of OpenTTD#9386
is not just "std::string" -> "std::string_view", but much deeper
in fact.
rubidium42 added a commit to rubidium42/OpenTTD that referenced this issue Jun 20, 2021
@glx22
Copy link
Contributor

glx22 commented Jun 20, 2021

So we have too much stuff defined in global scope basically. So many things to refactor :)

@glx22
Copy link
Contributor

glx22 commented Jun 21, 2021

I tried something in master...glx22:my_try_at_9386, but that just moves the location of gcc var-tracking-assignments note closer to _settings definition.

rubidium42 added a commit to rubidium42/OpenTTD that referenced this issue Jun 21, 2021
rubidium42 added a commit to rubidium42/OpenTTD that referenced this issue Jun 21, 2021
…ants instead of new + unique_ptr

With std::variant all memory can be figured out at compile time, so the compiler needs to keep track of fewer elements. It also saves out a unique_ptr and its memory management, over a slight impact for resolving a setting.
rubidium42 added a commit to rubidium42/OpenTTD that referenced this issue Jun 22, 2021
…ants instead of new + unique_ptr

With std::variant all memory can be figured out at compile time, so the compiler needs to keep track of fewer elements. It also saves out a unique_ptr and its memory management, over a slight impact for resolving a setting.
rubidium42 added a commit to rubidium42/OpenTTD that referenced this issue Jun 22, 2021
…ants instead of new + unique_ptr

With std::variant all memory can be figured out at compile time, so the compiler needs to keep track of fewer elements. It also saves out a unique_ptr and its memory management, over a slight impact for resolving a setting.
TrueBrain pushed a commit that referenced this issue Jun 26, 2021
…stead of new + unique_ptr

With std::variant all memory can be figured out at compile time, so the compiler needs to keep track of fewer elements. It also saves out a unique_ptr and its memory management, over a slight impact for resolving a setting.
@TrueBrain
Copy link
Member Author

TrueBrain commented Jul 5, 2021

And ... this problem is back. Exact same MO, GCC complains, MacOS clang crashes.

Introduced by cdb3dd0 .. and the solutions of last time are not really going to work. The handler in the SaveLoad object needs to be a shared_ptr for the temporary std::vector<SaveLoad> we create. So we once again get a lot of (unused) dtors for the settings tables .. and it seems that once again is what pissed off compilers.

I am pretty sure I am missing something here, and that the solution is far easier, but it is eluding me atm. Either way .. IT IS BAAACCCKKKKKKK :D

PS: I expected the commit before this one to be the issue, but it seemly is not. So my conclusion it is the shared_ptr might be wrong.

@TrueBrain TrueBrain reopened this Jul 5, 2021
@glx22
Copy link
Contributor

glx22 commented Jul 5, 2021

I did a quick (well not so quick as building release is so slow) test, defining a const SaveLoadCompat testslc and using this object in SLC_VAR and SLC_NULL macros (not tried to run it as compat table are "invalid").
GCC stops complaining, so I think the issue is related to SaveLoadCompat tables.
That would also explain why it's introduced by cdb3dd0 and not the commit before.

@TrueBrain
Copy link
Member Author

Sadly .. although we fixed the GCC warning, clang on MacOS is still crashing with the same error. I do not really know how to debug that further ..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment