ghc: enable parallel building of GHC on aarch64. #48446

Merged
merged 1 commit into from Oct 15, 2018

dhess
Contributor

@dhess dhess commented Oct 15, 2018

Note: this only affects GHC, not haskellPackages, which still suffers
from https://ghc.haskell.org/trac/ghc/ticket/15449.

Motivation for this change

@ElvishJerricco pointed out on #47901 that GHC's own build system (mostly) doesn't use -j. (This is not quite true as it does seem to use it for building some phases, but my test build made it through successfully, anyway, so perhaps these phases don't trigger the GHC bug mentioned above in the Trac issue.)

Therefore, we can re-enable parallel builds of GHC on aarch64 and produce a working ghc843 in just a few hours on the Packet.net hardware. This will allow Hydra to start compiling haskellPackages and we can deal with aarch64 porting issues as they arise. In attempting to build my own Haskell packages with this compiler, many important packages (e.g., lens) build just fine, but tls's tests are currently broken on this arch.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Fits CONTRIBUTING.md.

@dhess
Contributor Author

dhess commented Oct 15, 2018

Paging @samueldr

Member

@samueldr samueldr left a comment

This partially reverts the changes that were inconsequential on non-AArch64 platforms. So this is good in that this won't (shouldn't) cause any change there, thus a net zero sum on non-AArch64. Hopefully this'll help the Haskellers interested in AArch64 fix things for AArch64 :).

@samueldr samueldr merged commit b1cb653 into NixOS:master Oct 15, 2018
@ElvishJerricco
Contributor

@dhess I wonder if there's a way to control the number passed to -j in the build process, without affecting the number passed to make. It'd be great if we could make all the ghc --make calls avoid -j categorically.

@ElvishJerricco
Contributor

But overall I'm happy about this change :)

@TravisWhitaker
Contributor

TravisWhitaker commented Oct 29, 2018

The GHC build system uses -j with --make all over the place. Reverting this change breaks the GHC build on my aarch64 boards. As discussed on the ticket, this doesn't afflict all microarchitectures, but if you are affected, whether or not the build succeeds is a dice roll.

I would much rather have a build that works every time, rather than a build that fails on my machines but works on Hydra. Maybe I'm the only one unlucky enough to be using the afflicted hardware?

@ElvishJerricco
Contributor

@TravisWhitaker I thought -j and --make are only used in building ghc-cabal and running the test suite. Where else is it used? We should probably disable the test suite though, unless we can figure out how to limit the -j for just that.

@dhess
Contributor Author

dhess commented Oct 29, 2018

@TravisWhitaker The entire community benefits from having successful Hydra builds of GHC on aarch64, so I think that's the right tradeoff, even if it causes an inconvenience for you, personally. Though it's a pain, you can override ghc843 for your own builds with an overlay.
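
A minimal sketch of such an overlay, assuming the compiler is exposed as `haskell.compiler.ghc843` and that switching `enableParallelBuilding` off through `overrideAttrs` is enough to get a serial build. The file name is hypothetical.

```nix
# ~/.config/nixpkgs/overlays/ghc843-serial.nix (hypothetical file name)
# Sketch only: rebuilds ghc843 without parallel make on affected hardware.
self: super: {
  haskell = super.haskell // {
    compiler = super.haskell.compiler // {
      ghc843 = super.haskell.compiler.ghc843.overrideAttrs (old: {
        # Assumption: the GHC derivation reads this attribute directly.
        enableParallelBuilding = false;
      });
    };
  };
}
```

Because the override changes the derivation hash, the compiler and everything built with it are compiled locally rather than substituted from the official binary cache.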

However, the point is moot (for now) as it appears that even Hydra is unable to build ghc843 reliably on aarch64 since the initial successful build:

https://hydra.nixos.org/build/83144215

It's timing out after 10 hours, which is interesting, because before this change, when ghc843 was being built with enableParallelBuilding disabled, it always took > 10 hours and therefore Hydra killed it mid-build. (Last I checked, the NixOS Hydra is configured to kill builds after 10 hours.)

The first successful build (https://hydra.nixos.org/build/82810083), just after this commit was made, took 8h23m, so it made the cut. But it might be the case that we just got lucky with that build, and that subsequent builds with enableParallelBuilding enabled have triggered the bug, causing the build to hang and Hydra to terminate it after 10 hours.

It's also possible that the builds were going fine but just took longer than the first successful build and didn't make the 10 hour cutoff.
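
If that second explanation is the right one, one possible mitigation is to raise the per-job time limit. The sketch below assumes that Hydra honours the `meta.timeout` attribute (in seconds) and that 24 hours is an acceptable ceiling; the value is illustrative and does not come from this thread.

```nix
# Sketch only: raise the Hydra time limit for ghc843.
# Assumption: Hydra reads meta.timeout; 24 hours is an illustrative value.
self: super: {
  haskell = super.haskell // {
    compiler = super.haskell.compiler // {
      ghc843 = super.haskell.compiler.ghc843.overrideAttrs (old: {
        meta = (old.meta or {}) // {
          timeout = 24 * 3600; # seconds
        };
      });
    };
  };
}
```

For the official builds the attribute would have to be set in the GHC expression in Nixpkgs itself, since an overlay only affects the evaluation it is loaded into.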

So, either way, it appears that this change was not successful in its goal of producing consistent GHC Hydra builds on aarch64.

I will say, though, that on my own personal Hydra, I'm doing a daily build of 3 of my own Haskell packages using ghc843 from a daily Nixpkgs master snapshot, using the NixOS community aarch64 builder. These daily builds are working fine, so I'm not sure why they're succeeding when the official NixOS ones are not. I'll look into it.

@TravisWhitaker
Contributor

> The entire community benefits from having successful Hydra builds of GHC on aarch64, so I think that's the right tradeoff, even if it causes an inconvenience for you, personally.

With enableParallelBuilding = true, this derivation is nondeterministic in a critically bad way. If a derivation needed to be made nondeterministic to enable Hydra builds in a more benign way (i.e. something less than memory corruption), I'd certainly agree. However, unless these GHC binaries are passing validation (IIRC full validation isn't run, but perhaps it is) or we know for sure that the microarchitecture Hydra is using to build these is unaffected, I wouldn't trust the binaries in these packages.

I should've qualified "my" more carefully; my organization has its own Nix infrastructure and GHC tweaks, so we aren't using these binaries anyway. In general, I think a person with afflicted hardware would rather wait for a working binary to build, rather than download one from Hydra that might be arbitrarily broken due to this GHC bug.

> It's timing out after 10 hours, which is interesting, because before this change, when ghc843 was being built with enableParallelBuilding disabled, it always took > 10 hours and therefore Hydra killed it mid-build.

As I wrote in the GHC bug report, a common failure mode is a threaded Haskell program hanging forever, waiting on a futex. Perhaps this is the culprit?

> I will say, though, that on my own personal Hydra, I'm doing a daily build of 3 of my own Haskell packages using ghc843 from a daily Nixpkgs master snapshot, using the NixOS community aarch64 builder.

You might be running on unafflicted hardware (which seems to be any microarch without dynamic execution) or you've flipped three heads in a row.

> I thought -j and --make are only used in building ghc-cabal and running the test suite. Where else is it used? We should probably disable the test suite though, unless we can figure out how to limit the -j for just that.

IIRC it's also used when building the boot packages, but I could be wrong.

@ElvishJerricco
Contributor

ElvishJerricco commented Oct 30, 2018

@dhess Wait... The log of that failed build ends the same way as a successful build... Is it possible something irrelevant is hanging for hours? It is not exhibiting the panic caused by -j. There is no panic message in that log. @TravisWhitaker so I don't think this is due to enableParallelBuilding (again, -j should only be used in building ghc-cabal, and maybe some test suites).

EDIT:

> As I wrote in the GHC bug report, a common failure mode is a threaded Haskell program hanging forever, waiting on a futex. Perhaps this is the culprit?

It can't be because then we wouldn't be getting to the installPhase, let alone the fixupPhase.

@ElvishJerricco
Contributor

@dhess I wonder if we could get Hydra to build it with #45744 so we can see how long each phase is actually taking.

@dhess
Contributor Author

dhess commented Oct 30, 2018

@ElvishJerricco Does it always panic? I was under the impression that it sometimes just hangs indefinitely. tmobile on the GHC Trac, who originally filed the issue (https://ghc.haskell.org/trac/ghc/ticket/15449), lists one of the manifestations as, "Compiler process sleeps indefinitely."

(edit: Oh, perhaps @TravisWhitaker is tmobile ;)

@ElvishJerricco
Contributor

@dhess Ah right. But it still wouldn't get to the installPhase or fixupPhase in that case. It'd hang in buildPhase or checkPhase or something.

@TravisWhitaker
Contributor

> It can't be because then we wouldn't be getting to the installPhase, let alone the fixupPhase.

Ah, agreed. It seems something else is going on...

@ElvishJerricco
Contributor

@dhess Do the builders have auto-optimise-store turned on? I figure for an output as large as GHC, the post-build optimization of the path could potentially take a while on a slow core.
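
For reference, on a NixOS builder of that era this setting would typically be switched on with a module option like the one below. This is a sketch, not the actual builder configuration, which is not shown in this thread. Newer NixOS releases spell the option `nix.settings.auto-optimise-store`, and on non-NixOS hosts the equivalent is `auto-optimise-store = true` in nix.conf.

```nix
# configuration.nix fragment (sketch; not the real builder config)
{ config, pkgs, ... }:
{
  # Hard-link identical files in the store after each build.
  # Option name as of NixOS 18.09.
  nix.autoOptimiseStore = true;
}
```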

@matthewbauer
Member

Does 8.4.4 support cross compiling from aarch64 to another architecture? I am getting some weird errors from cross-trunk:

https://hydra.nixos.org/build/83688270/nixlog/1/tail

ghc-stage1: panic! (the 'impossible' happened)
  (GHC version 8.4.4 for aarch64-unknown-linux):
        padLiveArgs -- i > regNum ??
CallStack (from HasCallStack):
  error, called at compiler/llvmGen/LlvmCodeGen/Base.hs:194:27 in ghc:LlvmCodeGen.Base

@matthewbauer
Member

Actually this doesn't seem to be specific to aarch64, so probably not having to do with this PR.

@TravisWhitaker
Contributor

Indeed, GHC 8.4.4 is totally broken on ARM.

https://ghc.haskell.org/trac/ghc/ticket/15780

@dhess
Contributor Author

dhess commented Nov 5, 2018

I should have more time to look into this in a couple of weeks, but I'm completely slammed at the moment. Meanwhile, as recently as master from 4 days ago, ghc843 continues to work great on the NixOS aarch64 community builder, somehow.
