[WIP] HPC improvements #79771

Closed
wants to merge 13 commits into from

Conversation

ikervagyok (Contributor)

Motivation for this change

I want to convince my colleagues that NixOS/NixOps is suitable for a test-cluster setup consisting of a few AArch64 nodes. LAMMPS is the package I chose to give to my test users, which is why I converted it to CMake and added a few extra LAMMPS packages.

Things done

Updated the existing packages and added libfabric, pmix and ucx. Tested with 2 nodes; I will go up to 4 tomorrow and maybe add some x86 nodes in the next few days.

@markuskowa, since you seem to be at the center of NixOS HPC work: what do you think about these changes? What should be made optional or reworked? I know it's not a polished PR, but I think it can be made into something usable.

  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@costrouc (Member)

Very happy to see that LAMMPS finally has a standard build procedure with CMake!

@markuskowa (Member) left a comment

Hi @ikervagyok. Thank you for the contribution. It's nice to see that there is more and more HPC activity with Nix. As a general comment: this is a big PR that addresses several different issues.
It may be better to split it up into several PRs (e.g. lammps, openmpi, slurm). It would also simplify the review if you could briefly describe how you have tested the additional features.
In my experience it is one thing to get things to compile, but a completely separate thing to make them work; there is usually some fine-tuning involved. I test-drive new packages in my own overlay before I upstream them (feel free to copy from the overlay whatever you need).

@@ -355,6 +355,7 @@ in
ExecStart = "${wrappedSlurm}/bin/slurmd";
PIDFile = "/run/slurmd.pid";
ExecReload = "${pkgs.coreutils}/bin/kill -HUP $MAINPID";
LimitMEMLOCK = "infinity";
Member

Why is this necessary?

Contributor Author

This fixed the issue of Slurm not being able to use mpirun, even though it worked in a shell. I'll try again to see if I've messed up my testing.
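For reference, a minimal sketch (not part of this PR) of how the same limit could first be tried from a local NixOS configuration, to confirm whether the memlock limit is really what unblocks mpirun. It assumes the slurmd unit defined by the existing services.slurm module.

# Hypothetical local override for testing only; raises the locked-memory
# limit for the slurmd unit without patching the nixpkgs module.
{
  systemd.services.slurmd.serviceConfig.LimitMEMLOCK = "infinity";
}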

pkgs/development/libraries/openmpi/default.nix: two outdated review threads (resolved)
configureFlags = with stdenv; [
"--with-cma"
"--with-hwloc=${hwloc.dev}"
"--with-ofi=${libfabric}"
Member

What benefit does libfabric bring (I am not familiar with libfabric)?

Contributor Author

I think OmniPath needs libfabric. I can't try it, though, since I have no OmniPath machine in my test setup.

Member

I would prefer to remove it then.

Contributor Author

As a dependency, or the whole package?

Member

I would leave the whole package out for now, since libfabric is not even built with OmniPath support in the present version. We can add it later, when it becomes relevant.
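If the flag ever comes back, a sketch of how it could be gated behind a package argument so OmniPath users can opt in; fabricSupport is a hypothetical argument, not something present in this PR, and the other flags are omitted for brevity.

# Hypothetical opt-in, assuming an argument `fabricSupport ? false`
# is added to the openmpi derivation.
configureFlags = with stdenv; [
  "--with-cma"
  "--with-hwloc=${hwloc.dev}"
] ++ lib.optional fabricSupport "--with-ofi=${libfabric}";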

"--with-cma"
"--with-hwloc=${hwloc.dev}"
"--with-ofi=${libfabric}"
"--with-pmi=${pmix}"
Member

I have tried to build with PMIx in the past but could not get it to work properly with Slurm. Did you check whether you can run Open MPI binaries with srun instead of mpirun?
In general I would say PMIx is not strictly necessary, as Open MPI already works nicely with Slurm and detects the resource allocations properly. From how I understand PMIx, it is designed for large MPI jobs (>100 nodes).

Contributor Author

You're right. I thought it worked, but I probably mistakenly tested mpirun multiple times. I will look into this.

"--with-pmi=${pmix}"
"--with-pmix=${pmix} --enable-install-libpmix"
"--with-verbs=${rdma-core}"
"--with-ucx=${ucx}"
Member

We (@mekeetsa) are testing UCX with InfiniBand at the moment (see also https://github.com/markuskowa/NixOS-QChem/blob/ucx/ucx/default.nix). Did you test it with an InfiniBand configuration?

Contributor Author

Yes, I've tested it and it seems to work (setting up the environment with OMPI_MCA_btl=^openib,uct,tcp UCX_TLS=ib,sm,self).
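For reference, a sketch (not part of the PR) of how those settings could be pinned on every node through the standard environment.variables option in a NixOS configuration.

# Hypothetical cluster-wide defaults matching the settings above.
{
  environment.variables = {
    OMPI_MCA_btl = "^openib,uct,tcp";  # exclude these Open MPI BTL components
    UCX_TLS = "ib,sm,self";            # restrict UCX to these transports
  };
}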

pkgs/development/libraries/openmpi/default.nix: two more outdated review threads (resolved)
pkgs/servers/computing/slurm/default.nix: one outdated review thread (resolved)
@ofborg ofborg bot requested a review from markuskowa February 11, 2020 14:02
@@ -29,6 +30,12 @@ in stdenv.mkDerivation rec {
sha256 = "0ms0zvyxyy3pnx9qwib6zaljyp2b3ixny64xvq3czv3jpr8zf2wh";
};

# TODO: Remove these, when a new release gets out
patches = [
(fetchpatch { name = "ucx-1.8-compat-1"; url = "https://github.com/open-mpi/ompi/commit/526775dfd7ad75c308532784de4fb3ffed25458f.patch"; sha256 = "1zfr7ym4303dsnnjqfc1qbnfs0izd8bazxfl57p3dhqsiznd734r"; })
Member

I am still confused why there are patches for UCX 1.8 when we are building against UCX 1.7.

Contributor Author

Take a look at the two commit messages:

open-mpi/ompi@526775d
open-mpi/ompi@a3026c0

The first one adds the support and the second relaxes the version bound from version > 1.7 to version >= 1.7. They are named that way to make it clear that they can be removed together.
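One way to make that pairing obvious in the expression itself is a shared comment over the list. The sketch below only reuses the patch already shown in the diff and describes the second one in a comment instead of guessing its hash.

# TODO: remove both UCX compatibility patches together once a release
# containing open-mpi/ompi@526775d and open-mpi/ompi@a3026c0 is out.
patches = [
  # ucx-1.8-compat-1: adds the UCX compatibility code
  (fetchpatch {
    name = "ucx-1.8-compat-1";
    url = "https://github.com/open-mpi/ompi/commit/526775dfd7ad75c308532784de4fb3ffed25458f.patch";
    sha256 = "1zfr7ym4303dsnnjqfc1qbnfs0izd8bazxfl57p3dhqsiznd734r";
  })
  # ucx-1.8-compat-2 (open-mpi/ompi@a3026c0): relaxes the version bound
  # from > 1.7 to >= 1.7; fetched the same way with its own sha256.
];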

Member

The way I understand the patches, the first one fixes a problem with ucx/master (correct me if I am wrong). Since we are not building against UCX 1.8 (or current master), that patch is not needed right now. We should remove it for now.

@markuskowa (Member)

Can we close this PR? Most of the improvements have now been addressed in other PRs.

@ikervagyok closed this on Jul 4, 2020.