Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvidia module: set LD_LIBRARY_PATH #67893

Closed
wants to merge 2 commits into from

Conversation

abbradar
Copy link
Member

@abbradar abbradar commented Sep 1, 2019

It's needed for CUDA driver, which applications are supposed to dlopen() and
which is a driver version-dependent.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

Tested on Tensorflow which works now. I'll merge in several days unless there are better ideas.

It's needed for CUDA driver, which applications are supposed to dlopen() and
which is a driver version-dependent.
@worldofpeace
Copy link
Contributor

@abbradar See #67791 and #67791 (comment)

@abbradar
Copy link
Member Author

abbradar commented Sep 2, 2019 via email

@ivan
Copy link
Member

ivan commented Sep 2, 2019

@abbradar things in nixpkgs have been using addOpenGLRunpath to patch the runpath on executables and shared libraries, e.g. I submitted a PR for fix CUDA in mpv: #67790

@abbradar
Copy link
Member Author

abbradar commented Sep 2, 2019

@ivan This is similar to OpenGL drivers before libglvnd became a thing, we used LD_LIBRARY_PATH instead of patching all OpenGL software.

I'm not sure if we want to start using addOpenGLRunpath now but that's a (different) architectural decision. Personally I think LD_LIBRARY_PATH is less error-prone as it's very simple to accidentally link to nvidia_x11 instead of using addOpenGLRunpath in a package, and this is wrong and breaks when packages come from different version of nixpkgs.

@ivan
Copy link
Member

ivan commented Sep 2, 2019

This was done in 370d3af for #11434, but I don't fully understand the rationale.

"Currently Mesa and NVidia drivers don't set setLdLibraryPath because they work with libglvnd and do not override libraries" - I believe this is referring to GL calls going through libglvnd and thus not needing LD_LIBRARY_PATH for just GL functionality.

I believe #11434 (comment) references some specific problem that this change solved ("any program that links to one of these libraries will load the one from the AMD drivers. This can result in horrible breakage") but I don't know the details, maybe @ambrop72 can clarify?

@abbradar
Copy link
Member Author

abbradar commented Sep 2, 2019

The idea is that now for dispatch libraries we either:

  • use vendor configuration files which specify full paths to libraries;
  • use setOpenGLRunpath for edge cases (no configuration file, gets library name via GLX etc.).

That means that for vast majority of cases LD_LIBRARY_PATH is not needed anymore because we localized all library search to those dispatch libraries. The problems, however, start when there's no dispatch library for something (i.e. CUDA) - cudatoolkit is only one point of entry into it. IMO for these cases we should fall back to the old approach, which is LD_LIBRARY_PATH - the world of library end-users is potentially immense and checking and patching all these packages would be much bigger task than fixing just the dispatchers.

Regarding the comment - LD_LIBRARY_PATH can be harmful when the driver itself is broken (in this example API drivers) and provides replacements for system libraries. This is out of scope - NVIDIA drivers are clear in that regard.

@abbradar
Copy link
Member Author

abbradar commented Sep 2, 2019

@ambrop72 I see now that you already patched some packages to find CUDA, what do you think of my suggestion instead?

My argument is that it's less error-prone that way (especially if application links to libcuda directly, in which case it's very difficult to notice that it's broken), makes CUDA applications easier to package and that our old policy was to avoid patching all the world, opting for LD_LIBRARY_PATH when we cannot fix this by controlling only dispatcher libraries.

@ambrop72
Copy link
Contributor

ambrop72 commented Sep 2, 2019

I believe #11434 (comment) references some specific problem that this change solved ("any program that links to one of these libraries will load the one from the AMD drivers. This can result in horrible breakage") but I don't know the details, maybe @ambrop72 can clarify?

Whenever a driver ships with their own version of some library, e.g. llvm, which uses the same sonames as system libraries, if an application both links to a nixpkgs driver and uses the driver (possibly very indirectly), only one of the libraries will end up used, one thing will inevitably use the wrong version. This cannot be solved due to the design of Linux dynamic linking. However, if those driver-specific libraries end up in LD_LIBRARY_PATH, then every program that links to such a conflicting library is affected, not just programs that actually use the driver.

I understand the arguments about breaking things, but I think the simple solution of sticking with LD_LIBRARY_PATH is just too invasive, as it has potential to break every program, including those that do not use, or affect startup time. Therefore I still think this (or #67791) should go to 19.09 only and we fix the breakages on master. By the way could we have a meta bug to keep track of such breakages? I can assist fixing them.

@ambrop72
Copy link
Contributor

ambrop72 commented Sep 2, 2019

Also, if we keep the LD_LIBRARY_PATH for 19.09 and backport addOpenGLRunpath fixes to 19.09, then the "older applications not working without LD_LIBRARY_PATH" problem will not be after the upgrade to the next release, where LD_LIBRARY_PATH finally disappears, assuming one is not using applications older than 19.09.

@ambrop72
Copy link
Contributor

ambrop72 commented Sep 2, 2019

Regarding the comment - LD_LIBRARY_PATH can be harmful when the driver itself is broken (in this example API drivers) and provides replacements for system libraries. This is out of scope - NVIDIA drivers are clear in that regard.

Sorry I missed this, yes the NVidia drivers seem fine. But I still wouldn't want to undo the work that went into getting rid of LD_LIBRARY_PATH because of one case (CUDA) and which only affects a fraction of people.

@abbradar
Copy link
Member Author

abbradar commented Sep 2, 2019 via email

@ambrop72
Copy link
Contributor

ambrop72 commented Sep 2, 2019

I think the sledgehammer approach is inconsistent with Nix/NixOS philosophy which is focused on purity. Of course we can't be pure here, so the current solution for such problems is to fix individual packages to look in impure places (/run/opengl-runpath). But doing something that affects everything, whether it needs to or not, is the complete opposite of the ideal.

By the way I filed a PR to fix Houdini (#67958). Generally I will be glad to fix other programs that I am told about.

@ambrop72
Copy link
Contributor

ambrop72 commented Sep 2, 2019

My problem is specifically with CUDA: addOpenGLRunpath is a burden to use for all CUDA applications

Please note that for big applications, setting LD_LIBRARY_PATH in a wrapper (which in many cases already exists) is a simpler approach than patchelf.

@ambrop72
Copy link
Contributor

ambrop72 commented Sep 2, 2019

I suggest that this is merged only to 19.09 and without reverting the CUDA fixes. This would fix the immediate 19.09 issues and leave the most freedom to decide how to proceed - by still having the fixes in 19.09, if the LD_LIBRARY_PATH is not reintroduced in master and therefore is not present in the next release after 19.09, applications in user environments will not break after an upgrade doe to disappearance of LD_LIBRARY_PATH.

@abbradar
Copy link
Member Author

abbradar commented Sep 2, 2019

I'm still not sure we should do this by patching packages but let's try and see what problems arise. The package I was trying to fix now was tensorflow, also tensorflow-bin has invalid RUNPATH (nvidia-x11) and all other sorts of ML packages are broken now - we need to fix them.

I don't think we should merge this for 19.09, rather either merge to post-19.09 unstable or reject. For 19.09 we'd rather leave LD_LIBRARY_PATH always enabled for the time being, there is #67791 for that.

@ambrop72
Copy link
Contributor

ambrop72 commented Sep 2, 2019

I don't think we should merge this for 19.09, rather either merge to post-19.09 unstable or reject. For 19.09 we'd rather leave LD_LIBRARY_PATH always enabled for the time being, there is #67791 for that.

Yeah that's better because it prevents breakage of packages from user environments after upgrade to 19.09 in cases of using open-source drivers.

@abbradar
Copy link
Member Author

abbradar commented Oct 2, 2019

I fixed this bug for tensorflow at least. Let's close this for now.

@abbradar abbradar closed this Oct 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants