ceph: 14.2.10 -> 15.2.4 #92791

Merged
merged 3 commits into from Jul 14, 2020

Contributor

johanot commented Jul 9, 2020

Motivation for this change

Bump ceph to newest major release.

Singlenode and Multinode tests pass locally.

I also performed a full upgrade from Nautilus to Octopus on our test cluster.
Everything works as expected.

Regarding the SPDK-build-env patch: I re-aligned the patch, as it didn't apply correctly. However, I hope someone can help me verify at some point whether this patch is even needed anymore. I don't know what it does in detail, other than enabling some machine-architecture-specific features.

Regarding scipy: ceph-mgr hangs during startup due to incompatibility with newer scipy versions, thus the version downgrade. Thanks @srhb.

Ceph has changed a lot of config defaults and it is recommended to read the release notes before upgrading:
https://docs.ceph.com/docs/master/releases/octopus/#v15-2-4-octopus

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

pkgs/tools/filesystems/ceph/default.nix (Outdated)
Contributor

flokli commented Jul 11, 2020

Can this be rebased on latest master?

Johan Thomsen added 2 commits July 11, 2020 14:12
…nabled by default

- the pg_autoscaler will force new empty pools down to 32 pgs
- device monitoring metrics consumes 1 pool with 1 pg
ps.pecan
ps.prettytable
ps.pyjwt
ps.webob
ps.bcrypt
ps.scipy_1_3

Contributor

From the PR description:

ceph-mgr hangs during startup due to incompatibility with newer scipy versions, thus the version downgrade. Thanks @srhb.

Please put a comment on this here. What is known about this bug? Is there an upstream issue?

I tried to upgrade our Ceph (packaged via a separate derivation), noticed the ceph-mgr hang, but did not figure out how to fix it. How did you find out @srhb?

Contributor Author

Upstream issues are here: https://tracker.ceph.com/issues/42764 and here: https://tracker.ceph.com/issues/45147. AFAIK, the current official upstream solution is to disable the diskprediction_local module if Python is >= 3.8 (https://github.com/ceph/ceph/pull/34846/files) instead of trying to fix the actual problem, since Python 3.8 is not officially supported. However, @srhb figured out that downgrading scipy made diskprediction_local runnable. Better wait for her to give an explanation on that part :)

Contributor

@nh2

I can add a comment with the links on Monday.

I figured it out with strace on the mgr threads and noticing that one of them was stuck deep in scipy land. From there, I found the above issues, which upstream doesn't seem terribly keen on fixing, because RHEL/Ubuntu don't use a new enough python/scipy to even experience the problem yet.

As such, I consider the scipy downgrade our best path, because local disk prediction is a potentially nice feature to have, and hopefully upstream will become compatible in time so that we can drop the old scipy. I'll comment to that effect in the pr. 😄

Contributor

er, not strace, gdb*

Contributor

srhb commented Jul 11, 2020

Oh, by the way, they currently don't intend to backport the diskprediction_local disable "hotfix", for the same reasons (RHEL/Ubuntu are unaffected). The alternative solution would be to patch it in from ceph master, which also works, but I don't think that's as nice.

I also expect they will eventually change stance on the backport once more non-RHEL/Ubuntu ceph users upgrade to octopus.

Contributor

We could also just flip to python 3.7 here instead of 3.8, couldn't we? It's probably less maintenance overhead than manually verifying compatibility with each python module.

Contributor

I'm not sure I actually tested that, but I can certainly try and see if it works. 👍

Contributor

No go with pythonPackages37 -- and with pythonPackages36 I run into dependency hell with a load of other deps. I'm inclined to keep this solution. I've added the issue links inline, PTAL. 😄

Contributor

Fair enough.

Contributor

Thanks! 👍
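
For readers skimming this thread, the outcome discussed above (keep Python 3.8, pin the older scipy, and record the upstream issues next to the pin) might look roughly like the sketch below. This is an illustration, not the code that was merged: the name cephPythonEnv and the use of python3.withPackages are assumptions, and ps.scipy_1_3 is taken from the package list shown in the review snippet, so it only evaluates on a nixpkgs revision that still provides that attribute.

```nix
# Sketch only. `cephPythonEnv` is a hypothetical name, and using
# `python3.withPackages` here is an assumption about how the
# environment is assembled; the merged derivation may differ.
{ python3 }:

let
  cephPythonEnv = python3.withPackages (ps: [
    ps.pecan
    ps.prettytable
    ps.pyjwt
    ps.webob
    ps.bcrypt
    # ceph-mgr hangs during startup with newer scipy versions
    # (diskprediction_local module), hence the pinned version:
    #   https://tracker.ceph.com/issues/42764
    #   https://tracker.ceph.com/issues/45147
    ps.scipy_1_3
  ]);
in
cephPythonEnv
```

Keeping the issue links directly above the pinned attribute means whoever later tries to drop the old scipy can check the upstream status first.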

Contributor Author

johanot commented Jul 11, 2020

@flokli rebase done. I'm currently re-running the tests locally.

Contributor

nh2 commented Jul 11, 2020

I also want to point out (in case it's useful to others) that both Ceph 14 and Ceph 15 have a race condition I found, which is why I'm still on Ceph 13:

https://tracker.ceph.com/issues/46124

Contributor

flokli commented Jul 14, 2020

I ran the single and multi node tests successfully. This PR looks good to me 👍

Regarding dropping 0000-fix-SPDK-build-env.patch:

This only seems to enable C_OPT="-mssse3" somewhere in the SPDK build. I wonder if this isn't already done somewhere else in the build system, depending on the architecture. And it should really be architecture-specific: -mssse3 probably doesn't exist on aarch64, for example.

To me it's even unclear if we enable SPDK at all, at least I couldn't spot it in the build log.

@krav, maybe you can provide some insight?
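
If the flag turns out to still be needed, one way to make it architecture-specific in Nix could be along these lines. This is only a sketch under assumptions: spdkCOpt is a hypothetical attribute name, and how (or whether) the value reaches the SPDK build is not established anywhere in this thread.

```nix
# Sketch only. `spdkCOpt` is a hypothetical name; wiring it into the
# SPDK build (e.g. as C_OPT) is an assumption, not something the
# thread confirms.
{ lib, stdenv }:

{
  # -mssse3 is an x86 compiler option, so emit it only when the
  # target platform is x86_64 and use an empty string elsewhere
  # (aarch64, for example).
  spdkCOpt = lib.optionalString stdenv.hostPlatform.isx86_64 "-mssse3";
}
```

lib.optionalString returns the string when the condition holds and "" otherwise, so other architectures simply get no extra flag.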

Contributor

nh2 left a comment

Approving because my request for comments was fulfilled (I did not run tests).

@flokli flokli merged commit b6c53e3 into NixOS:master Jul 14, 2020