ceph: 14.2.10 -> 15.2.4 #92791
Can this be rebased on latest master?
…nabled by default - the pg_autoscaler will force new empty pools down to 32 pgs - device monitoring metrics consumes 1 pool with 1 pg
ps.pecan
ps.prettytable
ps.pyjwt
ps.webob
ps.bcrypt
ps.scipy_1_3
From the PR description:
ceph-mgr hangs during startup due to incompatibility with newer scipy versions, thus the version downgrade. Thanks @srhb.
Please put a comment on this here. What is known about this bug? Is there an upstream issue?
I tried to upgrade our Ceph (packaged via a separate derivation) and noticed the ceph-mgr hang, but did not figure out how to fix it. How did you find out, @srhb?
Upstream issues are here: https://tracker.ceph.com/issues/42764 and here: https://tracker.ceph.com/issues/45147. AFAIK, the current official upstream solution is to disable the diskprediction_local module if python is >= 3.8 instead of trying to fix the actual problem (https://github.com/ceph/ceph/pull/34846/files), since Python 3.8 is not officially supported. However, @srhb figured that downgrading scipy made diskprediction_local runnable. Better wait for her to give an explanation on that part :)
I can add a comment with the links on Monday.
I figured it out by running strace on the mgr threads and noticing that one of them was stuck deep in scipy land. From there, I found the above issues, which upstream doesn't seem terribly keen on fixing, because RHEL/Ubuntu don't use a new enough python/scipy to even experience the problem yet.
As such, I consider the scipy downgrade our best path, because local disk prediction is a potentially nice feature to have, and hopefully upstream will become compatible in time so that we can drop the old scipy. I'll comment to that effect in the PR. 😄
er, not strace, gdb*
Oh, by the way, they currently don't intend to backport the diskprediction_local disable "hotfix", for the same reasons (RHEL/Ubuntu are unaffected). The alternative solution would be to patch it in from ceph master, which also works, but I don't think that's as nice.
I also expect they will eventually change their stance on the backport once more non-RHEL/Ubuntu ceph users upgrade to Octopus.
We could also just flip to python 3.7 here instead of 3.8, couldn't we? It's probably less maintenance overhead than manually verifying compatibility with each python module.
I'm not sure I actually tested that, but I can certainly try and see if it works. 👍
No go with pythonPackages37, and with pythonPackages36 I run into dependency hell with a load of other deps. I'm inclined to keep this solution. I've added the issue links inline, PTAL. 😄
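For readers who want to picture the inline comment, here is a minimal sketch of how the pinned scipy could be annotated in the package list. It assumes the list is passed to `python3.withPackages`; the binding name `cephPythonEnv` is hypothetical, and the exact wording and layout in the PR may differ.

```nix
# Sketch only: `cephPythonEnv` is a hypothetical name, not taken from the PR.
{ python3 }:

let
  cephPythonEnv = python3.withPackages (ps: [
    ps.pecan
    ps.prettytable
    ps.pyjwt
    ps.webob
    ps.bcrypt
    # ceph-mgr hangs during startup with newer scipy versions, see
    # https://tracker.ceph.com/issues/42764 and
    # https://tracker.ceph.com/issues/45147
    # Drop the pin once upstream is compatible with newer scipy.
    ps.scipy_1_3
  ]);
in
cephPythonEnv
```

Keeping the tracker links next to the pinned attribute means whoever later removes `ps.scipy_1_3` can check the upstream status first.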
Fair enough.
Thanks! 👍
@flokli rebase done. I'm currently re-running the tests locally.
I also want to point out (in case it's useful to others) that both Ceph 14 and Ceph 15 have a race condition I found, which is why I'm still on Ceph 13:
I ran the single and multi node tests successfully. This PR looks good to me 👍
Regarding dropping This only seems to enable To me it's even unclear if we enable SPDK at all, at least I couldn't spot it in the build log. @krav, maybe you can provide some insight?
Approving because my request for comments was fulfilled (I did not run tests).
Motivation for this change
Bump ceph to newest major release.
Singlenode and Multinode tests pass locally.
I also performed a full upgrade from Nautilus to Octopus on our test cluster.
Everything works as expected.
Regarding the SPDK-build-env patch: I re-aligned the patch, as it didn't apply correctly. However, I hope someone can help me verify at some point whether this patch is even needed anymore. I don't know what it does in detail, other than enabling some machine-arch-specific features.
Regarding scipy: ceph-mgr hangs during startup due to incompatibility with newer scipy versions, thus the version downgrade. Thanks @srhb.
Ceph has changed a lot of config defaults and it is recommended to read the release notes before upgrading:
https://docs.ceph.com/docs/master/releases/octopus/#v15-2-4-octopus