jobset_id, #710 Part 1 #711
Conversation
I had to move all the migrations down by 1 to account for a migration on the flakes branch. These three migrations are pretty fast: 0s, 2m20s, and 2s.
Starting the backfill and going to bed.
Since then it has migrated 4,200,000 rows, which is much too slow. It appears to run the 10 UPDATE queries in about 50 seconds and then spend 10 minutes on a VACUUM. I'm changing the numbers to only vacuum once every 100 iterations.
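For context, each iteration of the backfiller boils down to a batched UPDATE followed by an occasional VACUUM. A minimal sketch of the shape of those statements, assuming the Jobs/Jobsets schema discussed in this PR (the real queries live in `hydra-backfill-ids` and may differ):

```sql
-- One batch: copy the owning jobset's id onto up to 10,000 Jobs rows
-- that don't have one yet.
UPDATE Jobs
SET jobset_id = (
    SELECT id FROM Jobsets
    WHERE Jobsets.project = Jobs.project
      AND Jobsets.name = Jobs.jobset
)
WHERE (project, jobset, name) IN (
    SELECT project, jobset, name FROM Jobs
    WHERE jobset_id IS NULL
    LIMIT 10000
);

-- Every N batches: reclaim the dead tuples the updates leave behind,
-- keeping table bloat (and disk usage) under control.
VACUUM Jobs;
```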
It migrated 3,000,000 rows in the last hour. I've changed it to vacuum every 500 iterations, since disk space is still well under control.
By the way, before running this migration, the latest-finished URL for nixos-unstable-small took over 8 minutes on an EPYC machine with 64 GB of RAM.
In 1h45min it migrated 5,000,000 rows, which suggests this will finish in about 33 hours, hopefully :).
The next 500 iterations / 5,000,000 rows took 3h10min. The slowdown is a bit spooky to me. Let's see how it progresses.
Just reached vacuum #3, nearly 6 hours later. It seems strange that this is taking increasing amounts of time. Maybe using a ratcheting minimum ID to update would make this faster.
The SELECT inside the UPDATE was taking an increasing amount of time as the first few million rows became non-null. I've done two things (the sketch below illustrates the ratcheting half). This new run started with MUCH faster 10k increments (~2-8s). Incredibly, this managed to do 500 iterations (5,000,000 rows) in less than 20 minutes. If this continues, it might be done before I wake up.
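The ratcheting-minimum-ID idea can be sketched as follows. This is an illustration against the Builds table, not the exact change, and `:last_id` is a hypothetical placeholder that the driving script would carry forward between batches:

```sql
-- Without the ratchet, the inner SELECT rescans millions of rows whose
-- jobset_id is already set. Keeping a high-water mark avoids that.
UPDATE Builds
SET jobset_id = (
    SELECT id FROM Jobsets
    WHERE Jobsets.project = Builds.project
      AND Jobsets.name = Builds.jobset
)
WHERE Builds.id IN (
    SELECT id FROM Builds
    WHERE jobset_id IS NULL
      AND id > :last_id    -- the ratchet: advances after every batch
    ORDER BY id
    LIMIT 10000
);
```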
I guess the VACUUM is taking longer than I expected. But nonetheless, we're getting pretty close.
Round 2, which covers locked rows, is running now... then we can try out part 2 and see how the performance goes.
Finished about 30 minutes ago, including the final VACUUM. I'm thinking 24h is probably a good bet for how long it'll take in prod, but of course under actual production load it may be very different. At any rate, it can happen while Hydra is up and running, and disk space never got out of control. Main objectives achieved :). I still don't want to merge this until after validating part 2.
Part two's migrations:
Part 2 got us to a 10-second query. I didn't like one part of the query plan, so I tried adding an index (it took 10 minutes to build): that turned the parallel gather merge into an index scan and brought the whole query down to ~250ms (or less).
(Thanks to @LnL7, who suggested this.)
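The thread doesn't show the index definition, so as an illustration only: an index shaped like the one below is the kind of thing that lets the planner satisfy a latest-finished lookup with an index scan instead of a parallel gather merge (the name and column choice here are assumptions):

```sql
-- Hypothetical index: walk one jobset's builds in descending id order,
-- restricted to finished builds, so "latest finished" needs no sort.
CREATE INDEX CONCURRENTLY IndexBuildsOnJobsetIdFinished
    ON Builds (jobset_id, id DESC)
    WHERE finished = 1;
```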
I think this is ready, and I've opened up #714 for part 2. |
A PostgreSQL column which is non-null and unique is treated with the same optimisations as a primary key, so we have no need to try to recreate the `id` column as the primary key. No read paths are impacted by this change, and the database will automatically create an ID for each insert. Thus, no code needs to change.
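A minimal sketch of that idea, assuming the Jobsets table from this PR (the constraint name is made up):

```sql
-- A SERIAL column is filled from a sequence on every insert, and
-- NOT NULL + UNIQUE gives the planner the same guarantees as a
-- primary key, so the existing primary key can stay untouched.
ALTER TABLE Jobsets
    ADD COLUMN id SERIAL NOT NULL,
    ADD CONSTRAINT Jobsets_id_unique UNIQUE (id);
```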
Also adds an explicitly named "jobs" accessor to the Jobsets Schema object, which uses the project/jobset name.
Also adds an explicitly named "builds" accessor to the Jobsets Schema object, which uses the project/jobset name.
Vacuum every 10 iterations, update 10k at a time.
Force-pushed from 8c11bdf to c4cc72f.
@grahamc This is a really great improvement, thanks! As the upgrade requires manual steps (due to the backfiller), I realized that anyone upgrading their Hydra will be hit by a big surprise, since there is no document in the repo listing the necessary steps. They are only discoverable by rummaging through closed PRs, which is not the best way to do it. So, what would you say about starting to have proper, versioned releases, with a changelog file documenting the changes and, more importantly, the breaking changes?
Upgrades Hydra to the latest master/flake branch. To perform this upgrade, a non-trivial db migration is needed which provides a massive performance improvement[1]. The basic ideas behind multi-step upgrades of services between NixOS versions have been gathered already[2]; for further context it's recommended to read this first. Basically, the following steps are needed:

* Upgrade to a non-breaking version of Hydra with the db changes (columns are still nullable here). If `system.stateVersion` is set to something older than 20.03, the package will be selected automatically; otherwise `pkgs.hydra-migration` needs to be used.
* Run `hydra-backfill-ids` on the server.
* Deploy either `pkgs.hydra-unstable` (for Hydra master) or `pkgs.hydra-flakes` (for flakes support) to activate the optimization.

The steps are also documented in the release notes and in the module using `warnings`. `pkgs.hydra` has been removed, as latest Hydra doesn't compile with `pkgs.nixStable`, and to ensure a graceful migration using the newly introduced packages. To verify the approach, a simple VM test has been added which verifies the migration steps.

[1] NixOS/hydra#711
[2] NixOS#82353 (comment)
I'm seeing somewhat strange behavior while running `hydra-backfill-ids`. This is the beginning of the log:

```
(pass 1/2) Backfilling Jobs records where jobset_id is NULL...
(pass 1/2) Total Jobs records without a jobset_id: 11
(pass 1/2) (batch #1; 11 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #2; 8 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #3; 5 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #4; 2 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #5; -1 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #6; -4 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #7; -7 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #8; -10 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #9; -13 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #10; -16 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #11; -19 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #12; -22 remaining) Jobs.jobset_id: affected 3 rows...
```

The script claims that only 11 jobs need to be modified, causing the remaining count to quickly go negative. Also, the batches are only affecting 3 rows at a time.

If you cancel and restart, does it do that again?
Yes.
Is there an evaluator or queue runner running at the same time adding jobs? I’m not too worried about the remaining number going below zero. When you run it again, does the number to fix change?
No jobs are being added. The queue runner is stopped because I have been having some issues with my builders. There are 40 jobs in the queue right now. I had the script running in the background for a few hours, and it reached a remaining count below -3,000,000 before I started to suspect something wasn't working right. I don't know what the remaining count was the first time I ran the script, but it has remained at 11 each time I have run it since. Some queries:
I'm getting this error consistently. EDIT: Manually removing the duplicate rows fixed the issue.
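For anyone hitting the same error, a hedged sketch of how one might locate such duplicates before removing them by hand. These are not the queries used above, and the (project, jobset, name) key is an assumption based on the Jobs table discussed in this PR:

```sql
-- Find Jobs rows sharing the same natural key; each group with
-- copies > 1 would violate a unique constraint during the migration.
SELECT project, jobset, name, COUNT(*) AS copies
FROM Jobs
GROUP BY project, jobset, name
HAVING COUNT(*) > 1;
```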
Ready to go.
The first half of #710, adding nullable jobset_id fields and populating them.
Currently marking this as a draft; here is my personal to-do list before merging:
* `nix-build ./release.nix -A build -A manual -A tests.install -A tests.api -A tests.notifications` works