
Give Jobsets an ID; add jobset_id to builds and jobs. #710

Closed
wants to merge 13 commits

Conversation

@grahamc (Member) commented Feb 6, 2020

Some background on this issue is available:


note: each commit here is pretty small and should be very easy to review.

Here is the proposed plan for this migration:

  1. Add an id to the Jobsets table: serial, non-null, unique. (A DDL sketch of these schema changes appears after this list.)
    • Fast.
    • PostgreSQL will automatically backfill existing rows, and new writes will automatically get an id too.
    • Code changes: None. Fully backward and forward compatible.
  2. Add a jobset_id to the Jobs table: nullable, foreign key to Jobsets
    • Fast.
    • Code changes:
      • All places writing to Jobs should begin writing the jobset_id
  3. Add a jobset_id to the Builds table: nullable, foreign key to Jobsets (3f074388)
    • Fast.
    • Code changes:
      • All places writing to Builds should begin writing the jobset_id
  4. Backfill Jobs with jobset_id values.
    1. The naive way to backfill has many problems:
      1. Very slow
      2. Huge amount of time with a read lock.
      3. Rewrites the entire table on disk in one shot, causing a full 2x table bloat
    2. Solution:
      • Create a purpose-built tool to incrementally backfill the table:
        1. running this in a loop:

           UPDATE jobs
           SET jobset_id = (
             SELECT jobsets.id
             FROM jobsets
             WHERE jobsets.name = jobs.jobset
               AND jobsets.project = jobs.project
           )
           WHERE (jobs.project, jobs.jobset, jobs.name) in (
             SELECT jobsprime.project, jobsprime.jobset, jobsprime.name
             FROM jobs jobsprime
             WHERE jobsprime.jobset_id IS NULL
             FOR UPDATE SKIP LOCKED
             LIMIT 10000
           );
          
        2. Every N iterations, run VACUUM

      • Hydra can stay fully online during the entire migration
      • The subselect of a specific collection of IDs allows the write lock to only affect those rows.
      • VACUUM will prevent 2x table bloat from happening all at once
  5. Backfill Builds with jobset_id values.
    1. The naive way to backfill has many problems:
      1. Very slow
      2. Huge amount of time with a read lock.
      3. Rewrites the entire table on disk in one shot, causing a full 2x table bloat
    2. Solution:
      • Create a purpose-built tool to incrementally backfill the table:
        1. running this in a loop:

           UPDATE builds
           SET jobset_id = (
             SELECT jobsets.id
             FROM jobsets
             WHERE jobsets.name = builds.jobset
               AND jobsets.project = builds.project
           )
           WHERE builds.id in (
             SELECT buildprime.id
             FROM builds buildprime
             WHERE buildprime.jobset_id IS NULL
             ORDER BY buildprime.id
             FOR UPDATE SKIP LOCKED
             LIMIT 10000
           );
          
        2. Every N iterations, run VACUUM

      • Hydra can stay fully online during the entire migration
      • The subselect of a specific collection of IDs allows the write lock to only affect those rows.
      • VACUUM will prevent 2x table bloat from happening all at once
  6. Perform an explicit release, put it into production, and run the backfill tool until all rows are updated.
    1. Monitor to see if new Builds or Jobs rows are added with null jobset_id fields; this would indicate a bug which needs to be fixed.
  7. Modify the Builds table, doing two things in one transaction:
    1. Assert that there are no rows with a null jobset_id; any row with a null jobset_id is likely from a bug we should fix.
    2. Alter the Builds table to make jobset_id not-null. Hopefully not very slow; I think PostgreSQL will only need to validate that no rows have a null jobset_id.
  8. Modify the Jobs table, doing two things in one transaction:
    1. Assert that there are no rows with a null jobset_id; any row with a null jobset_id is likely from a bug we should fix.
    2. Alter the Jobs table to make jobset_id not-null. Hopefully not very slow; I think PostgreSQL will only need to validate that no rows have a null jobset_id.
  9. Alter the read paths to read through the jobset_id field for Builds and Jobs.
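
For reference (see step 1 above), here is a minimal sketch of the schema changes in steps 1-3 and 7-8. The exact types, constraint names, and any ON DELETE behaviour are assumptions on my part; the real statements live in the hydra-init migrations:

    -- Step 1: give Jobsets a serial id; SERIAL implies NOT NULL, and UNIQUE backs it with an index.
    ALTER TABLE Jobsets ADD COLUMN id SERIAL UNIQUE;

    -- Steps 2 and 3: add a nullable jobset_id foreign key to Jobs and Builds.
    ALTER TABLE Jobs ADD COLUMN jobset_id INTEGER REFERENCES Jobsets(id);
    ALTER TABLE Builds ADD COLUMN jobset_id INTEGER REFERENCES Jobsets(id);

    -- Steps 7 and 8: after the backfill, require the column (PostgreSQL only has to scan and validate).
    ALTER TABLE Jobs ALTER COLUMN jobset_id SET NOT NULL;
    ALTER TABLE Builds ALTER COLUMN jobset_id SET NOT NULL;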

Under this plan we would:

  1. merge the first half of the PR (up until Jobs.jobset_id: make not-null)
  2. deploy to hydra.nixos.org
  3. run hydra-init to add the nullable columns
  4. run hydra-backfill-ids
  5. once that is done, give the code a day or so in prod
  6. run hydra-backfill-ids to see if any new records have been created with NULL jobset_id columns (a direct SQL check for this is sketched below)
  7. if any bugs are found, fix & deploy those changes, repeating 5 & 6 until we see no more records with NULL jobset_id values.

THEN: merge the second half (migration making them not-null) and deploy to production.
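
As a sanity check for step 6 above, the NULL check can also be run directly against the database; a minimal query, assuming the column names from the plan, would be:

    -- Both counts should stay at zero once every write path sets jobset_id.
    SELECT count(*) FROM jobs WHERE jobset_id IS NULL;
    SELECT count(*) FROM builds WHERE jobset_id IS NULL;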

This becomes a bit more complicated for users other than nixos.org. I have tried to make a clear warning in hydra-init. We might want to create a branch off of master after merging the first half, in case there are bugs found in the code after we have merged the second half.

Currently marking this as a draft; here is my personal to-do list before merging:

  • Run the first half against a replica of production's database, and time how long the migration and backfiller take
  • Run the evaluator a bit to add more rows
  • Run the second half of the migration and confirm it both works and doesn't take a long time
  • Verify that the queries for the latest-finished URL complete quickly, and that the EXPLAIN ANALYZE output is improved over the current state.
  • Open two PRs, one for the first half and the other for the second half.
  • Confirm for both of those PRs that nix-build ./release.nix -A build -A manual -A tests.install -A tests.api -A tests.notifications works a final time.
  • Copy the flake migration from the flake branch #713

A PostgreSQL column which is non-null and unique is treated with
the same optimisations as a primary key, so we have no need to
try to recreate the `id` as the primary key.

No read paths are impacted by this change, and the database will
automatically create an ID for each insert. Thus, no code needs to
change.
Also, adds an explicitly named "jobs" accessor to the Jobsets
Schema object, which uses the project/jobset name.
Also, adds an explicitly named "builds" accessor to the Jobsets
Schema object, which uses the project/jobset name.
Vacuum every 10 iterations, update 10k at a time.
-- very important to figure out where the nullable columns came from.

ALTER TABLE Builds
ALTER COLUMN jobset_id SET NOT NULL;
A Member commented:

Drop the project and jobset columns?

@grahamc (Member Author) replied:

I think we should eventually, but I'm a bit afraid to do that in this PR since it makes it basically impossible to undo if there is a mistake. Also, it means we need to be 100% certain every read path is updated. I think that is a good improvement to make for a follow-up PR, maybe within a few weeks of this being deployed. Sound okay?

@knl (Contributor) commented Feb 7, 2020

What is the motivation for this change?

@grahamc (Member Author) commented Feb 7, 2020

The query executed by the latest-finished link for a given job is very inefficient. The query for the tested job in the nixos-unstable-small jobset has started taking well over 3 minutes, and on an unloaded and fairly powerful test box took over 8 minutes.

Every slow query reported by our database server involves reading or writing to the Builds table.

Here are some table sizes from hydra.nixos.org:

select count(*) from jobs; => 4,642,712
select count(*) from builds; => 107,935,932
select count(*) from jobsetevalmembers; => 1,176,106,887

Here is the EXPLAIN ANALYZE for the latest-finished query, which took 230s last Tuesday. This query has been "slow" 9,294 times since Jan 07 21:18:30.

    EXPLAIN ANALYZE SELECT me.id, me.finished, me.timestamp, me.project, me.jobset, me.job,
        me.nixname, me.description, me.drvpath, me.system, me.license, me.homepage,
        me.maintainers, me.maxsilent, me.timeout, me.ischannel, me.iscurrent,
        me.nixexprinput, me.nixexprpath, me.priority, me.globalpriority,
        me.starttime, me.stoptime, me.iscachedbuild, me.buildstatus, me.size,
        me.closuresize, me.releasename, me.keep, me.notificationpendingsince
    FROM Builds me
    LEFT JOIN JobsetEvalMembers jobsetevalmembers
        ON jobsetevalmembers.build = me.id
        WHERE (
            (
                not exists (
                    select 1 from jobsetevalmembers m2
                    join builds b2
                        on jobsetevalmembers.eval = m2.eval
                            and m2.build = b2.id
                            and b2.finished = 0
                )
                AND me.buildstatus = '0'
                AND me.finished = '1'
                AND me.job = 'tested'
                AND me.jobset = 'unstable-small'
                AND me.project = 'nixos'
            )
        )
    ORDER BY id DESC LIMIT '1'
   
    Limit  (cost=2639.88..2639.88 rows=1 width=413) (actual time=232894.816..232894.818 rows=1 loops=1)
       ->  Sort  (cost=2639.88..2639.97 rows=37 width=413) (actual time=232894.815..232894.815 rows=1 loops=1)
             Sort Key: me.id DESC
             Sort Method: top-N heapsort  Memory: 25kB
             ->  Nested Loop Left Join  (cost=1.27..2639.70 rows=37 width=413) (actual time=90.700..232885.490 rows=3884 loops=1)
                   Filter: (NOT (SubPlan 1))
                   ->  Index Scan using indexbuildsonjobfinishedid on builds me  (cost=0.69..8.70 rows=6 width=413) (actual time=0.056..36.171 rows=3884 loops=1)
                         Index Cond: ((project = 'nixos'::text) AND (jobset = 'unstable-small'::text) AND (job = 'tested'::text) AND (finished = 1))
                         Filter: (buildstatus = 0)
                         Rows Removed by Filter: 2830
                   ->  Index Scan using indexjobsetevalmembersonbuild on jobsetevalmembers  (cost=0.58..4.74 rows=100 width=8) (actual time=0.010..0.011 rows=1 loops=3884)
                         Index Cond: (build = me.id)
                   SubPlan 1
                     ->  Nested Loop  (cost=0.99..4.33 rows=1 width=0) (actual time=59.936..59.936 rows=0 loops=3884)
                           ->  Index Scan using indexbuildsonfinished on builds b2  (cost=0.41..1.53 rows=1 width=4) (actual time=0.005..22.836 rows=33852 loops=3884)
                           ->  Index Only Scan using jobsetevalmembers_pkey on jobsetevalmembers m2  (cost=0.58..2.80 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=131481168)
                                 Index Cond: ((eval = jobsetevalmembers.eval) AND (build = b2.id))
                                 Heap Fetches: 0
     Planning Time: 1.731 ms
     Execution Time: 232894.859 ms
    (20 rows)

In prior tests, changing this to use a jobset_id on the Builds table significantly improved this query's performance: it returned in just 26.727 ms. Before merging these changes, I will be validating that they actually deliver the performance improvement we're hoping for.
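
For illustration only (this is not the exact query Hydra will issue, and it ignores the eval-membership subquery above), the new column lets the hot part of the lookup filter on a single integer instead of two text columns, roughly:

    -- Hypothetical shape of the latest-finished lookup once Builds carries jobset_id:
    SELECT me.id, me.finished, me.buildstatus
    FROM builds me
    WHERE me.jobset_id = (SELECT id FROM jobsets
                          WHERE project = 'nixos' AND name = 'unstable-small')
      AND me.job = 'tested'
      AND me.finished = 1
      AND me.buildstatus = 0
    ORDER BY me.id DESC
    LIMIT 1;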

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/upcoming-hydra-schema-change/5806/1

@grahamc (Member Author) commented Feb 9, 2020

I've opened #714 for part two. Both part 1 and part 2 work quite nicely on my replica of production.

grahamc added a commit that referenced this pull request Feb 10, 2020
@grahamc (Member Author) commented Feb 10, 2020

Deployment of Part 1

Part 1 is deployed and the backfiller is running as of Mon Feb 10 18:12:38 CET 2020.

The first vacuum finished at 18:43, and now it is updating the Jobs table in batches of 10,000 at about 3-4 batches per second. Looks like there are 3.5 million rows.

@grahamc (Member Author) commented Feb 10, 2020

Pass 2 of the Jobs table finished at Mon Feb 10 18:46:08 CET 2020. Now on to the Builds table; this is the big one. Counting the number of rows itself takes an appreciable amount of time (84 seconds).

The builds table is able to do a 10k batch in about 2-3 seconds. 108,403,861 rows to do, ~108,303,900 rows remaining.

@grahamc (Member Author) commented Feb 10, 2020

The first VACUUM started at Mon Feb 10 19:04:06 CET 2020; we are averaging about 2.5s per 10k batch. The VACUUM finished at Mon Feb 10 19:17:43 CET 2020.

@grahamc (Member Author) commented Feb 10, 2020

The third VACUUM just started (Mon Feb 10 19:52:18 CET 2020). We have 93,403,861 rows remaining. This server is handling the migration much faster than my test box. We will very plausibly finish this in less than 8 hours.

@grahamc (Member Author) commented Feb 10, 2020

The sixth vacuum started:

Mon Feb 10 21:19:24 CET 2020    (pass 1/2) (batch #3000) Vacuuming...

We have currently updated IDs 1 to 33,581,838, with 78,403,861 IDs to go.

@grahamc (Member Author) commented Feb 10, 2020

We're halfway:

Mon Feb 10 23:49:48 CET 2020 (pass 1/2) (batch #5281; 55603861 remaining) Builds.jobset_id: affected 10000 rows; max ID: 56486161 -> 56496161

so another 5-6 hours, I guess.

@grahamc (Member Author) commented Feb 11, 2020

Tue Feb 11 03:33:26 CET 2020    (pass 1/2) (batch #8500; 23413861 remaining) Builds.jobset_id: affected 10000 rows; max ID: 88771122 -> 88781122
Tue Feb 11 03:33:26 CET 2020    (pass 1/2) (batch #8500) Vacuuming...

@grahamc (Member Author) commented Feb 11, 2020

Deploy of part 1 finished

Tue Feb 11 06:50:52 CET 2020    (pass 1/2) Backfilling Jobs records where jobset_id is NULL...
Tue Feb 11 06:50:54 CET 2020    (pass 1/2) Total Jobs records without a jobset_id: 0
Tue Feb 11 06:50:55 CET 2020    (pass 1/2) (batch #1; 0 remaining) Jobs.jobset_id: affected 0E0 rows...
Tue Feb 11 06:50:55 CET 2020    (pass 2/2) Backfilling Jobs records where jobset_id is NULL...
Tue Feb 11 06:50:55 CET 2020    (pass 2/2) Total Jobs records without a jobset_id: 0
Tue Feb 11 06:50:55 CET 2020    (pass 2/2) (batch #1; 0 remaining) Jobs.jobset_id: affected 0E0 rows...
Tue Feb 11 06:50:55 CET 2020    (pass 1/2) Backfilling unlocked Builds records where jobset_id is NULL...
Tue Feb 11 07:03:50 CET 2020    (pass 1/2) Total Builds records without a jobset_id: 0, starting at
Tue Feb 11 07:03:50 CET 2020    (pass 1/2) (batch #1; 0 remaining) Builds.jobset_id: affected 0 rows; max ID:  ->
Tue Feb 11 07:03:50 CET 2020    (pass 2/2) Backfilling all Builds records where jobset_id is NULL...
Tue Feb 11 07:14:14 CET 2020    (pass 2/2) Total Builds records without a jobset_id: 0, starting at
Tue Feb 11 07:14:14 CET 2020    (pass 2/2) (batch #1; 0 remaining) Builds.jobset_id: affected 0 rows; max ID:  ->
Tue Feb 11 07:14:14 CET 2020    Ending with a VACUUM

@grahamc (Member Author) commented Feb 11, 2020

After a minor oopsie: 8347934

we're deploying part 2:

Feb 11 14:38:49 ceres systemd[1]: Starting hydra-init.service...
Feb 11 14:38:50 ceres hydra-init[2203]: upgrading Hydra schema from version 61 to 62
Feb 11 14:38:50 ceres hydra-init[2203]: executing SQL statement: ALTER TABLE Jobs
Feb 11 14:38:50 ceres hydra-init[2203]:   ALTER COLUMN jobset_id SET NOT NULL
Feb 11 14:38:51 ceres hydra-init[2203]: upgrading Hydra schema from version 62 to 63
Feb 11 14:38:51 ceres hydra-init[2203]: executing SQL statement: ALTER TABLE Builds
Feb 11 14:38:51 ceres hydra-init[2203]:   ALTER COLUMN jobset_id SET NOT NULL

@grahamc (Member Author) commented Feb 11, 2020

We're done; parts 1 and 2 are merged and deployed. Initially, the latest-finished URL for nixos-unstable-small returned in ~500ms, though it is now not responding -- I think because this involved a full restart of Hydra, which is a very heavy procedure. I'll give it another try in a bit.
