
Give Jobsets an ID; add jobset_id to builds and jobs. #710

Closed
wants to merge 13 commits

Conversation

@grahamc (Member) commented Feb 6, 2020

Some background on this issue is available:


note: each commit here is pretty small and should be very easy to review.

Here is the proposed plan for this migration:

  1. Add an id to the Jobsets table: serial, non-null, unique. (A DDL sketch of these schema changes appears after this list.)
    • Fast.
    • PostgreSQL will automatically backfill existing rows, and new writes will automatically get an id too.
    • Code changes: None. Fully backward and forward compatible.
  2. Add a jobset_id to the Jobs table: nullable, foreign key to Jobsets
    • Fast.
    • Code changes:
      • All places writing to Jobs should begin writing the jobset_id
  3. Add a jobset_id to the Builds table: nullable, foreign key to Jobsets (3f074388)
    • Fast.
    • Code changes:
      • All places writing to Builds should begin writing the jobset_id
  4. Backfill Jobs with jobset_id values.
    1. The naive way to backfill has many problems:
      1. Very slow
      2. Huge amount of time with a read lock.
      3. Rewrites the entire table on disk in one shot, causing a full 2x table bloat
    2. Solution:
      • Create a purpose-built tool to incrementally backfill the table:
        1. running this in a loop:

           UPDATE jobs
           SET jobset_id = (
             SELECT jobsets.id
             FROM jobsets
             WHERE jobsets.name = jobs.jobset
               AND jobsets.project = jobs.project
           )
           WHERE (jobs.project, jobs.jobset, jobs.name) in (
             SELECT jobsprime.project, jobsprime.jobset, jobsprime.name
             FROM jobs jobsprime
             WHERE jobsprime.jobset_id IS NULL
             FOR UPDATE SKIP LOCKED
             LIMIT 10000
           );
          
        2. Every N iterations, run VACUUM

      • Hydra can stay fully online during the entire migration
      • The subselect of a specific collection of IDs allows the write lock to only affect those rows.
      • VACUUM will prevent 2x table bloat from happening all at once
  5. Backfill Builds with jobset_id values.
    1. The naive way to backfill has many problems:
      1. Very slow
      2. Huge amount of time with a read lock.
      3. Rewrites the entire table on disk in one shot, causing a full 2x table bloat
    2. Solution:
      • Create a purpose-built tool to incrementally backfill the table:
        1. running this in a loop:

           UPDATE builds
           SET jobset_id = (
             SELECT jobsets.id
             FROM jobsets
             WHERE jobsets.name = builds.jobset
               AND jobsets.project = builds.project
           )
           WHERE builds.id in (
             SELECT buildprime.id
             FROM builds buildprime
             WHERE buildprime.jobset_id IS NULL
             ORDER BY buildprime.id
             FOR UPDATE SKIP LOCKED
             LIMIT 10000
           );
          
        2. Every N iterations, run VACUUM

      • Hydra can stay fully online during the entire migration
      • The subselect of a specific collection of IDs allows the write lock to only affect those rows.
      • VACUUM will prevent 2x table bloat from happening all at once
  6. Perform an explicit release, put it into production, and run the backfill tool until all rows are updated.
    1. Monitor to see if new Builds or Jobs rows are added with null jobset_id fields; this would indicate a bug which needs to be fixed.
  7. Modify the Builds table, doing two things in one transaction:
    1. Assert that there are no rows with a null jobset_id; any row with a null jobset_id is likely from a bug we should fix.
    2. Alter the Builds table to make jobset_id not-null. Hopefully not very slow; I think PostgreSQL will only need to validate that no rows have a null jobset_id.
  8. Modify the Jobs table, doing two things in one transaction:
    1. Assert that there are no rows with a null jobset_id; any row with a null jobset_id is likely from a bug we should fix.
    2. Alter the Jobs table to make jobset_id not-null. Hopefully not very slow; I think PostgreSQL will only need to validate that no rows have a null jobset_id.
  9. Alter the read paths to read through the jobset_id field for Builds and Jobs.
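
For reference (see step 1 above), here is a minimal sketch of the schema changes in steps 1-3 and 7-8. The exact types, constraint names, and any ON DELETE behaviour are assumptions on my part; the real statements live in the hydra-init migrations:

    -- Step 1: give Jobsets a serial id; SERIAL implies NOT NULL, and UNIQUE backs it with an index.
    ALTER TABLE Jobsets ADD COLUMN id SERIAL UNIQUE;

    -- Steps 2 and 3: add a nullable jobset_id foreign key to Jobs and Builds.
    ALTER TABLE Jobs ADD COLUMN jobset_id INTEGER REFERENCES Jobsets(id);
    ALTER TABLE Builds ADD COLUMN jobset_id INTEGER REFERENCES Jobsets(id);

    -- Steps 7 and 8: after the backfill, require the column (PostgreSQL only has to scan and validate).
    ALTER TABLE Jobs ALTER COLUMN jobset_id SET NOT NULL;
    ALTER TABLE Builds ALTER COLUMN jobset_id SET NOT NULL;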

Under this plan we would:

  1. merge the first half of the PR (up until Jobs.jobset_id: make not-null)
  2. deploy to hydra.nixos.org
  3. run hydra-init to add the nullable columns
  4. run hydra-backfill-ids
  5. once that is done, give the code a day or so in prod
  6. run hydra-backfill-ids to see if any new records have been created with NULL jobset_id columns (a direct SQL check for this is sketched below)
  7. if any bugs are found, fix & deploy those changes, repeating 5 & 6 until we see no more records with NULL jobset_id values.

THEN: merge the second half (migration making them not-null) and deploy to production.
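
As a sanity check for step 6 above, the NULL check can also be run directly against the database; a minimal query, assuming the column names from the plan, would be:

    -- Both counts should stay at zero once every write path sets jobset_id.
    SELECT count(*) FROM jobs WHERE jobset_id IS NULL;
    SELECT count(*) FROM builds WHERE jobset_id IS NULL;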

This becomes a bit more complicated for users other than nixos.org. I have tried to make a clear warning in hydra-init. We might want to create a branch off of master after merging the first half, in case there are bugs found in the code after we have merged the second half.

Currently marking this as a draft; here is my personal to-do list before merging:

  • Run the first half against a replica of production's database, and time how long the migration and backfiller take
  • Run the evaluator a bit to add more rows
  • Run the second half of the migration and confirm it both works and doesn't take a long time
  • Verify that the queries for the latest-finished URL complete quickly, and that the EXPLAIN ANALYZE output is improved over the current state.
  • Open two PRs, one for the first half and the other for the second half.
  • Confirm for both of those PRs that nix-build ./release.nix -A build -A manual -A tests.install -A tests.api -A tests.notifications works a final time.
  • Copy the flake migration from the flake branch #713

A PostgreSQL column which is non-null and unique is treated with
the same optimisations as a primary key, so we have no need to
try to recreate the `id` as the primary key.

No read paths are impacted by this change, and the database will
automatically create an ID for each insert. Thus, no code needs to
change.
Also, adds an explicitly named "jobs" accessor to the Jobsets
Schema object, which uses the project/jobset name.
Also, adds an explicitly named "builds" accessor to the Jobsets
Schema object, which uses the project/jobset name.
Vacuum every 10 iterations, update 10k at a time.
-- very important to figure out where the nullable columns came from.

ALTER TABLE Builds
ALTER COLUMN jobset_id SET NOT NULL;
A Member commented:

Drop the project and jobset columns?

@grahamc (Member Author) replied:

I think we should eventually, but I'm a bit afraid to do that in this PR since it makes it basically impossible to undo if there is a mistake. Also, it means we need to be 100% certain every read path is updated. I think that is a good improvement to make for a follow-up PR, maybe within a few weeks of this being deployed. Sound okay?

@knl (Contributor) commented Feb 7, 2020

What is the motivation for this change?

@grahamc (Member Author) commented Feb 7, 2020

The query executed by the latest-finished link for a given job is very inefficient. The query for the tested job in the nixos-unstable-small jobset has started taking well over 3 minutes, and on an unloaded and fairly powerful test box took over 8 minutes.

Every slow query reported by our database server involves reading or writing to the Builds table.

Here are some table sizes from hydra.nixos.org:

select count(*) from jobs; => 4,642,712
select count(*) from builds; => 107,935,932
select count(*) from jobsetevalmembers; => 1,176,106,887

Here is the EXPLAIN ANALYZE for the latest-finished query, which took 230s last Tuesday. This query has been "slow" 9,294 times since Jan 07 21:18:30.

    EXPLAIN ANALYZE SELECT me.id, me.finished, me.timestamp, me.project, me.jobset, me.job,
        me.nixname, me.description, me.drvpath, me.system, me.license, me.homepage,
        me.maintainers, me.maxsilent, me.timeout, me.ischannel, me.iscurrent,
        me.nixexprinput, me.nixexprpath, me.priority, me.globalpriority,
        me.starttime, me.stoptime, me.iscachedbuild, me.buildstatus, me.size,
        me.closuresize, me.releasename, me.keep, me.notificationpendingsince
    FROM Builds me
    LEFT JOIN JobsetEvalMembers jobsetevalmembers
        ON jobsetevalmembers.build = me.id
        WHERE (
            (
                not exists (
                    select 1 from jobsetevalmembers m2
                    join builds b2
                        on jobsetevalmembers.eval = m2.eval
                            and m2.build = b2.id
                            and b2.finished = 0
                )
                AND me.buildstatus = '0'
                AND me.finished = '1'
                AND me.job = 'tested'
                AND me.jobset = 'unstable-small'
                AND me.project = 'nixos'
            )
        )
    ORDER BY id DESC LIMIT '1'
   
    Limit  (cost=2639.88..2639.88 rows=1 width=413) (actual time=232894.816..232894.818 rows=1 loops=1)
       ->  Sort  (cost=2639.88..2639.97 rows=37 width=413) (actual time=232894.815..232894.815 rows=1 loops=1)
             Sort Key: me.id DESC
             Sort Method: top-N heapsort  Memory: 25kB
             ->  Nested Loop Left Join  (cost=1.27..2639.70 rows=37 width=413) (actual time=90.700..232885.490 rows=3884 loops=1)
                   Filter: (NOT (SubPlan 1))
                   ->  Index Scan using indexbuildsonjobfinishedid on builds me  (cost=0.69..8.70 rows=6 width=413) (actual time=0.056..36.171 rows=3884 loops=1)
                         Index Cond: ((project = 'nixos'::text) AND (jobset = 'unstable-small'::text) AND (job = 'tested'::text) AND (finished = 1))
                         Filter: (buildstatus = 0)
                         Rows Removed by Filter: 2830
                   ->  Index Scan using indexjobsetevalmembersonbuild on jobsetevalmembers  (cost=0.58..4.74 rows=100 width=8) (actual time=0.010..0.011 rows=1 loops=3884)
                         Index Cond: (build = me.id)
                   SubPlan 1
                     ->  Nested Loop  (cost=0.99..4.33 rows=1 width=0) (actual time=59.936..59.936 rows=0 loops=3884)
                           ->  Index Scan using indexbuildsonfinished on builds b2  (cost=0.41..1.53 rows=1 width=4) (actual time=0.005..22.836 rows=33852 loops=3884)
                           ->  Index Only Scan using jobsetevalmembers_pkey on jobsetevalmembers m2  (cost=0.58..2.80 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=131481168)
                                 Index Cond: ((eval = jobsetevalmembers.eval) AND (build = b2.id))
                                 Heap Fetches: 0
     Planning Time: 1.731 ms
     Execution Time: 232894.859 ms
    (20 rows)

In prior tests, changing this to use a jobset_id on the Builds table significantly improved this query's performance: it returned in just 26.727 ms. Before merging these changes, I will be validating that they actually deliver the performance improvement we're hoping for.
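
For illustration only (this is not the exact query Hydra will issue, and it ignores the eval-membership subquery above), the new column lets the hot part of the lookup filter on a single integer instead of two text columns, roughly:

    -- Hypothetical shape of the latest-finished lookup once Builds carries jobset_id:
    SELECT me.id, me.finished, me.buildstatus
    FROM builds me
    WHERE me.jobset_id = (SELECT id FROM jobsets
                          WHERE project = 'nixos' AND name = 'unstable-small')
      AND me.job = 'tested'
      AND me.finished = 1
      AND me.buildstatus = 0
    ORDER BY me.id DESC
    LIMIT 1;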

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/upcoming-hydra-schema-change/5806/1

@grahamc (Member Author) commented Feb 9, 2020

I've opened #714 for part two. Both part 1 and part 2 work quite nicely on my replica of production.

grahamc added a commit that referenced this pull request Feb 10, 2020
@grahamc (Member Author) commented Feb 10, 2020

Deployment of Part 1

Part 1 is deployed and the backfiller is running as of Mon Feb 10 18:12:38 CET 2020.

The first vacuum finished at 18:43, and now it is updating the Jobs table in batches of 10,000 at about 3-4 batches per second. Looks like there are 3.5 million rows.

@grahamc (Member Author) commented Feb 10, 2020

Pass 2 of the Jobs table finished at Mon Feb 10 18:46:08 CET 2020. Now on to the Builds table; this is the big one. Counting the number of rows itself takes an appreciable amount of time (84 seconds).

The builds table is able to do a 10k batch in about 2-3 seconds. 108,403,861 rows to do, ~108,303,900 rows remaining.

@grahamc (Member Author) commented Feb 10, 2020

The first VACUUM started at Mon Feb 10 19:04:06 CET 2020; we are averaging about 2.5s per 10k batch. The VACUUM finished at Mon Feb 10 19:17:43 CET 2020.

@grahamc (Member Author) commented Feb 10, 2020

The third VACUUM just started (Mon Feb 10 19:52:18 CET 2020). We have 93,403,861 rows remaining. This server is handling the migration much faster than my test box. We will very plausibly finish this in less than 8 hours.

@grahamc (Member Author) commented Feb 10, 2020

The sixth vacuum started:

Mon Feb 10 21:19:24 CET 2020    (pass 1/2) (batch #3000) Vacuuming...

We have currently updated IDs 1 to 33,581,838, with 78,403,861 IDs to go.

@grahamc (Member Author) commented Feb 10, 2020

We're halfway:

Mon Feb 10 23:49:48 CET 2020 (pass 1/2) (batch #5281; 55603861 remaining) Builds.jobset_id: affected 10000 rows; max ID: 56486161 -> 56496161

so another 5-6 hours, I guess.

@grahamc (Member Author) commented Feb 11, 2020

Tue Feb 11 03:33:26 CET 2020    (pass 1/2) (batch #8500; 23413861 remaining) Builds.jobset_id: affected 10000 rows; max ID: 88771122 -> 88781122
Tue Feb 11 03:33:26 CET 2020    (pass 1/2) (batch #8500) Vacuuming...

@grahamc (Member Author) commented Feb 11, 2020

Deploy of part 1 finished

Tue Feb 11 06:50:52 CET 2020    (pass 1/2) Backfilling Jobs records where jobset_id is NULL...
Tue Feb 11 06:50:54 CET 2020    (pass 1/2) Total Jobs records without a jobset_id: 0
Tue Feb 11 06:50:55 CET 2020    (pass 1/2) (batch #1; 0 remaining) Jobs.jobset_id: affected 0E0 rows...
Tue Feb 11 06:50:55 CET 2020    (pass 2/2) Backfilling Jobs records where jobset_id is NULL...
Tue Feb 11 06:50:55 CET 2020    (pass 2/2) Total Jobs records without a jobset_id: 0
Tue Feb 11 06:50:55 CET 2020    (pass 2/2) (batch #1; 0 remaining) Jobs.jobset_id: affected 0E0 rows...
Tue Feb 11 06:50:55 CET 2020    (pass 1/2) Backfilling unlocked Builds records where jobset_id is NULL...
Tue Feb 11 07:03:50 CET 2020    (pass 1/2) Total Builds records without a jobset_id: 0, starting at
Tue Feb 11 07:03:50 CET 2020    (pass 1/2) (batch #1; 0 remaining) Builds.jobset_id: affected 0 rows; max ID:  ->
Tue Feb 11 07:03:50 CET 2020    (pass 2/2) Backfilling all Builds records where jobset_id is NULL...
Tue Feb 11 07:14:14 CET 2020    (pass 2/2) Total Builds records without a jobset_id: 0, starting at
Tue Feb 11 07:14:14 CET 2020    (pass 2/2) (batch #1; 0 remaining) Builds.jobset_id: affected 0 rows; max ID:  ->
Tue Feb 11 07:14:14 CET 2020    Ending with a VACUUM

@grahamc (Member Author) commented Feb 11, 2020

After a minor oopsie: 8347934

we're deploying part 2:

Feb 11 14:38:49 ceres systemd[1]: Starting hydra-init.service...
Feb 11 14:38:50 ceres hydra-init[2203]: upgrading Hydra schema from version 61 to 62
Feb 11 14:38:50 ceres hydra-init[2203]: executing SQL statement: ALTER TABLE Jobs
Feb 11 14:38:50 ceres hydra-init[2203]:   ALTER COLUMN jobset_id SET NOT NULL
Feb 11 14:38:51 ceres hydra-init[2203]: upgrading Hydra schema from version 62 to 63
Feb 11 14:38:51 ceres hydra-init[2203]: executing SQL statement: ALTER TABLE Builds
Feb 11 14:38:51 ceres hydra-init[2203]:   ALTER COLUMN jobset_id SET NOT NULL

@grahamc (Member Author) commented Feb 11, 2020

We're done; parts 1 and 2 are merged and deployed. Initially, the latest-finished URL for nixos-unstable-small returned in ~500ms, though it is now not responding -- I think because this involved a full restart of Hydra, which is a very heavy procedure. I'll give it another try in a bit.
