
jobset_id, #710 Part 1 #711

Merged: 7 commits into NixOS:master on Feb 10, 2020

Conversation


@grahamc grahamc commented Feb 6, 2020

Ready to go.

The first half of #710, adding nullable jobset_id fields and populating them.

I'm currently marking this as a draft; here is my personal to-do list before merging:

  • Run the first half against a replica of production's database, and time how long the migration and backfiller take
  • Run the evaluator a bit to add more rows
  • Run hydra-backfill-ids to see if new rows with NULL IDs appear; if any bugs are found, fix and deploy those changes, repeating the previous two steps until I see no more records with NULL jobset_id values (see the check query below)
  • Confirm that nix-build ./release.nix -A build -A manual -A tests.install -A tests.api -A tests.notifications works
  • Add the new index to the part-2 model
  • Merge the flakes migration into master so everybody has a consistent collection of migration IDs
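A quick way to run that NULL-ID check by hand (a sketch, assuming direct psql access to the hydra database; the same kind of count query shows up later in this thread):

hydra=# SELECT COUNT(*) FROM builds WHERE jobset_id IS NULL;
hydra=# SELECT COUNT(*) FROM jobs WHERE jobset_id IS NULL;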

@grahamc grahamc changed the title from "jobset_id, #710 Part 1" to "(wip) jobset_id, #710 Part 1" on Feb 6, 2020

grahamc commented Feb 7, 2020

I had to move all the migrations down by 1 to account for a migration on the flakes branch.

These three migrations are pretty fast:

0s:

Feb 07 04:32:05 database-restore-test hydra-init[78336]: upgrading Hydra schema from version 58 to 59
Feb 07 04:32:05 database-restore-test hydra-init[78336]: executing SQL statement: ALTER TABLE Jobsets
Feb 07 04:32:05 database-restore-test hydra-init[78336]:   ADD COLUMN id SERIAL NOT NULL,
Feb 07 04:32:05 database-restore-test hydra-init[78336]:   ADD CONSTRAINT Jobsets_id_unique UNIQUE (id)
Feb 07 04:32:05 database-restore-test hydra-init[78336]: upgrading Hydra schema from version 59 to 60

2m20s:

Feb 07 04:32:05 database-restore-test hydra-init[78336]: upgrading Hydra schema from version 59 to 60
Feb 07 04:32:05 database-restore-test hydra-init[78336]: executing SQL statement: ALTER TABLE Builds
Feb 07 04:32:05 database-restore-test hydra-init[78336]:   ADD COLUMN jobset_id integer NULL,
Feb 07 04:32:05 database-restore-test hydra-init[78336]:   ADD FOREIGN KEY (jobset_id)
Feb 07 04:32:05 database-restore-test hydra-init[78336]:       REFERENCES Jobsets(id)
Feb 07 04:32:05 database-restore-test hydra-init[78336]:       ON DELETE CASCADE
Feb 07 04:34:25 database-restore-test hydra-init[78336]: upgrading Hydra schema from version 60 to 61

2s:

Feb 07 04:34:25 database-restore-test hydra-init[78336]: upgrading Hydra schema from version 60 to 61
Feb 07 04:34:25 database-restore-test hydra-init[78336]: executing SQL statement: ALTER TABLE Jobs
Feb 07 04:34:25 database-restore-test hydra-init[78336]:   ADD COLUMN jobset_id integer NULL,
Feb 07 04:34:25 database-restore-test hydra-init[78336]:   ADD FOREIGN KEY (jobset_id)
Feb 07 04:34:25 database-restore-test hydra-init[78336]:       REFERENCES Jobsets(id)
Feb 07 04:34:25 database-restore-test hydra-init[78336]:       ON DELETE CASCADE
Feb 07 04:34:26 database-restore-test systemd[1]: Started hydra-init.service

Starting the backfill and going to bed.


grahamc commented Feb 7, 2020

Since then it has migrated 4,200,000 rows, much too slowly. It appears to run each group of 10 UPDATE queries in about 50 seconds and then spend roughly ten minutes on a VACUUM:

Fri Feb  7 11:50:39 UTC 2020    (batch #420) Vacuuming...
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_authid" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_subscription" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_database" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_db_role_setting" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_tablespace" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_pltemplate" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_auth_members" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_shdepend" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_shdescription" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_replication_origin" --- only superuser can vacuum it
Fri Feb  7 11:59:54 UTC 2020    WARNING:  skipping "pg_shseclabel" --- only superuser can vacuum it
Fri Feb  7 12:00:00 UTC 2020    (batch #421; 103746488 remaining) Builds.jobset_id: affected 10000 rows...

I'm changing the numbers to only vacuum once every 100 iterations.
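For context, each iteration is one batched UPDATE roughly along these lines (a sketch reconstructed from the Jobs statement quoted near the end of this thread; the real Builds statement may differ in its details):

UPDATE builds
SET jobset_id = (
  SELECT jobsets.id
  FROM jobsets
  WHERE jobsets.name = builds.jobset
    AND jobsets.project = builds.project
)
WHERE builds.id IN (
  SELECT buildsprime.id
  FROM builds buildsprime
  WHERE buildsprime.jobset_id IS NULL
  FOR UPDATE SKIP LOCKED
  LIMIT 10000
);
-- followed by a plain VACUUM every N iterations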


grahamc commented Feb 7, 2020

It migrated 3,000,000 rows in the last hour. I've changed it to vacuum every 500 iterations, since disk space is still well under control.


grahamc commented Feb 7, 2020

By the way: before running this migration, the latest-finished URL for nixos-unstable-small took over 8 minutes on an EPYC machine with 64 GB of RAM:

[root@database-restore-test:~]# time curl 127.0.0.1:3000/job/nixos/unstable-small/tested/latest-finished
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>Moved</title>
    </head>
    <body>
   <p>This item has moved <a href="http://127.0.0.1:3000/build/111728212">here</a>.</p>
</body>
</html>

real    8m20.782s
user    0m0.010s
sys     0m0.010s
System stats:
processor       : 47
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7401P 24-Core Processor
stepping        : 2
microcode       : 0x8001227
cpu MHz         : 2597.582
cache size      : 512 KB
physical id     : 0
siblings        : 48
core id         : 30
cpu cores       : 24
apicid          : 61
initial apicid  : 61
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 3992.40
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
[root@database-restore-test:~]# zpool list -v
NAME                                      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
npool                                    1.08T   619G   491G        -         -    33%    55%  1.00x    ONLINE  -
  wwn-0x55cd2e414e3c5b54-part2            111G   107G  4.15G        -         -    69%  96.2%      -  ONLINE
  wwn-0x55cd2e414e3c5c66                  111G   108G  3.29G        -         -    70%  97.0%      -  ONLINE
  ata-SSDSC2BB480G7R_PHDV7234021M480BGN   444G   203G   241G        -         -    25%  45.8%      -  ONLINE
  ata-SSDSC2BB480G7R_PHDV723401QT480BGN   444G   201G   243G        -         -    25%  45.3%      -  ONLINE
[root@database-restore-test:~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62          41           2          16          18           4
Swap:             0           0           0


grahamc commented Feb 7, 2020

In 1h45min, it migrated 5,000,000 rows:

Fri Feb  7 14:12:27 UTC 2020    Total Builds records without a jobset_id: 100446488
[...]
Fri Feb  7 16:05:41 UTC 2020    (batch #500; 95456488 remaining) Builds.jobset_id: affected 10000 rows...
Fri Feb  7 16:05:41 UTC 2020    (batch #500) Vacuuming...

which suggests this will hopefully finish in about 33 hours :).


grahamc commented Feb 7, 2020

The next 500 iterations / 5,000,000 rows took 3h10min:

Fri Feb  7 16:21:41 UTC 2020    (batch #501; 95446488 remaining) Builds.jobset_id: affected 10000 rows...
[...]
Fri Feb  7 19:34:53 UTC 2020    (batch #1000; 90456488 remaining) Builds.jobset_id: affected 10000 rows...
Fri Feb  7 19:34:53 UTC 2020    (batch #1000) Vacuuming...

The slowdown is a bit spooky to me. Let's see how it progresses.


grahamc commented Feb 8, 2020

Just reached vacuum #3, nearly 6 hours later. It seems strange that this is taking increasing amounts of time. Maybe using a ratcheting minimum ID to update would make this faster.


grahamc commented Feb 8, 2020

The SELECT inside the UPDATE was taking an increasing amount of time as the first few million rows became non-null. I've done two things:

  1. started tracking the highest updated ID, and starting each batch's scan at that ID (sketched below)
  2. added a second pass which does not use "FOR UPDATE SKIP LOCKED", to update rows that were locked and skipped on the first pass.
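The pass-1 change is, roughly, in the inner SELECT (a sketch; last_max_id is my name for the tracked high-water mark, not a literal identifier from the script):

  SELECT buildsprime.id
  FROM builds buildsprime
  WHERE buildsprime.id > :last_max_id   -- start past the already-backfilled prefix
    AND buildsprime.jobset_id IS NULL
  ORDER BY buildsprime.id
  FOR UPDATE SKIP LOCKED
  LIMIT 10000

Each batch now scans a roughly constant slice of the table instead of re-walking millions of already-backfilled rows to find the next NULLs.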

This new run started at:

Sat Feb  8 04:13:16 UTC 2020    (pass 1/2) Backfilling unlocked Builds records where jobset_id is NULL...
Sat Feb  8 04:17:00 UTC 2020    (pass 1/2) Total Builds records without a jobset_id: 84434378, starting at 27072567
Sat Feb  8 04:17:04 UTC 2020    (pass 1/2) (batch #1; 84434378 remaining) Builds.jobset_id: affected 10000 rows; max ID: 27072567 -> 27082566
Sat Feb  8 04:17:06 UTC 2020    (pass 1/2) (batch #2; 84424378 remaining) Builds.jobset_id: affected 10000 rows; max ID: 27082566 -> 27092566
Sat Feb  8 04:17:08 UTC 2020    (pass 1/2) (batch #3; 84414378 remaining) Builds.jobset_id: affected 10000 rows; max ID: 27092566 -> 27102566
...
Sat Feb  8 04:19:13 UTC 2020    (pass 1/2) (batch #50; 83944378 remaining) Builds.jobset_id: affected 10000 rows; max ID: 27562566 -> 27572566
...
Sat Feb  8 04:34:49 UTC 2020    (pass 1/2) (batch #500; 79444378 remaining) Builds.jobset_id: affected 10000 rows; max ID: 32075985 -> 32085985
Sat Feb  8 04:34:49 UTC 2020    (pass 1/2) (batch #500) Vacuuming...

with MUCH faster 10k increments (~2-8s).

Incredibly, this managed to do 500 iterations (5,000,000 rows) in less than 20 minutes. If this continues, it might be done before I wake up.


grahamc commented Feb 8, 2020

I guess the VACUUM is taking longer than I expected. But nonetheless:

Sat Feb  8 12:11:33 UTC 2020    (pass 1/2) (batch #5846; 25984378 remaining) Builds.jobset_id: affected 10000 rows; max ID: 85743196 -> 85753196

getting pretty close.


grahamc commented Feb 8, 2020

Sat Feb  8 16:33:47 UTC 2020    (pass 1/2) (batch #945; 0 remaining) Builds.jobset_id: affected 0 rows; max ID: 111736783 ->

Round 2, which covers locked rows, is running now ... then we can try out part 2 and see how the performance goes.


grahamc commented Feb 8, 2020

Finished about 30 minutes ago, including the final VACUUM. I'm thinking 24h is probably a good bet for how long it'll take in prod, though of course under actual production load it may be very different. At any rate, it can run while Hydra is up and serving, and disk space never got out of control. Main objectives achieved :).

Still don't want to merge this until after validating part 2.


grahamc commented Feb 8, 2020

Part two's migrations:

Jobs took about a second; Builds took just over 7 minutes, which is about what I expected given how long the backfiller took to identify the lowest record ID without a jobset_id.

Feb 08 17:31:06 database-restore-test hydra-init[11105]: upgrading Hydra schema from version 61 to 62
Feb 08 17:31:06 database-restore-test hydra-init[11105]: executing SQL statement: ALTER TABLE Jobs
Feb 08 17:31:06 database-restore-test hydra-init[11105]:   ALTER COLUMN jobset_id SET NOT NULL
Feb 08 17:31:07 database-restore-test hydra-init[11105]: upgrading Hydra schema from version 62 to 63
Feb 08 17:31:07 database-restore-test hydra-init[11105]: executing SQL statement: ALTER TABLE Builds
Feb 08 17:31:07 database-restore-test hydra-init[11105]:   ALTER COLUMN jobset_id SET NOT NULL
Feb 08 17:38:34 database-restore-test systemd[1]: Started hydra-init.service.


grahamc commented Feb 8, 2020

[root@database-restore-test:~]# time curl 127.0.0.1:3000/job/nixos/unstable-small/tested/latest-finished
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>Moved</title>
    </head>
    <body>
   <p>This item has moved <a href="http://127.0.0.1:3000/build/111728212">here</a>.</p>
</body>
</html>

real    0m10.515s
user    0m0.004s
sys     0m0.004s


grahamc commented Feb 9, 2020

Part 2 got us to a 10 second query:

Feb 08 22:48:57 database-restore-test 60qz354sidg729bj3xx22w2qv2bb59wq-unit-script-postgresql-start[74051]: 2020-02-08 22:48:57.518 GMT [1517] LOG:  duration: 9561.487 ms  execute dbdpg_p77498_5: SELECT me.id, me.finished, me.timestamp, me.project, me.jobset, me.jobset_id, me.job, me.nixname, me.description, me.drvpath, me.system, me.license, me.homepage, me.maintainers, me.maxsilent, me.timeout, me.ischannel, me.iscurrent, me.nixexprinput, me.nixexprpath, me.priority, me.globalpriority, me.starttime, me.stoptime, me.iscachedbuild, me.buildstatus, me.size, me.closuresize, me.releasename, me.keep, me.notificationpendingsince FROM builds me LEFT JOIN jobsetevalmembers jobsetevalmembers ON jobsetevalmembers.build = me.id WHERE ( ( not exists (select 1 from jobsetevalmembers m2 join builds b2 on jobsetevalmembers.eval = m2.eval and m2.build = b2.id and b2.finished = 0) AND me.buildstatus = $1 AND me.finished = $2 AND me.job = $3 AND me.jobset_id = $4 ) ) ORDER BY id DESC LIMIT $5
Feb 08 22:48:57 database-restore-test 60qz354sidg729bj3xx22w2qv2bb59wq-unit-script-postgresql-start[74051]: 2020-02-08 22:48:57.518 GMT [1517] DETAIL:  parameters: $1 = '0', $2 = '1', $3 = 'tested', $4 = '90', $5 = '1'
EXPLAIN ANALYZE SELECT
    me.id, me.finished, me.timestamp, me.project, me.jobset,
    me.jobset_id, me.job, me.nixname, me.description, me.drvpath,
    me.system, me.license, me.homepage, me.maintainers, me.maxsilent,
    me.timeout, me.ischannel, me.iscurrent, me.nixexprinput,
    me.nixexprpath, me.priority, me.globalpriority, me.starttime,
    me.stoptime, me.iscachedbuild, me.buildstatus, me.size,
    me.closuresize, me.releasename, me.keep,
    me.notificationpendingsince
FROM builds me
LEFT JOIN jobsetevalmembers jobsetevalmembers
  ON jobsetevalmembers.build = me.id
  WHERE (
      (
          not exists (
              select 1
              from jobsetevalmembers m2
              join builds b2
                on jobsetevalmembers.eval = m2.eval
                  and m2.build = b2.id
                  and b2.finished = 0
          )
          AND me.buildstatus = '0'
          AND me.finished = '1'
          AND me.job = 'tested'
          AND me.jobset_id = '90'
      )
  )
ORDER BY id DESC LIMIT 1;


                                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1001.29..323309.72 rows=1 width=414) (actual time=429.024..9694.885 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=1001.29..34165695.49 rows=106 width=414) (actual time=429.023..429.023 rows=1 loops=1)
         Filter: (NOT (SubPlan 1))
         ->  Gather Merge  (cost=1000.71..33986646.08 rows=19 width=414) (actual time=417.321..9683.174 rows=1 loops=1)
               Workers Planned: 8
               Workers Launched: 8
               ->  Parallel Index Scan Backward using builds_pkey on builds me  (cost=0.57..33985643.60 rows=2 width=414) (actual time=203.093..2290.133 rows=4 loops=9)
                     Filter: ((buildstatus = 0) AND (finished = 1) AND (job = 'tested'::text) AND (jobset_id = 90))
                     Rows Removed by Filter: 106750
         ->  Index Scan using indexjobsetevalmembersonbuild on jobsetevalmembers  (cost=0.58..4.76 rows=101 width=8) (actual time=0.044..0.045 rows=1 loops=1)
               Index Cond: (build = me.id)
         SubPlan 1
           ->  Merge Join  (cost=5.21..2998.59 rows=34 width=0) (actual time=11.638..11.639 rows=0 loops=1)
                 Merge Cond: (m2.build = b2.id)
                 ->  Index Only Scan using jobsetevalmembers_pkey on jobsetevalmembers m2  (cost=0.58..1325.87 rows=63780 width=4) (actual time=0.034..0.041 rows=48 loops=1)
                       Index Cond: (eval = jobsetevalmembers.eval)
                       Heap Fetches: 0
                 ->  Index Only Scan Backward using indexbuildsonfinishedid on builds b2  (cost=0.57..1397.12 rows=56420 width=4) (actual time=0.032..9.390 rows=27873 loops=1)
                       Index Cond: (finished = 0)
                       Heap Fetches: 1054
 Planning Time: 1.118 ms
 Execution Time: 9694.985 ms
(22 rows)

I didn't like this part:

         ->  Gather Merge  (cost=1000.71..33986646.08 rows=19 width=414) (actual time=417.321..9683.174 rows=1 loops=1)
               Workers Planned: 8
               Workers Launched: 8
               ->  Parallel Index Scan Backward using builds_pkey on builds me  (cost=0.57..33985643.60 rows=2 width=414) (actual time=203.093..2290.133 rows=4 loops=9)
                     Filter: ((buildstatus = 0) AND (finished = 1) AND (job = 'tested'::text) AND (jobset_id = 90))
                     Rows Removed by Filter: 106750

so I tried adding an index (it took 10 minutes to build):

create index IndexBuildsLatestFinished
    on Builds(id DESC, buildstatus, finished, job, jobset_id)
    where buildstatus = 0 and finished = 1;

and that turned the parallel gather merge into an index scan, and brought the whole query to ~250ms (or less):

                                                                                   QUERY PLAN                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.27..18380.75 rows=1 width=414) (actual time=21.087..21.090 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=1.27..1911466.77 rows=104 width=414) (actual time=21.086..21.086 rows=1 loops=1)
         Filter: (NOT (SubPlan 1))
         ->  Index Scan using indexbuildslatestfinished on builds me  (cost=0.69..1568962.29 rows=19 width=414) (actual time=3.355..3.355 rows=1 loops=1)
               Index Cond: ((job = 'tested'::text) AND (jobset_id = 90))
         ->  Index Scan using indexjobsetevalmembersonbuild on jobsetevalmembers  (cost=0.58..4.76 rows=101 width=8) (actual time=0.050..0.050 rows=1 loops=1)
               Index Cond: (build = me.id)
         SubPlan 1
           ->  Merge Join  (cost=12.33..5659.45 rows=34 width=0) (actual time=17.671..17.671 rows=0 loops=1)
                 Merge Cond: (m2.build = b2.id)
                 ->  Index Only Scan using jobsetevalmembers_pkey on jobsetevalmembers m2  (cost=0.58..1327.86 rows=63784 width=4) (actual time=0.059..0.064 rows=48 loops=1)
                       Index Cond: (eval = jobsetevalmembers.eval)
                       Heap Fetches: 0
                 ->  Index Only Scan Backward using indexbuildsonfinishedid on builds b2  (cost=0.57..4097.36 rows=57622 width=4) (actual time=0.091..16.078 rows=27873 loops=1)
                       Index Cond: (finished = 0)
                       Heap Fetches: 1054
 Planning Time: 3.976 ms
 Execution Time: 21.172 ms
(18 rows)


grahamc commented Feb 9, 2020

hydra=# SELECT pg_size_pretty(pg_relation_size('indexbuildsonjobfinishedid'));
 pg_size_pretty
----------------
 20 GB
(1 row)

hydra=# SELECT pg_size_pretty(pg_relation_size('indexbuildslatestfinished'));
 pg_size_pretty
----------------
 5696 MB
(1 row)

(thanks to @LnL7, who suggested this)

@grahamc grahamc mentioned this pull request Feb 9, 2020

grahamc commented Feb 9, 2020

I think this is ready, and I've opened up #714 for part 2.

@grahamc grahamc changed the title from "(wip) jobset_id, #710 Part 1" to "jobset_id, #710 Part 1" on Feb 9, 2020
A PostgreSQL column which is non-null and unique is treated with
the same optimisations as a primary key, so we have no need to
try to recreate the `id` as the primary key.

No read paths are impacted by this change, and the database will
automatically create an ID for each insert. Thus, no code needs to
change.

Also, adds an explicitly named "jobs" accessor to the Jobsets
Schema object, which uses the project/jobset name.

Also, adds an explicitly named "builds" accessor to the Jobsets
Schema object, which uses the project/jobset name.

Vacuum every 10 iterations, update 10k at a time.
@grahamc grahamc merged commit add4f61 into NixOS:master Feb 10, 2020
@grahamc grahamc deleted the jobset-id-pgsql-part-1 branch February 10, 2020 16:43

knl commented Feb 12, 2020

@grahamc This is a really great improvement, thanks!

As the upgrade requires manual steps (due to the backfiller), I realized that anyone upgrading their Hydra will be hit by a big surprise, since there is no document in the repo listing the necessary steps. They are only discoverable by rummaging through closed PRs, which is not the best way to do it.

So, what would you say to having proper, versioned releases, with a changelog documenting the changes and, more importantly, the breaking changes?

Ma27 added a commit to Ma27/nixpkgs that referenced this pull request Mar 28, 2020
Upgrades Hydra to the latest master/flake branch. To perform this
upgrade, a non-trivial DB migration is needed, which provides a
massive performance improvement[1].

The basic ideas behind multi-step upgrades of services between NixOS
versions have already been gathered[2]; for further context, it's
recommended to read that first.

Basically, the following steps are needed:

* Upgrade to a non-breaking version of Hydra with the db-changes
  (columns are still nullable here). If `system.stateVersion` is set to
  something older than 20.03, the package will be selected
  automatically, otherwise `pkgs.hydra-migration` needs to be used.

* Run `hydra-backfill-ids` on the server.

* Deploy either `pkgs.hydra-unstable` (for Hydra master) or
  `pkgs.hydra-flakes` (for flakes-support) to activate the optimization.

The steps are also documented in the release-notes and in the module
using `warnings`.

`pkgs.hydra` has been removed as the latest Hydra doesn't compile with
`pkgs.nixStable`, and to ensure a graceful migration using the newly
introduced packages.

To verify the approach, a simple VM test has been added covering
the migration steps.

[1] NixOS/hydra#711
[2] NixOS#82353 (comment)
Ma27 added a commit to NixOS/nixpkgs that referenced this pull request Mar 28, 2020, with the same commit message as above (cherry picked from commit bd5324c).
Emantor pushed a commit to Emantor/nixpkgs that referenced this pull request Mar 29, 2020, with the same commit message as above.
lopsided98 commented:

I'm seeing somewhat strange behavior while running hydra-backfill-ids. This is the beginning of the log:

(pass 1/2) Backfilling Jobs records where jobset_id is NULL...
(pass 1/2) Total Jobs records without a jobset_id: 11
(pass 1/2) (batch #1; 11 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #2; 8 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #3; 5 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #4; 2 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #5; -1 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #6; -4 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #7; -7 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #8; -10 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #9; -13 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #10; -16 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #11; -19 remaining) Jobs.jobset_id: affected 3 rows...
(pass 1/2) (batch #12; -22 remaining) Jobs.jobset_id: affected 3 rows...

The script claims that only 11 jobs need to be modified, so as batches keep running, the remaining count quickly goes negative. Also, the batches are only affecting 3 rows at a time.


grahamc commented Apr 17, 2020 via email

lopsided98 commented:

Yes


grahamc commented Apr 17, 2020 via email

lopsided98 commented:

No jobs are being added. The queue runner is stopped because I have been having some issues with my builders. There are 40 jobs in the queue right now.

I had the script running in the background for a few hours and it reached a remaining count below -3,000,000 before I started to suspect something wasn't working right. I don't know what the remaining count was the first time I ran the script, but it has stayed at 11 each time I have run it since.

Some queries:

hydra=> SELECT COUNT(*) FROM jobs WHERE jobset_id IS NULL;
 count 
-------
    11
(1 row)
hydra=> SELECT COUNT(*) FROM jobs;
 count 
-------
  9583
(1 row)
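One possible diagnostic for this situation (an addition for reference, not from the thread): rows whose jobset row no longer exists can never be backfilled, because the correlated subquery in the UPDATE finds nothing to assign. A quick check:

hydra=> SELECT COUNT(*) FROM jobs
        WHERE jobset_id IS NULL
          AND NOT EXISTS (SELECT 1 FROM jobsets
                          WHERE jobsets.name = jobs.jobset
                            AND jobsets.project = jobs.project);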

stigok pushed a commit to stigok/nixpkgs that referenced this pull request Jun 12, 2020, with the same commit message as above (cherry picked from commit bd5324c).

Taneb commented Nov 4, 2020

I've hit an error running hydra-backfill-ids to perform the migration:

# hydra-backfill-ids
Beginning with a VACUUM
WARNING:  skipping "pg_authid" --- only superuser can vacuum it
WARNING:  skipping "pg_subscription" --- only superuser can vacuum it
WARNING:  skipping "pg_database" --- only superuser can vacuum it
WARNING:  skipping "pg_db_role_setting" --- only superuser can vacuum it
WARNING:  skipping "pg_tablespace" --- only superuser can vacuum it
WARNING:  skipping "pg_pltemplate" --- only superuser can vacuum it
WARNING:  skipping "pg_auth_members" --- only superuser can vacuum it
WARNING:  skipping "pg_shdepend" --- only superuser can vacuum it
WARNING:  skipping "pg_shdescription" --- only superuser can vacuum it
WARNING:  skipping "pg_replication_origin" --- only superuser can vacuum it
WARNING:  skipping "pg_shseclabel" --- only superuser can vacuum it
(pass 1/2) Backfilling Jobs records where jobset_id is NULL...
(pass 1/2) Total Jobs records without a jobset_id: 10579
main::backfillJobsJobsetId(): DBI Exception: DBD::Pg::st execute failed: ERROR:  duplicate key value violates unique constraint "jobs_pkey"
DETAIL:  Key (project, jobset, name)=(myrtlepkgs, staging, synchrotron.model.nx.c4v120.wavenet.layer-0) already exists. [for Statement "UPDATE jobs
SET jobset_id = (
  SELECT jobsets.id
  FROM jobsets
  WHERE jobsets.name = jobs.jobset
    AND jobsets.project = jobs.project
)
WHERE (jobs.project, jobs.jobset, jobs.name) in (
  SELECT jobsprime.project, jobsprime.jobset, jobsprime.name
  FROM jobs jobsprime
  WHERE jobsprime.jobset_id IS NULL
  FOR UPDATE SKIP LOCKED
  LIMIT ?
);
" with ParamValues: 1='10000'] at /nix/store/6nc2hs3lx3m4wc1j6jnjg2vlsld7rb7c-hydra-2020-02-10/bin/.hydra-backfill-ids-wrapped line 160

I'm getting this error consistently.

EDIT: manually removing the duplicate rows fixed the issue.
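For reference, duplicate (project, jobset, name) rows like the one in that error can be located with a grouping query (a sketch, assuming psql access to the hydra database; not from the thread itself):

hydra=# SELECT project, jobset, name, COUNT(*)
        FROM jobs
        GROUP BY project, jobset, name
        HAVING COUNT(*) > 1;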
