Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BuildOutputs: index path with HASH #846

Merged
merged 1 commit into from Jan 18, 2021

Conversation

grahamc
Copy link
Member

@grahamc grahamc commented Jan 18, 2021

Looking at AWS' Performance Insights for a Hydra instance, I found
the hydra-queue-runner's query:

select id, buildStatus, releaseName, closureSize, size
from Builds b
join BuildOutputs o on b.id = o.build
where
  finished = ?
  and (buildStatus = ? or buildStatus = ?)
  and path = $1

was the slowest query by at least 10x. Running an explain on this
showed why:

hydra=> explain select id, buildStatus, releaseName, closureSize, size
from Builds b join BuildOutputs o on b.id = o.build where
finished = 1 and (buildStatus = 0 or buildStatus = 6) and
path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1';

                                                 QUERY PLAN
------------------------------------------------------------------------
 Gather  (cost=1000.43..33718.98 rows=2 width=56)
   Workers Planned: 2
   ->  Nested Loop  (cost=0.43..32718.78 rows=1 width=56)
         ->  Parallel Seq Scan on buildoutputs o  (cost=0.00..32710.32
                                                   rows=1
                                                    width=4)
               Filter: (path = '/nix/store/s93kh...snip...'::text)
         ->  Index Scan using indexbuildsonjobsetidfinishedid on builds b
                                        (cost=0.43..8.45 rows=1 width=56)
               Index Cond: ((id = o.build) AND (finished = 1))
               Filter: ((buildstatus = 0) OR (buildstatus = 6))
(8 rows)

A parallel sequential scan is definitely better than a sequential scan, but the
cost ranging from 0 to 32710 is not great. Looking at the table, I saw the path
column is completely unindexed:

hydra=> \d buildoutputs
            Table "public.buildoutputs"
Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
build  | integer |           | not null |
name   | text    |           | not null |
path   | text    |           | not null |
Indexes:
    "buildoutputs_pkey" PRIMARY KEY, btree (build, name)
Foreign-key constraints:
    "buildoutputs_build_fkey" FOREIGN KEY (build) REFERENCES builds(id)
        ON DELETE CASCADE

Since we always do exact matches on the path and don't care about ordering,
and since the path column is very high cardinality a hash index is a
good candidate. Note that I did test a btree index and it performed
similarly well, but slightly worse.

After creating the index (this took about 10 seconds) on a test database:

create index IndexBuildOutputsPath on BuildOutputs using hash(path);

We get a significantly reduced cost:

hydra=> explain select id, buildStatus, releaseName, closureSize, size
hydra->     from Builds b join BuildOutputs o on b.id = o.build where
hydra->     finished = 1 and (buildStatus = 0 or buildStatus = 6) and
hydra->     path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1';
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------
Nested Loop  (cost=0.43..41.41 rows=2 width=56)
->  Index Scan using buildoutputs_path_hash on buildoutputs o  (cost=0.00..16.05 rows=3 width=4)
        Index Cond: (path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1'::text)
->  Index Scan using indexbuildsonjobsetidfinishedid on builds b  (cost=0.43..8.45 rows=1 width=56)
        Index Cond: ((id = o.build) AND (finished = 1))
        Filter: ((buildstatus = 0) OR (buildstatus = 6))
(6 rows)

For direct comparison, the overall query plan was changed:

From: Gather      (cost=1000.43..33718.98 rows=2 width=56)
To:   Nested Loop (cost=   0.43.....41.41 rows=2 width=56)

and the query plan for buildoutputs changed from a maximum cost of
32,710 down to 16.

In practical terms, the query's planning and execution time was reduced:

    Before (ms) | Try 1   | Try 2   | Try 3
    ------------+---------+---------+--------
    Planning    |   0.898 |   0.416 |   0.383
    Execution   | 138.644 | 172.331 | 375.585

    After (ms)  | Try 1   | Try 2   | Try 3
    ------------+---------+---------+--------
    Planning    |   0.298 |   0.290 |   0.296
    Execution   | 219.625 |   0.035 |   0.034

In the production hydra instance I tested this on I found the average ms/call went from 264ms to 0.06ms/call. This query is used heavily by the queue runner to determine if the build job's output has been cached by hydra already.

Looking at AWS' Performance Insights for a Hydra instance, I found
the hydra-queue-runner's query:

    select id, buildStatus, releaseName, closureSize, size
    from Builds b
    join BuildOutputs o on b.id = o.build
    where
      finished = ?
      and (buildStatus = ? or buildStatus = ?)
      and path = $1

was the slowest query by at least 10x. Running an explain on this
showed why:

hydra=> explain select id, buildStatus, releaseName, closureSize, size
    from Builds b join BuildOutputs o on b.id = o.build where
    finished = 1 and (buildStatus = 0 or buildStatus = 6) and
    path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1';

                                                     QUERY PLAN
    ------------------------------------------------------------------------
     Gather  (cost=1000.43..33718.98 rows=2 width=56)
       Workers Planned: 2
       ->  Nested Loop  (cost=0.43..32718.78 rows=1 width=56)
             ->  Parallel Seq Scan on buildoutputs o  (cost=0.00..32710.32
                                                       rows=1
                                                        width=4)
                   Filter: (path = '/nix/store/s93kh...snip...'::text)
             ->  Index Scan using indexbuildsonjobsetidfinishedid on builds b
                                            (cost=0.43..8.45 rows=1 width=56)
                   Index Cond: ((id = o.build) AND (finished = 1))
                   Filter: ((buildstatus = 0) OR (buildstatus = 6))
    (8 rows)

A paralell sequential scan is definitely better than a sequential scan, but the
cost ranging from 0 to 32710 is not great. Looking at the table, I saw the `path`
column is completely unindex:

    hydra=> \d buildoutputs
                Table "public.buildoutputs"
    Column |  Type   | Collation | Nullable | Default
    --------+---------+-----------+----------+---------
    build  | integer |           | not null |
    name   | text    |           | not null |
    path   | text    |           | not null |
    Indexes:
        "buildoutputs_pkey" PRIMARY KEY, btree (build, name)
    Foreign-key constraints:
        "buildoutputs_build_fkey" FOREIGN KEY (build) REFERENCES builds(id)
            ON DELETE CASCADE

Since we always do exact matches on the path and don't care about ordering,
and since the path column is very high cardinality a `hash` index is a
good candidate. Note that I did test a btree index and it performed
similarly well, but slightly worse.

After creating the index (this took about 10 seconds) on a test database:

    create index IndexBuildOutputsPath on BuildOutputs using hash(path);

We get a *significantly* reduced cost:

    hydra=> explain select id, buildStatus, releaseName, closureSize, size
    hydra->     from Builds b join BuildOutputs o on b.id = o.build where
    hydra->     finished = 1 and (buildStatus = 0 or buildStatus = 6) and
    hydra->     path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1';
                                                QUERY PLAN
    -------------------------------------------------------------------------------------------------------
    Nested Loop  (cost=0.43..41.41 rows=2 width=56)
    ->  Index Scan using buildoutputs_path_hash on buildoutputs o  (cost=0.00..16.05 rows=3 width=4)
            Index Cond: (path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1'::text)
    ->  Index Scan using indexbuildsonjobsetidfinishedid on builds b  (cost=0.43..8.45 rows=1 width=56)
            Index Cond: ((id = o.build) AND (finished = 1))
            Filter: ((buildstatus = 0) OR (buildstatus = 6))
    (6 rows)

For direct comparison, the overall query plan was changed:

    From: Gather      (cost=1000.43..33718.98 rows=2 width=56)
    To:   Nested Loop (cost=   0.43.....41.41 rows=2 width=56)

and the query plan for buildoutputs changed from a maximum cost of
32,710 down to 16.

In practical terms, the query's planning and execution time was reduced:

Before (ms) | Try 1   | Try 2   | Try 3
------------+---------+---------+--------
Planning    |   0.898 |   0.416 |   0.383
Execution   | 138.644 | 172.331 | 375.585

After (ms)  | Try 1   | Try 2   | Try 3
------------+---------+---------+--------
Planning    |   0.298 |   0.290 |   0.296
Execution   | 219.625 |   0.035 |   0.034
Copy link
Member

@Ericson2314 Ericson2314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hydra perf sure felt like there was some super low hanging fruit left. Thanks for finding!

@edolstra edolstra merged commit 6bb876c into NixOS:master Jan 18, 2021
@grahamc grahamc deleted the buildoutputs-index-hash-path branch January 18, 2021 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants