Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to search builds by hash #654

Merged
merged 2 commits into from Jun 13, 2019
Merged

Conversation

nlewo
Copy link
Member

@nlewo nlewo commented Jun 5, 2019

Currently, a full store path has to be provided to search in
builds. This patch permits to search jobs with a output path or
derivation hash.

Usecase: we are building Docker images with Hydra. The tag of the
Docker image is the hash of the image output path. This patch would
allow us to find back the build job from the tag of a running
container image.

Currently, a full store path has to be provided to search in
builds. This patch permits to search jobs with a output path or
derivation hash.

Usecase: we are building Docker images with Hydra. The tag of the
Docker image is the hash of the image output path. This patch would
allow us to find back the build job from the tag of a running
container image.
@nlewo
Copy link
Member Author

nlewo commented Jun 5, 2019

Note I really don't know what could be the performance impact on search queries on hydra.nixos.org... Of course, on my Hydra instances, it's negligible.

@edolstra
Copy link
Member

edolstra commented Jun 5, 2019

Yeah it seems that this could require a sequential scan on the Builds table. Maybe you can do an explain analyze to see what the query is doing?

@nlewo
Copy link
Member Author

nlewo commented Jun 5, 2019

I confirm using a regex to search by drvPath requires a seq scan of the builds table while it is not the case with =. Regarding the search by path, it seems to me it already requires a seq scan of the builds table.

postgres=# explain analyze select * from builds,buildoutputs where buildoutputs.path = '/nix/store/3q84b8jzsnckipx5gxqzd6wnvzyn0wml-trivial';
 Nested Loop  (cost=0.00..39.38 rows=600 width=560) (actual time=0.137..0.192 rows=1 loops=1)
   ->  Seq Scan on builds  (cost=0.00..11.50 rows=150 width=492) (actual time=0.013..0.021 rows=1 loops=1)
   ->  Materialize  (cost=0.00..20.39 rows=4 width=68) (actual time=0.031..0.054 rows=1 loops=1)
         ->  Seq Scan on buildoutputs  (cost=0.00..20.38 rows=4 width=68) (actual time=0.013..0.021 rows=1 loops=1)
               Filter: (path = '/nix/store/3q84b8jzsnckipx5gxqzd6wnvzyn0wml-trivial'::text)

@nlewo
Copy link
Member Author

nlewo commented Jun 5, 2019

@edolstra We could create indexes supporting like operator with the module https://www.postgresql.org/docs/9.3/pgtrgm.html. I locally tried it and I get the following:

postgres=# explain analyze select * from builds where drvpath like '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on builds  (cost=124.00..128.01 rows=1 width=492) (actual time=0.065..0.084 rows=1 loops=1)
   Recheck Cond: (drvpath ~~ '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%'::text)
   ->  Bitmap Index Scan on trgm_idx_users_username  (cost=0.00..124.00 rows=1 width=0) (actual time=0.038..0.038 rows=1 loops=1)
         Index Cond: (drvpath ~~ '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%'::text)
 Total runtime: 0.146 ms
(5 rows)

What do you think?

The search query uses the LIKE operator which requires a sequential
scan (it can't use the already existing B-tree index). This new
index (trigram) avoids a sequential scan of the builds table when the
LIKE operator is used.

Here is the analyze of a request on the builds table with this index:

     explain analyze select * from builds where drvpath like '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%';
                                                                 QUERY PLAN
     -----------------------------------------------------------------------------------------------------------------------------------
      Bitmap Heap Scan on builds  (cost=128.00..132.01 rows=1 width=492) (actual time=0.070..0.077 rows=1 loops=1)
        Recheck Cond: (drvpath ~~ '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%'::text)
        ->  Bitmap Index Scan on indextrgmbuildsondrvpath  (cost=0.00..128.00 rows=1 width=0) (actual time=0.047..0.047 rows=3 loops=1)
              Index Cond: (drvpath ~~ '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%'::text)
      Total runtime: 0.206 ms
     (5 rows)
@nlewo
Copy link
Member Author

nlewo commented Jun 6, 2019

I added the creation of this index in the commit 7935cff.

@edolstra
Copy link
Member

How much disk space does the trigram index need?

@nlewo
Copy link
Member Author

nlewo commented Jun 12, 2019

Hm, I didn't find clear doc about the disk usage of this kind of
index. So, I did some small experiments:

In the following, I ran Postgresql 11.3 on my laptop (4 cores, 16G RAM).
I inserted 10 000 000 rows containing /nix/store/<UUID> using the following script:

CREATE TABLE Builds (
    drvPath       text not null
);
INSERT INTO builds (drvpath)
  SELECT ('/nix/store/' || md5(random()::text))
  FROM generate_series(1, 10000000);

The size of /var/lib/postgresql/data is 1.8G. The query

select * from builds where drvpath LIKE '%ded6%';

took about 1.5s to execute.

Then I created the pg_trgm index. The creation of the index took
2m30s. The above select query then took 53ms.
The size of /var/lib/postgresql/data is 2.5G after index creation.

For a table containing 100 000 000 rows, results are

  • initial /var/lib/postgresql/data size: 8.2G
  • /var/lib/postgresql/data size with index: 15G
  • time to create the index: 25m
  • query time without index: 56s
  • query time with index: 4s (400ms when the query is repeated)

I'm not sure what to conclude... It seems it consumes something close
to the size of the data it has to index (at least in the case I
considered).
How many rows does the Builds table of hydra.nixos.org contain?

@edolstra
Copy link
Member

It has ~90 million rows. You can use pg_indexes_size to get the size of the indexes for a table. For hydra.nixos.org this is already 249 GiB, hence my worry about adding another potentially large index:

hydra=> select pg_indexes_size('builds');
 pg_indexes_size 
-----------------
    267906842624
(1 row)

@nlewo
Copy link
Member Author

nlewo commented Jun 13, 2019

For a table containing 100 million rows (with one column containing /nix/store/<UUID>), the index uses 6.5 GiB:

postgres=# select * from builds limit 2;
                   drvpath                   
---------------------------------------------
 /nix/store/b45c80696fc5833abb53b16d03a50566
 /nix/store/1982122195e776931946fa78865f8805
(2 rows)

postgres=# select count(*) from builds;
   count   
-----------
 100000000
(1 row)

postgres=# select pg_indexes_size('builds');
 pg_indexes_size 
-----------------
      6968713216
(1 row)

@edolstra edolstra merged commit 2b4658b into NixOS:master Jun 13, 2019
@edolstra
Copy link
Member

Thanks, that sounds like a still acceptable increase.

@nlewo
Copy link
Member Author

nlewo commented Jun 13, 2019

Thx!

@grahamc
Copy link
Member

grahamc commented Dec 24, 2021

@nlewo when searching like this are you searching for partial hashes or full hashes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants