Allow to search builds by hash #654

nlewo · 2019-06-05T10:53:03Z

Currently, a full store path has to be provided to search
builds. This patch allows searching for jobs by output path hash or
derivation hash.

Use case: we are building Docker images with Hydra. The tag of the
Docker image is the hash of the image output path. This patch would
allow us to find the build job from the tag of a running
container image.

nlewo · 2019-06-05T10:57:32Z

Note that I really don't know what the performance impact on search queries on hydra.nixos.org would be. Of course, on my Hydra instances, it's negligible.

edolstra · 2019-06-05T11:37:43Z

Yeah, it seems that this could require a sequential scan on the Builds table. Maybe you can run EXPLAIN ANALYZE to see what the query is doing?

nlewo · 2019-06-05T14:17:56Z

I confirm that using a regex to search by drvPath requires a sequential scan of the builds table, which is not the case with =. As for the search by output path, it seems to me that it already requires a sequential scan of the builds table:

postgres=# explain analyze select * from builds,buildoutputs where buildoutputs.path = '/nix/store/3q84b8jzsnckipx5gxqzd6wnvzyn0wml-trivial';
 Nested Loop  (cost=0.00..39.38 rows=600 width=560) (actual time=0.137..0.192 rows=1 loops=1)
   ->  Seq Scan on builds  (cost=0.00..11.50 rows=150 width=492) (actual time=0.013..0.021 rows=1 loops=1)
   ->  Materialize  (cost=0.00..20.39 rows=4 width=68) (actual time=0.031..0.054 rows=1 loops=1)
         ->  Seq Scan on buildoutputs  (cost=0.00..20.38 rows=4 width=68) (actual time=0.013..0.021 rows=1 loops=1)
               Filter: (path = '/nix/store/3q84b8jzsnckipx5gxqzd6wnvzyn0wml-trivial'::text)

nlewo · 2019-06-05T16:06:16Z

@edolstra We could create indexes that support the LIKE operator using the pg_trgm module (https://www.postgresql.org/docs/9.3/pgtrgm.html). I tried it locally and got the following:

postgres=# explain analyze select * from builds where drvpath like '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on builds  (cost=124.00..128.01 rows=1 width=492) (actual time=0.065..0.084 rows=1 loops=1)
   Recheck Cond: (drvpath ~~ '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%'::text)
   ->  Bitmap Index Scan on trgm_idx_users_username  (cost=0.00..124.00 rows=1 width=0) (actual time=0.038..0.038 rows=1 loops=1)
         Index Cond: (drvpath ~~ '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%'::text)
 Total runtime: 0.146 ms
(5 rows)

What do you think?
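
For reference, a minimal sketch of how such an index could be created. The index name builds_drvpath_trgm_idx is hypothetical, and the choice of a GIN index rather than a GiST one is an assumption; this is not necessarily the exact statement used in the patch.

```sql
-- pg_trgm ships with PostgreSQL's contrib modules; enabling it usually
-- requires elevated privileges on the database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram index on drvpath so that LIKE '%...%' patterns can use an index.
-- The index name is hypothetical. GIN is assumed here; a GiST index with
-- gist_trgm_ops is the other option pg_trgm provides.
CREATE INDEX builds_drvpath_trgm_idx ON builds USING gin (drvpath gin_trgm_ops);

-- Check whether the planner now uses the index instead of a sequential scan.
EXPLAIN ANALYZE
SELECT * FROM builds WHERE drvpath LIKE '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%';
```

Since the index works on three-character sequences, very short search strings are unlikely to benefit from it.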

The search query uses the LIKE operator, which requires a sequential scan (it can't use the existing B-tree index). This new trigram index avoids a sequential scan of the builds table when the LIKE operator is used.

Here is the EXPLAIN ANALYZE output for a query on the builds table with this index:

explain analyze select * from builds where drvpath like '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%';
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on builds  (cost=128.00..132.01 rows=1 width=492) (actual time=0.070..0.077 rows=1 loops=1)
   Recheck Cond: (drvpath ~~ '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%'::text)
   ->  Bitmap Index Scan on indextrgmbuildsondrvpath  (cost=0.00..128.00 rows=1 width=0) (actual time=0.047..0.047 rows=3 loops=1)
         Index Cond: (drvpath ~~ '%k3r71gz0gv16ld8rhcp2bb8gb5w1xc4b%'::text)
 Total runtime: 0.206 ms
(5 rows)

nlewo · 2019-06-06T12:40:01Z

I added the creation of this index in commit 7935cff.

edolstra · 2019-06-12T07:36:58Z

How much disk space does the trigram index need?

nlewo · 2019-06-12T13:18:34Z

Hm, I didn't find clear documentation about the disk usage of this kind of
index, so I did some small experiments.

For the following, I ran PostgreSQL 11.3 on my laptop (4 cores, 16G RAM).
I inserted 10 000 000 rows containing /nix/store/<UUID> using the following script:

CREATE TABLE Builds (
    drvPath       text not null
);
INSERT INTO builds (drvpath)
  SELECT ('/nix/store/' || md5(random()::text))
  FROM generate_series(1, 10000000);

The size of /var/lib/postgresql/data is 1.8G. The query

select * from builds where drvpath LIKE '%ded6%';

took about 1.5s to execute.

Then I created the pg_trgm index. The creation of the index took
2m30s. The above select query then took 53ms.
The size of /var/lib/postgresql/data is 2.5G after index creation.

For a table containing 100 000 000 rows, the results are:

initial /var/lib/postgresql/data size: 8.2G
/var/lib/postgresql/data size with index: 15G
time to create the index: 25m
query time without index: 56s
query time with index: 4s (400ms when the query is repeated)

I'm not sure what to conclude... It seems it consumes something close
to the size of the data it has to index (at least in the case I
considered).
How many rows does the Builds table of hydra.nixos.org contain?

edolstra · 2019-06-12T14:55:27Z

It has ~90 million rows. You can use pg_indexes_size to get the size of the indexes for a table. For hydra.nixos.org this is already 249 GiB, hence my worry about adding another potentially large index:

hydra=> select pg_indexes_size('builds');
 pg_indexes_size 
-----------------
    267906842624
(1 row)
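
A possible way to make these numbers easier to read, and to measure a single index rather than all indexes of the table. pg_size_pretty and pg_relation_size are standard PostgreSQL functions; the index name builds_drvpath_trgm_idx is hypothetical.

```sql
-- Total size of all indexes on builds, in human-readable units.
SELECT pg_size_pretty(pg_indexes_size('builds'));

-- Size of one specific index (hypothetical name; substitute the real one).
SELECT pg_size_pretty(pg_relation_size('builds_drvpath_trgm_idx'));
```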

nlewo · 2019-06-13T08:44:05Z

For a table containing 100 million rows (with one column containing /nix/store/<UUID>), the index uses 6.5 GiB:

postgres=# select * from builds limit 2;
                   drvpath                   
---------------------------------------------
 /nix/store/b45c80696fc5833abb53b16d03a50566
 /nix/store/1982122195e776931946fa78865f8805
(2 rows)

postgres=# select count(*) from builds;
   count   
-----------
 100000000
(1 row)

postgres=# select pg_indexes_size('builds');
 pg_indexes_size 
-----------------
      6968713216
(1 row)

edolstra · 2019-06-13T12:37:03Z

Thanks, that still sounds like an acceptable increase.

nlewo · 2019-06-13T13:06:32Z

Thx!

grahamc · 2021-12-24T16:51:24Z

@nlewo when searching like this are you searching for partial hashes or full hashes?

edolstra merged commit 2b4658b into NixOS:master Jun 13, 2019

ius mentioned this pull request Jan 11, 2022

search feature often times out and results in error 500 #901

Open
