
nixos/prometheus: support v2 #56037

Closed · wants to merge 2 commits

Conversation

eonpatapon
Contributor

Motivation for this change

Be able to use prometheus v2 with the prometheus nixos module.

Since the alertmanager is configured differently between v1 and v2, there are separate options for v1 and v2. Assertions were added to check that the correct option is used depending on the Prometheus version.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

 * Added `alerting` option for prometheus v2
 * Added test for prometheus v2
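
As a rough illustration of the difference the PR describes: in the v1 module the alertmanager is addressed through the existing `alertmanagerURL` option, while the new `alerting` option would carry the v2 configuration. The exact shape of `alerting` below mirrors Prometheus 2's YAML config format and is an assumption, not code taken verbatim from this PR:

```nix
# Hedged sketch: `alertmanagerURL` is the existing v1 module option; the
# `alerting` structure is an assumption based on Prometheus 2's config format.
{
  # v1: the alertmanager is passed to prometheus via a command-line flag
  services.prometheus = {
    enable = true;
    alertmanagerURL = [ "http://localhost:9093" ];
  };

  # v2 (this PR): the same target would instead be declared through the new
  # `alerting` option, which lands in the configuration file rather than on
  # the command line, e.g.:
  #
  #   services.prometheus.alerting.alertmanagers = [
  #     { static_configs = [ { targets = [ "localhost:9093" ]; } ]; }
  #   ];
}
```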
@eonpatapon
Contributor Author

ping @fpletz @Ma27 @andir as you contributed to this module

@andir
Member

andir commented Feb 19, 2019

I do not fully recall the last conversation but shouldn't a user be able to run v1 and v2 (of prometheus, not really alertmanager) in order to still have access to historical data for some amount of time?

@bachp
Member

bachp commented Feb 19, 2019

There is already #49802 but it was never completed.

I think the main reason it was done as a separate service there is that there is no migration path from Prometheus 1 to 2.

@Ma27
Member

Ma27 commented Feb 19, 2019

In case it helps, there's at least a draft for v2 in the mayflower fork: https://github.com/mayflower/nixpkgs/blob/master/nixos/modules/services/monitoring/prometheus/2.nix

I can do a more detailed review tomorrow, but I'm fairly certain that @fpletz and @andir are way more qualified here, so before considering a merge I'd like to have an ack from them :)

@fpletz
Member

fpletz commented Feb 19, 2019

Since the 1.x branch doesn't seem to be maintained upstream anymore and isn't available for download via the official website, I'd say we remove 1.x support completely instead of making our module even more complex than it already is. People will lose their data. There is just no way around it.

@eonpatapon
Contributor Author

eonpatapon commented Feb 20, 2019

Since there is no easy migration path, my idea was to not break existing setups for this first step but still allow users to set up Prometheus v2 with the NixOS module. We could also add deprecation messages to indicate that support for v1 will be dropped. In a second step we can remove it.

@bachp
Member

bachp commented Feb 20, 2019

@fpletz I would be fine with completely removing v1.

@griff
Contributor

griff commented Feb 20, 2019

I just stumbled across this PR while looking for prometheus 2 support and thought I would chime in with my personal preferences as a user of NixOS for breaking upgrades like this.

I would prefer that Prometheus 2 support get its own module that can run side by side with existing Prometheus 1 setups, and that a warning be added to the old module saying it is deprecated and will be removed in the future. At a later date, say a year from now, we then remove the old module, leaving only the new one.

I prefer this because it gives me fair warning that I need to take action, but also gives me several options for what action to take. It allows me to just nuke the old data, like @fpletz prefers. It allows me to keep running the old version with only historic data for the deprecation period, alongside the new version with recent data. Or it allows me to copy the deprecated module and take on supporting it myself.

The one thing I never want though is for me to just lose data on an OS upgrade with no warning.

@bachp
Member

bachp commented Feb 20, 2019

@griff I can understand your point. So maybe the best approach is a copy-paste one, where we copy the prometheus module as prometheus2. This way it can be improved independently of the prometheus module, which can be left as is.

@griff
Contributor

griff commented Feb 21, 2019

@bachp I think that would be the best approach.

@azazel75
Contributor

Since the 1.x branch doesn't seem to be maintained upstream anymore and isn't available for download via the official website I'd say we remove 1.x support completely instead of making our module even more complex than it already is. People will lose their data. There is just no way around it.

My patch started as a complete removal of the Prometheus 1 code because I thought there was an upgrade path from version 1 to 2, but there isn't any (at least for the storage). I then changed the code to run the two in parallel, also convinced by the fact that the only partial upgrade path suggested by the Prometheus developers is to run a non-scraping 1.x instance to keep historical data and share it with the 2.x instance.

Unfortunately I don't have much time to spend on it now, but if you want to drop the Prometheus 1 configuration, a Prometheus-2-only patch is much smaller, and I have it running on my server. From a distribution point of view, while I don't use Prometheus 1 myself, having one release with the 1.x configuration only and the next with the 2.x only feels like the wrong thing to do.

@bachp
Member

bachp commented Feb 21, 2019

From a distribution point of view, while I don't use Prometheus 1 myself, having one release with the 1.x configuration only and the next with the 2.x only feels like the wrong thing to do.

@azazel75 Sorry, I don't understand what you want to say with that. Would having two modules prometheus and prometheus2 be an acceptable way?

@eonpatapon
Contributor Author

I also agree that we shouldn't drop prometheus 1.x out of the blue. Existing configuration should continue to work.

My proposal was to have a working module for v1 or v2, but not both at the same time. Then v1 support could be dropped after some deprecation period. This is enough for me, but it looks like some people want to be able to run v1 and v2 in parallel to keep the old history, or to be able to scrape it from the new instance.

In that case I don't think there is any other choice than to provide two different modules, with different state paths etc., like it was done in the other PR.

Whichever approach we agree on, I can work on it.

@azazel75
Contributor

azazel75 commented Feb 21, 2019 via email

azazel75 added a commit to azazel75/nixpkgs that referenced this pull request Feb 21, 2019
As the configuration for the exporters and alertmanager is unchanged between the two major versions, this patch tries to minimize duplication. At the same time, since there is no upgrade path from 1.x to 2.x, it allows running the two services in parallel. See also NixOS#56037
@azazel75
Contributor

I've completed the rebasing, in case someone wants to test it out.

Member

@Ma27 Ma27 left a comment


Just had a short look at the code. As I'm not a prometheus maintainer, I'm not sure how we shall proceed.

First of all I'm not sure if and when we should deprecate Prometheus 1.x support. The docs recommend hosting a legacy Prometheus instance to keep serving historical data. However, the last release was in 2017 and I'm not sure how long we want to keep this in nixpkgs and when to start the deprecation phase.

An idea would be to split this into two modules, deprecate v1 support for 19.03 and replace the v1 module with a minimalistic setup which only serves data, but doesn't scrape anymore on unstable after the release.

But as I said, I'm not a prometheus maintainer and I don't have a strong opinion about this, the above is just an idea :)

- name: test
  rules:
  - record: testrule
    expr: count(up{job="prometheus"})
Member


It's IMHO more readable when you do something like this:

{
  rules = [
    (builtins.toJSON {
      groups = [
        { name = "test";
          # and so on
        }
      ];
    })
  ];
}

I'm also wondering if this could be implemented directly in the module (just as an attribute set, not as a submodule)
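
Declaring the rules directly in the module as an attribute set, as suggested above, might look roughly like this. The option name `ruleGroups` is hypothetical, as is the serialization step; the idea works because Prometheus 2 rule files are YAML, and JSON is valid YAML:

```nix
# Hypothetical sketch: `ruleGroups` is an illustrative option name, not one
# defined by this PR. The module could serialize it with
# `builtins.toJSON { groups = cfg.ruleGroups; }` to produce a valid rule file.
{
  services.prometheus.ruleGroups = [
    {
      name = "test";
      rules = [
        {
          record = "testrule";
          expr = ''count(up{job="prometheus"})'';
        }
      ];
    }
  ];
}
```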

"-alertmanager.timeout=${toString cfg.alertmanagerTimeout}s"
(optionalString (cfg.alertmanagerURL != []) "-alertmanager.url=${concatStringsSep "," cfg.alertmanagerURL}")
(optionalString (cfg.webExternalUrl != null) "-web.external-url=${cfg.webExternalUrl}")
];
Member


This can probably be simplified by not duplicating options that are used on both v1 and v2:

{
  cmdlineArgs = cfg.extraFlags ++ [
    # ...
  ] ++ (if v1 then [ /* ... */ ] else [ /* ... */ ]);
}

It's probably just personal preference, but IMHO the expression currently looks heavier than it actually is.

@@ -0,0 +1,35 @@
import ./make-test.nix {
name = "prometheus-2";
Member


Any reason you didn't set meta.maintainers here?

@eonpatapon
Contributor Author

To move forward there must be a consensus on the way to go: this PR allows running v1 or v2 (not both), while PR #49802 allows running v1 and v2 side by side. I don't know how this kind of choice is usually made in nixpkgs.

@bachp
Member

bachp commented Mar 2, 2019

@eonpatapon I think we should allow running both in parallel, as one of the main points of keeping both around is to let users implement what is recommended in the Prometheus migration guide.

To retain access to your historic monitoring data we recommend you run a non-scraping Prometheus instance running at least version 1.8.1 in parallel with your Prometheus 2.0 instance [...]

If this is not possible we don't need to spend the effort and can directly replace 1 with 2.
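
The side-by-side setup recommended by the migration guide could be sketched as below. The `prometheus2` module name follows the naming proposed earlier in this thread; all option names are assumptions for illustration, not the final module interface:

```nix
# Hedged sketch of running both versions in parallel, as discussed above.
# Module and option names are illustrative assumptions.
{
  # legacy v1 instance: keeps serving historic data, no longer scrapes
  services.prometheus = {
    enable = true;
    listenAddress = "0.0.0.0:9090";
    scrapeConfigs = [ ];   # non-scraping, per the migration guide
  };

  # new v2 instance on its own port, with its own state directory
  services.prometheus2 = {
    enable = true;
    listenAddress = "0.0.0.0:9091";
    scrapeConfigs = [ /* new scrape targets go here */ ];
  };
}
```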

@eonpatapon
Contributor Author

Ok, I'm closing this in favour of #49802 then

@eonpatapon eonpatapon closed this Mar 3, 2019
jbgi pushed a commit to jbgi/nixpkgs that referenced this pull request Mar 25, 2019
shmish111 pushed a commit to shmish111/nixpkgs that referenced this pull request Apr 8, 2019
As the configuration for the exporters and alertmanager is unchanged
between the two major versions this patch tries to minimize
duplication while at the same time as there's no upgrade path from 1.x
to 2.x, it allows running the two services in parallel. See also NixOS#56037

Make it pass a minimal test

Fix alertmanager service definition. Thanks to @eonpatapon

Roll back versioning of services.prometheus.{exporters, alertmanager}.

Prometheus2: --web.external-url needs two dashes.

Use same user for both prometheus 1 and 2. Use StateDirectory.

nixos/prometheus: use ExecStart instead of a shell script

This uses fewer lines of code and one less process.

nixos/prometheus: get rid of empty arguments

Previously the prometheus.service file looked like:

  ExecStart=/nix/store/wjkhfw3xgkmavz1akkqir99w4lbqhak7-prometheus-1.8.2-bin/bin/prometheus -storage.local.path=/var/lib/prometheus/metrics \
    -config.file=/nix/store/zsnvzw51mk3n1cxjd0351bj39k1j6j27-prometheus.yml-check-config-checked \
    -web.listen-address=0.0.0.0:9090 \
    -alertmanager.notification-queue-capacity=10000 \
    -alertmanager.timeout=10s \
     \

  Restart=always

Now it's:

  ExecStart=/nix/store/wjkhfw3xgkmavz1akkqir99w4lbqhak7-prometheus-1.8.2-bin/bin/prometheus \
    -storage.local.path=/var/lib/prometheus/metrics \
    -config.file=/nix/store/zsnvzw51mk3n1cxjd0351bj39k1j6j27-prometheus.yml-check-config-checked \
    -web.listen-address=0.0.0.0:9090 \
    -alertmanager.notification-queue-capacity=10000 \
    -alertmanager.timeout=10s
  Restart=always

nixos/tests: register the prometheus2 test
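
The "get rid of empty arguments" commit above can be sketched in Nix: building the flag list with `lib.optional` (which yields either a one-element list or an empty list) instead of interpolating `optionalString` results means a disabled option contributes no list element at all, so no empty string ever reaches the command line. Names here are illustrative, not the module's actual code:

```nix
# Illustrative sketch, assuming a `cfg` with the options shown; disabled
# options simply add nothing to the list instead of an empty string.
{ lib, cfg }:
lib.concatStringsSep " \\\n  " (
  cfg.extraFlags
  ++ [
    "-storage.local.path=${cfg.stateDir}/metrics"
    "-web.listen-address=${cfg.listenAddress}"
    "-alertmanager.timeout=${toString cfg.alertmanagerTimeout}s"
  ]
  ++ lib.optional (cfg.alertmanagerURL != [ ])
       "-alertmanager.url=${lib.concatStringsSep "," cfg.alertmanagerURL}"
  ++ lib.optional (cfg.webExternalUrl != null)
       "-web.external-url=${cfg.webExternalUrl}"
)
```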
jbgi pushed a commit to input-output-hk/nixpkgs that referenced this pull request Apr 9, 2019
basvandijk pushed a commit to basvandijk/nixpkgs that referenced this pull request Apr 10, 2019

(cherry picked from commit 11b8972)