Enable prometheus node exporter by default #19

Merged
nerf merged 5 commits from Gonne/nixConfig:prometheusNodeExporter into main 2023-11-06 14:30:33 +00:00
Owner
No description provided.
Gonne added 1 commit 2023-10-23 17:41:32 +00:00
Author
Owner

Before merging, we should look through the list of exporters and choose the ones we like. CC @dsimon

Gonne changed title from Enable prometheus node exporter by default to WIP: Enable prometheus node exporter by default 2023-10-25 10:24:28 +00:00
Gonne force-pushed prometheusNodeExporter from b275efa6df to 559c5a47ad 2023-10-25 10:25:20 +00:00 Compare
Owner

> Before merging, we should look through the list of exporters and choose the ones we like. CC @dsimon

If we want to start a discussion, we should open an issue.
Gonne added 1 commit 2023-10-27 14:01:49 +00:00
Author
Owner

Solved by offline discussion.

Gonne force-pushed prometheusNodeExporter from 1b1bf736db to 1de0d32860 2023-10-27 14:15:10 +00:00 Compare
Gonne changed title from WIP: Enable prometheus node exporter by default to Enable prometheus node exporter by default 2023-10-28 05:47:39 +00:00
nerf requested changes 2023-10-28 14:32:55 +00:00
nerf left a comment
Owner

I don't know much about prometheus and which options should be set and which not. I would love it if @dsimon could leave a comment here about that.
@ -56,3 +57,3 @@
};
};
};
};
Owner

This is the closing bracket of the services record; there should be less whitespace here, right?
Gonne marked this conversation as resolved
@ -0,0 +1,36 @@
{
Owner

At least there should be persistence for `/var/lib/${config.services.prometheus.stateDir}` (maybe the config name is slightly wrong, better look it up).
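For looking it up: in NixOS the option is `services.prometheus.stateDir`, a directory name below `/var/lib` (default `prometheus2`). A minimal sketch of such a persistence entry, assuming an impermanence-style `environment.persistence` list; where exactly such a list lives in this repo is an assumption:

```nix
{ config, ... }:
{
  # services.prometheus.stateDir defaults to "prometheus2", so this
  # resolves to /var/lib/prometheus2 on a default setup.
  # The persistence root "/persist" is an assumption.
  environment.persistence."/persist".directories = [
    "/var/lib/${config.services.prometheus.stateDir}"
  ];
}
```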
Gonne marked this conversation as resolved
@ -0,0 +5,4 @@
port = 9100;
# Aligned with https://git.rwth-aachen.de/fsdmath/server/prometheus/-/blob/main/node_exporter/etc/default/prometheus-node-exporter
# Original reasons for these lists are unknown, but along the lines of
# “This looks useless for VMs, but that seems nice.”
Owner

Maybe we should rethink this list, even if we end up with the same list but with a less shitty justification.
Owner

Initially the list was created with the following three aspects in mind:

  1. Does the current Debian release support the collector (because some weren't supported)?
  2. Is the collector deprecated in the latest release?
  3. Could you plausibly use the collected metrics for monitoring in a useful manner, or are they useless because they make no sense in our context (e.g. power adapter inside a VM, fibre port connection)?

Because the scraped metrics are not saved on the monitored host itself but on the monitoring host (Cthugha) for a limited time span, space allocation should not be considered a problem. You could restrict the collectors to only the ones actually used, but because rolling out those configurations on all hosts (especially those running without Nix) is a pain (which would be necessary every time you want to use a currently unexported metric) and the export itself doesn't harm anything, I think the current set of exporters is fine.
Owner

Does cthugha need support for them? Because if not, 1.) is not the right constraint for Nix machines.

Also, if that is the real justification, that should be what the comment says, and not that they kinda looked nice.
Owner

It depends on what you mean by "support for them":

  1. If you mean support as in can handle the additional data (which I think is the intended meaning), then the answer is no, cthugha does not need additional support for those metrics and can handle them automatically.
  2. If you mean support as in can use that new data automatically in a useful manner, then the answer is yes, cthugha needs to be explicitly configured. In that case, the prometheus instance needs some new (alerting) rules which use those newly exported metrics.

The justification in the comment could actually be better, but it is not as wrong as your answer might suggest. The initial selection of collectors is somewhat arbitrary. In fact I'm not using all exported metrics for alerting, but only a (small to medium sized) subset of the available ones. Some metrics are actively used in alerting (e.g. `collector.systemd`, `collector.time` and `collector.textfile`), while others could be used in a nice way in future setups (e.g. `collector.mdadm`, `collector.cpufreq` and `collector.nfsd`). The latter examples enable the export of metrics about software RAIDs (`mdadm`) and NFS servers (`nfsd`), which are definitely not used at the moment in our infrastructure (at least to my knowledge), but could probably be used in a future use case. Explicitly disabled are only exporters which are not useful at all, because of the VM environment, the constant resolution of ARP information within our internal network (monitoring ARP tables in a static server network doesn't make any sense to me), the absence of required hardware, or because of performance problems (`udp_queue`).

I would suggest keeping the list roughly symmetric, but adding the new collectors not available in Debian Bullseye. Shrinking the list of available metrics (in comparison to other hosts) could lead to unexpected results like alerts not firing for a specific host because the used metrics from that host were not available (absence of metrics -> empty dataset for that host -> no data to compare the rule against -> no alert firing in case of a problem).
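To make that concrete, a minimal sketch of how such a collector selection looks with the NixOS node exporter options; the specific collectors named here are illustrative examples from the discussion above, not the list this PR settles on:

```nix
{
  services.prometheus.exporters.node = {
    enable = true;
    port = 9100;
    # Not enabled by default, but actively used in alerting (examples from above).
    enabledCollectors = [ "systemd" "textfile" ];
    # Enabled by default, but not useful in our VM environment (illustrative picks).
    disabledCollectors = [ "arp" "thermal_zone" ];
  };
}
```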
Owner

With “support for them” I meant whether this only applies to the host with the exporter or also to cthugha (which is a Debian machine, right?).

> Does the current Debian release support the collector (because some weren't supported)?

So you say it is advisable to have the same list of exporters on all machines and not enable exporters selectively?
Owner

Side note: The list here currently disables collectors explicitly enabled in [1], like `nfs`. Because NFS is not used at the moment (neither the protocol nor the exported metric), this should not be a problem. But at least the new list is not a superset of the original list.

[1] https://git.rwth-aachen.de/fsdmath/server/prometheus/-/blob/main/node_exporter/etc/default/prometheus-node-exporter
Gonne added 1 commit 2023-10-30 16:25:19 +00:00
nerf requested changes 2023-10-31 10:48:03 +00:00
@ -39,6 +39,7 @@ config = mkIf cfg.enable {
"/etc/ssh/ssh_host_ed25519_key.pub"
"/etc/ssh/ssh_host_rsa_key"
"/etc/ssh/ssh_host_rsa_key.pub"
"/var/lib/${config.services.prometheus.stateDir}"
Owner

I think that should go together with the prometheus config and not here, because if we commented out the prometheus config we would also want this to not be persistent. So I think it should be reachable from the same import.

Maybe there are better reasons to do it like this; if so, convince me please.
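A sketch of the colocation suggested here: the persistence entry lives in the same module as the prometheus config and is guarded by it, so commenting out that import drops both. The `environment.persistence` option path is an assumption; the repository may use its own persistence option instead:

```nix
{ config, lib, ... }:
{
  # Only persist the prometheus state directory while prometheus itself
  # is enabled, so commenting out the prometheus config in the same import
  # also drops the persistence entry.
  # The option path "environment.persistence" is an assumption.
  environment.persistence."/persist".directories =
    lib.mkIf config.services.prometheus.enable [
      "/var/lib/${config.services.prometheus.stateDir}"
    ];
}
```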
Gonne marked this conversation as resolved
Gonne was assigned by nerf 2023-11-01 10:56:00 +00:00
Gonne force-pushed prometheusNodeExporter from a140b4d9ec to 3da04e80ae 2023-11-03 15:39:56 +00:00 Compare
Gonne force-pushed prometheusNodeExporter from 3da04e80ae to 2c2b24d0a9 2023-11-03 15:43:33 +00:00 Compare
Gonne added 1 commit 2023-11-03 15:46:41 +00:00
Author
Owner

> Side note: The list here currently disables collectors explicitly enabled in [1], like `nfs`. Because NFS is not used at the moment (neither the protocol nor the exported metric), this should not be a problem. But at least the new list is not a superset of the original list.
>
> [1] https://git.rwth-aachen.de/fsdmath/server/prometheus/-/blob/main/node_exporter/etc/default/prometheus-node-exporter

Fixed.
Gonne added 1 commit 2023-11-06 11:48:35 +00:00
Gonne requested review from nerf 2023-11-06 11:49:21 +00:00
nerf approved these changes 2023-11-06 14:30:17 +00:00
nerf merged commit b2f0945473 into main 2023-11-06 14:30:33 +00:00