Prometheus is great when it comes to alerting, and it's also quite easy to set up in a highly available configuration. From an economic point of view, however, HA is not always an option: depending on the number of targets and metrics, Prometheus may require quite a lot of resources, so it's common to run a non-HA setup in dev and testing environments. But no matter which environment we run, we probably want to make sure that our alerting pipeline is healthy and get notified when it isn't.
For some Alertmanager receivers, like PagerDuty, such a feature is built in. Prometheus provides a special alert – DeadMansSwitch – which is always firing, so Alertmanager forwards it every repeat_interval. PagerDuty can be configured to page operators when the DeadMansSwitch alert stops arriving, so if the alerting pipeline fails, operators are notified immediately.
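If your Prometheus setup doesn't already ship such an alert (kube-prometheus does, under the name DeadMansSwitch or, in newer versions, Watchdog), it's just an always-firing rule. A minimal sketch:
groups:
- name: meta
  rules:
  - alert: DeadMansSwitch
    expr: vector(1)   # always evaluates to 1, so the alert never stops firing
    labels:
      severity: none
    annotations:
      summary: Alerting pipeline heartbeat – this alert should always be firing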
In this post I introduce one of my latest open source projects, prometheus-alertmanager-cloudwatch-webhook. The project provides a binary (and a container image) that exposes an Alertmanager-compatible webhook. When the webhook is invoked, it puts a metric into CloudWatch. Once the metric is in CloudWatch, we can set up a CloudWatch alarm to watch it and send notifications on changes.
Deploying the app is basically just a matter of applying some manifests and creating an AWS role. The role policy is described in the README; here's how to deploy the manifests:
git clone [email protected]:tomaszkiewicz/prometheus-alertmanager-cloudwatch-webhook.git
cd prometheus-alertmanager-cloudwatch-webhook/build/k8s
kustomize build | kubectl apply -f -
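The README is the authoritative source for the role policy; the essential permission is cloudwatch:PutMetricData for the role the pod runs under. A sketch of such a policy in Terraform (the resource name here is mine, not from the project):
resource "aws_iam_policy" "cloudwatch_webhook" {
  name = "prometheus-alertmanager-cloudwatch-webhook"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["cloudwatch:PutMetricData"],
      "Resource": "*"
    }
  ]
}
EOF
}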
After you deploy the app, you have to point Alertmanager at the webhook so it forwards the DeadMansSwitch alert there – here's a sample configuration:
route:
  receiver: "slack"
  group_by:
  - severity
  - alertname
  routes:
  - receiver: "cloudwatch"
    match:
      alertname: DeadMansSwitch
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 1m
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 48h
receivers:
- name: "cloudwatch"
  webhook_configs:
  - url: "http://alertmanager-cloudwatch-webhook/webhook"
- name: "slack"
  ...
As you can see, we use an additional route to match only the DeadMansSwitch alert and send it to the webhook. We also set repeat_interval to one minute on that route so the CloudWatch metric keeps getting updated regardless of the general repeat_interval setting. That's important, as we want to be notified quickly if something is wrong.
Next we define the receiver itself, with a webhook_configs entry pointing to the service we've just deployed.
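If you want to smoke-test the webhook itself, you can port-forward the service and POST a minimal Alertmanager-style payload by hand. The service name and port are taken from the URL above, and the payload is a trimmed-down sketch of the standard Alertmanager webhook format – check the project's README for what the handler actually expects:
# in one terminal
kubectl port-forward svc/alertmanager-cloudwatch-webhook 8080:80

# in another
curl -X POST http://localhost:8080/webhook \
  -H "Content-Type: application/json" \
  -d '{
        "version": "4",
        "status": "firing",
        "receiver": "cloudwatch",
        "alerts": [
          { "status": "firing", "labels": { "alertname": "DeadMansSwitch" } }
        ]
      }'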
Now apply the config changes to Alertmanager and you should see the new metric in CloudWatch within a minute or so.
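You can also check from the CLI – assuming the namespace and metric name used in the Terraform example below:
aws cloudwatch list-metrics --namespace Prometheus --metric-name DeadMansSwitch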
The next step is to configure a CloudWatch alarm; here's an example Terraform resource:
resource "aws_cloudwatch_metric_alarm" "p8s_dead_mans_switch" {
alarm_name = "prometheus-alertmanager-pipeline-health"
alarm_description = "This metric shows health of alerting pipeline"
comparison_operator = "LessThanThreshold"
evaluation_periods = "5"
metric_name = "DeadMansSwitch"
namespace = "Prometheus"
period = "60"
statistic = "Minimum"
threshold = "1"
treat_missing_data = "breaching"
alarm_actions = ["${module.slack_alarm_notification.sns_topic_arn}"]
ok_actions = ["${module.slack_alarm_notification.sns_topic_arn}"]
}
In my example I send the alarm notifications to a module that handles Slack notifications from CloudWatch. Of course you can adapt this to your own case and your own alerting infrastructure.
I hope you like the project and find it useful :)