Monitoring Prometheus alerting pipeline health using CloudWatch

On 26 May, 2019
AWS, Kubernetes, Prometheus

Prometheus is great when it comes to alerting, and it’s also quite easy to set up in a highly available configuration. From an economic point of view, however, HA is sometimes not an option: depending on the number of targets and metrics, Prometheus may require quite a lot of resources. So it’s common to run a non-HA setup in dev and testing environments. But no matter which environment we run, we probably want to make sure that our alerting pipeline is healthy and get notified when it isn’t.

For some alerting receivers in Alertmanager, like PagerDuty, such a feature is built in. Prometheus provides a special alert – DeadMansSwitch – which reports a failing state all the time, so Alertmanager forwards it every repeat_interval. PagerDuty can be configured to notify operators when no failure is reported on the DeadMansSwitch alert, so in case of an alerting pipeline failure operators learn about it immediately.
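If your stack doesn’t already ship such an alert (kube-prometheus includes one by default), a minimal Prometheus rule producing an always-firing DeadMansSwitch looks roughly like this – a sketch, the exact labels and annotations are up to you:

```yaml
groups:
- name: meta
  rules:
  - alert: DeadMansSwitch
    # vector(1) always returns a value, so this alert never resolves
    expr: vector(1)
    labels:
      severity: none
    annotations:
      description: >-
        Always-firing alert used to verify that the entire alerting
        pipeline is functional end to end.
```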

In this post I introduce one of my latest open source projects, prometheus-alertmanager-cloudwatch-webhook. The project provides a binary (and a container image) that runs an application exposing an Alertmanager-compatible webhook. When the webhook is invoked, it puts a metric into CloudWatch. Once the metric is in CloudWatch, we can set up a CloudWatch alarm to watch it and send notifications on changes.
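The webhook receives Alertmanager’s standard webhook payload. A minimal sketch in Go of the idea – decode the payload and map it to a metric value to push via PutMetricData – where the exact mapping shown is an assumption for illustration, not the project’s verbatim logic:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the Alertmanager webhook payload
// (field names per Alertmanager's webhook_config format).
type webhookPayload struct {
	Status string `json:"status"`
	Alerts []struct {
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

// metricValue maps an incoming payload to a CloudWatch metric value:
// 1 while the DeadMansSwitch alert is firing, 0 otherwise.
func metricValue(body []byte) (float64, error) {
	var p webhookPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return 0, err
	}
	if p.Status == "firing" {
		return 1, nil
	}
	return 0, nil
}

func main() {
	sample := []byte(`{"status":"firing","alerts":[{"labels":{"alertname":"DeadMansSwitch"}}]}`)
	v, err := metricValue(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // prints 1 while the switch is firing
}
```

The real application then ships this value to the Prometheus namespace in CloudWatch using the AWS SDK.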

Deploying the app is just a matter of applying a few manifests and creating an AWS role. The role policy is described in the README; here’s how to deploy the manifests:

git clone [email protected]:tomaszkiewicz/prometheus-alertmanager-cloudwatch-webhook.git
cd prometheus-alertmanager-cloudwatch-webhook/build/k8s
kustomize build | kubectl apply -f -

After you deploy the app you have to point Alertmanager at the webhook so it forwards the DeadMansSwitch alert state there – a sample configuration:

route:
  receiver: "slack"
  group_by:
  - severity
  - alertname
  routes:
  - receiver: "cloudwatch"
    match:
      alertname: DeadMansSwitch
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 1m
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 48h

receivers:
- name: "cloudwatch"
  webhook_configs:
  - url: "http://alertmanager-cloudwatch-webhook/webhook"
- name: "slack"
...

As you can see, we use an additional route to match only the DeadMansSwitch alert and send it to the webhook. We also specify a repeat_interval of one minute to make sure CloudWatch gets the metric updated regardless of the general repeat_interval setting. That’s important, as we want to be notified quickly if something is wrong.

Next we configure a receiver by declaring it with a webhook_configs parameter pointing to the just-deployed service.

Now apply the config changes to Alertmanager and you should see the new metric in CloudWatch within a minute.
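You can verify this from the command line with the AWS CLI – assuming the webhook publishes into the Prometheus namespace, as the Terraform resource later in this post expects (the date invocation below is GNU date syntax):

```shell
# List the metrics published into the Prometheus namespace:
aws cloudwatch list-metrics --namespace Prometheus

# Fetch the last few datapoints of the DeadMansSwitch metric:
aws cloudwatch get-metric-statistics \
  --namespace Prometheus \
  --metric-name DeadMansSwitch \
  --statistics Minimum \
  --period 60 \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```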

The next step is to configure a CloudWatch alarm; here’s an example Terraform resource:

resource "aws_cloudwatch_metric_alarm" "p8s_dead_mans_switch" {
  alarm_name = "prometheus-alertmanager-pipeline-health"
  alarm_description = "This metric shows health of alerting pipeline"
  comparison_operator = "LessThanThreshold"
  evaluation_periods = "5"
  metric_name = "DeadMansSwitch"
  namespace = "Prometheus"
  period = "60"
  statistic = "Minimum"
  threshold = "1"
  treat_missing_data = "breaching"
  alarm_actions = ["${module.slack_alarm_notification.sns_topic_arn}"]
  ok_actions = ["${module.slack_alarm_notification.sns_topic_arn}"]
}

In my example I route all alarm notifications to a module that handles Slack notifications from CloudWatch. Of course, you can adapt this to your own case and alerting infrastructure.

I hope you like the project and find it useful :)

