Integration

Is my pod alive? A quick guide to alerts in OpenShift

Since around version 3.11, OpenShift has provided a complete and robust monitoring stack that integrates very popular open source projects such as Grafana and Prometheus. So how can you take advantage of these features? In this quick guide we will see how to create simple alerts in OpenShift that fire when a pod is no longer running.

1) An alert is a metric

This statement is of course wrong, very wrong. But in the context of OpenShift it makes more sense than you might think. Alerts in OpenShift fire when an observable metric fulfils a condition. So, in order to define an alert, you have to choose the metric(s) you want to inspect. And in the case of OpenShift, there are a lot of them:

[Screenshot: alerts in OpenShift]

2) Defining the alert

Once you have chosen the metrics, it is time to build an expression. Since we want to monitor whether our pods are alive, we decided to use “openshift_deploymentconfig_spec_replicas”. This metric tells us how many replicas a given deployment config is supposed to have. It fits our scenario well, but you could just as well use other kube-* metrics such as “kube_pod_status_phase”. The alert will then use the following expression:

openshift_deploymentconfig_spec_replicas{deploymentconfig=~"(event-producer|test1-event-processor)"} == 0

The expression can be translated into something like: “Fire every time the replicas for the event-producer or test1-event-processor deployment config reach 0”.
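If you would rather alert on “kube_pod_status_phase” (mentioned above), a roughly equivalent expression could look like the sketch below. This is only a sketch: the pod name pattern and the namespace are assumptions. Also note that when the pods are deleted entirely their time series disappear instead of reporting 0, so in practice you may want to combine this with absent().

# Sketch only: pod name pattern and namespace are assumptions.
sum(kube_pod_status_phase{namespace="my-test-project",
    pod=~"(event-producer|test1-event-processor).*", phase="Running"}) == 0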

The YAML file for the alert in OpenShift looks like this:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: my-test-project
  name: is-pod-alive-rule
spec:
  groups:
  - name: pod-alive-rules
    rules:
    - alert: is-test-pod-alive
      annotations:
        description: '{{ $labels.namespace }}/{{ $labels.deploymentconfig }} has no running replicas.'
        summary: Targets are down
      expr: openshift_deploymentconfig_spec_replicas{deploymentconfig=~"(event-producer|test1-event-processor)"} == 0
      labels:
        severity: critical

3) Monitor your own services

The alert above will be created in our own namespace (project). But before we can do that, we have to enable monitoring for user-defined projects. To do this, create or modify the cluster-monitoring-config config map in the “openshift-monitoring” project:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

The important part is that “enableUserWorkload” is set to true.
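Applying and verifying this from the CLI could look roughly like the following sketch; the file name is hypothetical, and the user-workload monitoring pods may take a moment to come up:

# File name is hypothetical; it contains the config map shown above
oc apply -f cluster-monitoring-config.yaml

# The user workload monitoring components should start in this namespace
oc get pods -n openshift-user-workload-monitoring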

4) Adding the alert

OK, now that we are all set, we can create the alert by executing the following command:

oc apply -f ./resources/pod_alerting_fluez.yaml

and we can see it in the monitoring console:

[Screenshot: alert being fired]
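If you prefer the CLI, you can also verify that the rule object exists in the project (the resource name matches the metadata above):

oc get prometheusrule is-pod-alive-rule -n my-test-project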

5) Pimp my alert

Now that we have a basic alert running, we can make things more interesting. First, we want to be alerted when a pod is not running, but not every single time. For instance, we don’t want to be alerted while a redeployment is taking place, because nothing unusual is happening, even though the pod is not alive. For these cases we can use the “for” option:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: my-test-project
  name: is-pod-alive-rule
spec:
  groups:
  - name: pod-alive-rules
    rules:
    - alert: is-test-pod-alive
      annotations:
        description: '{{ $labels.namespace }}/{{ $labels.deploymentconfig }} has no running replicas.'
        summary: Targets are down
      expr: openshift_deploymentconfig_spec_replicas{deploymentconfig=~"(event-producer|test1-event-processor)"} == 0
      for: 5m
      labels:
        severity: critical

So now our alert can be translated into something like: “Fire when the replicas for the event-producer or test1-event-processor deployment config have been 0 for the last 5 minutes”. If we recreate the rule and scale down our deployments:

[Screenshot: deleting pods]
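For reference, scaling the deployment configs down from the CLI could look like this sketch, using the names from the expression above:

oc scale dc/event-producer --replicas=0 -n my-test-project
oc scale dc/test1-event-processor --replicas=0 -n my-test-project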

After some time (less than 5 minutes, of course) we can see that our alert is in the pending state:

[Screenshot: alert in pending state]

And after the 5 minutes:

[Screenshot: alert firing]

Another good recommendation is to add a dead man's switch, as proposed by Mateo Burillo in his blog. The idea is to have a rule that always fires: as long as it keeps firing, you know that your alerting pipeline is working and your rules are being evaluated. So, our rule file will look like this:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: my-test-project
  name: is-pod-alive-rule
spec:
  groups:
  - name: pod-alive-rules
    rules:
    - alert: is-test-pod-alive
      annotations:
        description: '{{ $labels.namespace }}/{{ $labels.deploymentconfig }} has no running replicas.'
        summary: Targets are down
      expr: openshift_deploymentconfig_spec_replicas{deploymentconfig=~"(event-producer|test1-event-processor)"} == 0
      for: 5m
      labels:
        severity: critical
    - alert: DeadMansSwitch-serviceprom
      annotations:
        description: This is a DeadMansSwitch meant to ensure that the entire Alerting
          pipeline is functional.
        summary: Alerting DeadMansSwitch
      expr: vector(1)
      labels:
        severity: info
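To double-check that both rules were picked up, you can inspect the applied object, for example with a jsonpath query like this sketch; it should list both alert names:

oc get prometheusrule is-pod-alive-rule -n my-test-project \
  -o jsonpath='{.spec.groups[0].rules[*].alert}'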

Summary

There are a lot of external tools, frameworks, suites and libraries that can be used to monitor our applications. Fortunately for us, OpenShift already provides all the needed components, and they can easily be integrated. The idea of this blog post is to help you get started with your first alerts so that you can build more robust and complex ones.

References

Kubernetes monitoring with Prometheus, Mateo Burillo

Monitoring Applications in OpenShift using Prometheus, Michael Kotelnikov

Understanding the monitoring stack, Red Hat

Managing alerts, Red Hat