Alerting on Failed or Long Running Cronjobs with Kubernetes, Prometheus

I think one of the more poorly documented and under-discussed resources in Kubernetes is the cronjob, and the work doesn’t stop at configuring your cronjob manifest. I have also seen a lot of wrong, outdated, or overly complex Prometheus queries for monitoring cronjobs. At Lumo, we run a lot of cronjobs, so it’s important to be accurately alerted on failures or longer-than-average runtimes.

The bottom line of my cronjob alerting philosophy is: since all cronjobs have different schedules and different runtimes, it’s impossible to create one or two alerts that work for all of them – so create them individually. Who cares if you have large rules files? In terms of alerting, I find the two most important things to watch for are:

  1. Failed cronjobs (cronjobs that complete in an Error state)
  2. Cronjobs that have been running too long – i.e. your job usually finishes in x minutes but for some reason it’s taking x+n to complete.

Important note about case #1 – there’s a bug (which has since been patched) where, if the `.spec.template.spec.restartPolicy` field is set to “`OnFailure`”, the back-off limit may be ineffective. As a short-term workaround, set the restart policy for the embedded template to “`Never`”.

Additionally, make sure you have your concurrencyPolicy and restartPolicy set correctly. It’s also important that whatever you are running in the cronjob exits correctly (whether it’s an error or success). I usually use these policy settings, but please read the documentation as your case may vary:

.spec.concurrencyPolicy: Forbid
.spec.jobTemplate.spec.template.spec.restartPolicy: Never

Setting the following is useful for debugging or just seeing job history in kubectl (again, read the docs about these settings):

.spec.failedJobsHistoryLimit: 2
.spec.successfulJobsHistoryLimit: 1
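
To pull those settings together, here’s a minimal CronJob manifest sketch. It reuses the `my-cronjob` name and the every-10-minutes schedule from the alerting examples below; the API version, image, and schedule are placeholders I’ve filled in for illustration, so adjust them for your cluster and workload:

apiVersion: batch/v1                      # batch/v1beta1 on older clusters
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "*/10 * * * *"                # placeholder: every 10 minutes
  concurrencyPolicy: Forbid               # don't start a new run if the previous one is still going
  failedJobsHistoryLimit: 2
  successfulJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never            # workaround for the back-off limit bug mentioned above
          containers:
            - name: my-cronjob
              image: registry.example.com/my-cronjob:latest   # placeholder image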

Alerting on failed or long-running cronjobs

So now to the Prometheus alert rules for failed or long-running cronjobs. This is what I’ve come up with and found to be useful:

Alert when a cronjob has been running longer than it should
- alert: MyCronjobRunningTooLong
  expr: max(abs(kube_job_status_start_time{job=~"my-cronjob.*"} - kube_job_status_completion_time{job=~"my-cronjob.*"})) > 90
  for: 1m
  labels:
    app: my-cronjob
    severity: warning
  annotations:
    description: 'my-cronjob has been running for {{ $value }} seconds (it averages 70s)'
    summary: 'my-cronjob has been running for {{ $value }} seconds (it averages 70s)'

If you are unsure how long your jobs usually run, you can easily graph the above expression (with the `> 90` condition removed) in Prometheus itself or in Grafana. Using `max()` also makes this alert more sensitive – if you want it to be less sensitive to spikes in runtime, use `avg()` instead.
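
For reference, the bare expression to paste into the Prometheus expression browser or a Grafana panel (the same query as above with the `> 90` threshold dropped) would be:

max(abs(kube_job_status_start_time{job=~"my-cronjob.*"} - kube_job_status_completion_time{job=~"my-cronjob.*"}))

Graph it over a representative stretch of runs and pick a threshold comfortably above the normal runtime.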

Alert when a cronjob terminates with a Failure
- alert: MyCronjobFailed
  expr: sum(rate(kube_job_status_failed{job=~"my-cronjob.*"}[10m])) > 0
  for: 1m
  labels:
    app: my-cronjob
    severity: warning
  annotations:
    description: 'my-cronjob is failing!'
    summary: 'my-cronjob is failing!'

The rate window on the failure alert should match the interval at which your cronjob runs. In this case, the cronjob runs every 10 minutes, so I’ve set the range accordingly and it will only look for failures in the last 10 minutes. If your use case is different, this is something you will want to modify – see the sketch below. Additionally, tweak the `for` field if you expect occasional failures – sometimes our jobs fail due to flaky third-party services but succeed on the next run.
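
As a sketch of that adjustment, a hypothetical cronjob that runs hourly (not one of the jobs from this post) might use a matching `[1h]` window and a slightly longer `for`:

- alert: MyHourlyCronjobFailed
  expr: sum(rate(kube_job_status_failed{job=~"my-hourly-cronjob.*"}[1h])) > 0
  for: 5m
  labels:
    app: my-hourly-cronjob
    severity: warning
  annotations:
    description: 'my-hourly-cronjob is failing!'
    summary: 'my-hourly-cronjob is failing!'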

Also, note the `.*` in the job matcher – you need this when using a `kube_job_*` metric because the CronJob controller appends a scheduled-time suffix to the Job names it creates. If you are using a `kube_cronjob_*` metric in Prometheus, the cronjob appears as `cronjob="my-cronjob"` without the suffix.
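
To make the difference concrete, here are the two matching styles side by side; `kube_cronjob_status_last_schedule_time` is just one example of a cronjob-level metric and may or may not be the one you need:

# Job-level metric: Job names carry a scheduled-time suffix, so use a regex match
kube_job_status_failed{job=~"my-cronjob.*"}

# CronJob-level metric: the CronJob name is exact, no suffix needed
kube_cronjob_status_last_schedule_time{cronjob="my-cronjob"}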

Conclusion

The next step from here is pushing custom metrics via the PushGateway to detect more than just failures or long-running jobs. One thing I didn’t mention is `.spec.activeDeadlineSeconds` – this would effectively kill your job if it has been running too long; however, I have never used it, so I can’t speak to how Prometheus would recognize it. The main reason I have not implemented it is that it seemed like it might cause more harm than good for most of our use cases.

You can also get fancy with labels and such in your descriptions/summaries, or record the average runtime to make the alert description for long-running jobs more dynamic, but I don’t find that necessary for the relatively straightforward case here. As I mentioned above, most of the battle is knowing your cronjob – how long it usually takes to run, whether it exits correctly on failures, and so on.
