#AzureMonitor: Heartbeat failure alerts

One of the questions we receive regularly is how to use the Azure Monitor components to alert on machines that are not available, and then how to create availability reports using these tools.

My colleague Anders and I have been looking at the best ways of achieving this in a way that those who are migrating from tools like System Center Operations Manager would be familiar and comfortable with.

As the monitoring agent used by Azure Monitor on both Windows and Linux sends a heartbeat every minute, the easiest method to detect a server down event, regardless of server location, would be to alert on missing heartbeats. This means you can use one alert rule to notify for heartbeat failures, even if machines are hosted on-prem.

Log Ingestion Time and Latency

Before we look at the technical detail, it is worth calling out the Log Ingestion Time for Azure Monitor. This is particularly important if you are expecting heartbeat missed notifications within a specific time frame. In this article, the following query is shared:

Heartbeat
| where TimeGenerated > ago(8h)
| extend E2EIngestionLatency = ingestion_time() – TimeGenerated
| summarize percentiles(E2EIngestionLatency,50,95) by Computer
| top 20 by percentile_E2EIngestionLatency_95 desc


Which you can use to view the computers with the highest ingestion time over the last 8 hours. This can help you plan out the thresholds for the alerting settings.

Alerting

You can use the following query in Logs to retrieve machines that have not sent a heartbeat in the last 5 minutes:

Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)


You can adjust this based on the results of the previous query as appropriate for your environment.

And this is great for reporting and dashboarding, but we have found using the Heartbeat Metric in the alert rule fields better results. Read more about Metrics here.

To create such a rule, you want to target the Workspace resource still, but, in the condition, you want to use the Heartbeat metric signal:

clip_image002

You will now be able to configure the alert options.

clip_image004

  1. Select the computers to alert on. You can choose Select All
  2. Change to Less or equal to, and enter 0 as your threshold value
  3. Select your aggregation granularity and frequency

The best results we have found during testing is an alert within 2 minutes of a machine shut down, with the above settings – keeping the ingestion and latency in mind.

Using these settings, you should get an alert for each unavailable machine within a few minutes after it becomes unavailable. But, as the signal relies on the heartbeat of the agent, this may also alert during maintenance times, or if the agent is stopped.

If you need an alert quickly, and you are not concerned with an alert flood, then use these settings.

However, if you want to ensure that you only alert on valid server outages, you may want to take a few additional steps. You can use Azure Automation Runbooks or Logic Apps as an alert response to perform some additional diagnostic steps, and trigger another alert based on the output. This could replicate the method used in SCOM with a Heartbeat Failure alert and a Failed to Connect alert.

If you are only monitoring Azure Hosted virtual machines, you could also use the Activity Log to look for Server Shutdown events, using the following query:

AzureActivity
| where OperationName == “Deallocate Virtual Machine” and ActivityStatus == “Succeeded”
| where TimeGenerated > ago(5m)

Reporting

Conversations about server unavailable alerts invariably lead to questions around the ability to report on Server Update/Availability.

In the Logs blade, there are a few sample queries available relating to availability:

clip_image006

With the Availability rate query by default returning the availability for monitored virtual machines for the last hour, but also providing you with an availability rate query that you can build on.

This can be updated to show the availability for the last 30 days as follows:

let start_time=startofday(now()-30d);
let end_time=now();
Heartbeat
| where TimeGenerated > start_time and TimeGenerated < end_time
| summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer
| extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
| summarize total_available_hours=countif(available_per_hour==true) by Computer
| extend total_number_of_buckets=round((end_time-start_time)/1h)+1
| extend availability_rate=total_available_hours*100/total_number_of_buckets

Or, if you are storing more than 1 month of data, you can also modify the query to run for the previous month:

let start_time=startofmonth(datetime_add(‘month’,-1,now()));
let end_time=endofmonth(datetime_add(‘month’,-1,now()));
Heartbeat
| where TimeGenerated > start_time and TimeGenerated < end_time
| summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 5m, start_time), Computer
| extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
| summarize total_available_hours=countif(available_per_hour==true) by Computer
| extend total_number_of_buckets=round((end_time-start_time)/5m)+1
| extend availability_rate=total_available_hours*100/total_number_of_buckets


These queries can be used in a Workbook to create an availability report

clip_image008

Note that the availability report is based on heartbeats, not the actual service running on the server. For example, if multiple servers are part of an availability set or a cluster, the service might still be available even if one server is unavailable.

Further Reading:

One thought on “#AzureMonitor: Heartbeat failure alerts

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s