One of the questions we receive regularly is how to use the Azure Monitor components to alert on machines that are not available, and then how to create availability reports using these tools.
My colleague Anders and I have been looking at the best ways of achieving this in a way that those who are migrating from tools like System Center Operations Manager would be familiar and comfortable with.
As the monitoring agent used by Azure Monitor on both Windows and Linux sends a heartbeat every minute, the easiest method to detect a server down event, regardless of server location, would be to alert on missing heartbeats. This means you can use one alert rule to notify for heartbeat failures, even if machines are hosted on-prem.
Log Ingestion Time and Latency
Before we look at the technical detail, it is worth calling out the Log Ingestion Time for Azure Monitor. This is particularly important if you are expecting heartbeat missed notifications within a specific time frame. In this article, a query is shared that you can use to view the computers with the highest ingestion time over the last 8 hours. This can help you plan the thresholds for the alerting settings.
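As a minimal sketch, an ingestion-time query along these lines could look like the following. It is modeled on the example in the Azure Monitor documentation on log data ingestion time and may differ from the query referenced above. The column name E2EIngestionLatency is only a label chosen for this example.

```kusto
// Sketch: end-to-end ingestion latency per computer over the last 8 hours.
// ingestion_time() returns when the record became available in the workspace;
// TimeGenerated is when the agent created the record.
Heartbeat
| where TimeGenerated > ago(8h)
| extend E2EIngestionLatency = ingestion_time() - TimeGenerated
| summarize percentiles(E2EIngestionLatency, 50, 95) by Computer
// percentiles() names its output columns percentile_<column>_<n>
| top 20 by percentile_E2EIngestionLatency_95 desc
```

If the 95th percentile for your machines is close to or above the window you plan to alert on, a wider window is likely to reduce false alerts.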
You can use a query in Logs to retrieve machines that have not sent a heartbeat in the last 5 minutes.
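A minimal sketch of such a missed-heartbeat query, assuming the standard Heartbeat table and a 5 minute threshold (LastHeartbeat is a column name chosen for this example):

```kusto
// Sketch: computers whose most recent heartbeat is older than 5 minutes.
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)
| order by LastHeartbeat asc
```

Note that a machine only appears in the result if it has sent at least one heartbeat within the time range the query runs over.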
You can adjust this based on the results of the previous query as appropriate for your environment.
This is great for reporting and dashboarding, but we have found that using the Heartbeat Metric in the alert rule yields better results. Read more about Metrics here.
You will now be able to configure the alert options.
- Select the computers to alert on. You can choose Select All.
- Change the operator to Less than or equal to, and enter 0 as your threshold value
- Select your aggregation granularity and frequency
The best result we found during testing was an alert within 2 minutes of a machine shutting down with the above settings, keeping ingestion time and latency in mind.
Using these settings, you should get an alert for each unavailable machine within a few minutes after it becomes unavailable. But, as the signal relies on the heartbeat of the agent, this may also alert during maintenance times, or if the agent is stopped.
If you need an alert quickly, and you are not concerned with an alert flood, then use these settings.
However, if you want to ensure that you only alert on valid server outages, you may want to take a few additional steps. You can use Azure Automation Runbooks or Logic Apps as an alert response to perform some additional diagnostic steps, and trigger another alert based on the output. This could replicate the method used in SCOM with a Heartbeat Failure alert and a Failed to Connect alert.
If you are only monitoring Azure-hosted virtual machines, you could also query the Activity Log for Server Shutdown events.
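One way to sketch such an Activity Log query is shown below. It assumes the Activity Log is being sent to your Log Analytics workspace and that the AzureActivity table uses the current column names (OperationNameValue, ActivityStatusValue). The two operation names are the power off and deallocate actions for virtual machines. Treat the status value and the 1 day time range as assumptions to verify against your own data.

```kusto
// Sketch: successful VM power off / deallocate operations in the last day.
// in~ and =~ compare case-insensitively, since casing can vary in this table.
AzureActivity
| where TimeGenerated > ago(1d)
| where OperationNameValue in~ (
    "Microsoft.Compute/virtualMachines/powerOff/action",
    "Microsoft.Compute/virtualMachines/deallocate/action")
| where ActivityStatusValue =~ "Success"
| project TimeGenerated, ResourceGroup, _ResourceId, Caller, OperationNameValue
| order by TimeGenerated desc
```

These events cover stop operations issued through Azure. A shutdown initiated inside the guest operating system, or a crash, may not appear here, so this works best alongside the heartbeat alert rather than instead of it.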
Conversations about server unavailable alerts invariably lead to questions around the ability to report on Server Uptime/Availability.
In the Logs blade, there are a few sample queries available relating to availability.
By default, the Availability rate query returns the availability of monitored virtual machines for the last hour, and it also gives you an availability rate query that you can build on.
This can be updated to show the availability for the last 30 days.
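As a sketch of the 30 day variant, written in the style of the built-in availability rate sample (the exact sample text in your workspace may differ), a machine is counted as available in any hour in which it sent at least one heartbeat:

```kusto
// Sketch: hourly-bucket availability rate per computer for the last 30 days.
let start_time = startofday(ago(30d));
let end_time = now();
Heartbeat
| where TimeGenerated > start_time and TimeGenerated < end_time
// Count heartbeats per computer in 1 hour buckets aligned to start_time
| summarize heartbeat_per_hour = count() by bin_at(TimeGenerated, 1h, start_time), Computer
| extend available_per_hour = iff(heartbeat_per_hour > 0, true, false)
| summarize total_available_hours = countif(available_per_hour == true) by Computer
// +1 accounts for the partial bucket at the end of the range
| extend total_number_of_buckets = round((end_time - start_time) / 1h) + 1
| extend availability_rate = total_available_hours * 100 / total_number_of_buckets
```

Because availability is measured per hour, an outage shorter than an hour that still leaves one heartbeat in the bucket does not lower the rate. A smaller bucket size gives finer resolution at the cost of a heavier query.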
Or, if you are storing more than 1 month of data, you can also modify the query to run for the previous month.
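A sketch of the previous-month variant only needs different time bounds. Here startofmonth(now(), -1) is assumed to give the first instant of the previous calendar month, and startofmonth(now()) the first instant of the current one, so the range is made up of whole hourly buckets:

```kusto
// Sketch: hourly-bucket availability rate per computer for the previous calendar month.
let start_time = startofmonth(now(), -1);
let end_time = startofmonth(now());
Heartbeat
| where TimeGenerated >= start_time and TimeGenerated < end_time
| summarize heartbeat_per_hour = count() by bin_at(TimeGenerated, 1h, start_time), Computer
| summarize total_available_hours = countif(heartbeat_per_hour > 0) by Computer
// The range covers whole hours only, so no extra partial bucket is added
| extend total_number_of_buckets = (end_time - start_time) / 1h
| extend availability_rate = round(total_available_hours * 100.0 / total_number_of_buckets, 2)
```

The workspace retention must cover the whole previous month, otherwise the hours that have already aged out are counted as unavailable.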
These queries can be used in a Workbook to create an availability report.
Note that the availability report is based on heartbeats, not the actual service running on the server. For example, if multiple servers are part of an availability set or a cluster, the service might still be available even if one server is unavailable.