Azure Monitor is the single umbrella service for all monitoring in Azure — metrics, logs, alerts, dashboards, and insights. Everything flows through it.
Azure Monitor is the observability platform for everything in Azure. Before it existed, different services had separate monitoring tools: VMs used Log Analytics, App Service used Application Insights, and the Activity Log was separate. Azure Monitor is the umbrella that consolidates all of them.
There are two fundamental types of monitoring data.
Metrics are numerical measurements collected at regular intervals: CPU percentage, memory usage, requests per second. Every Azure resource emits platform metrics automatically, without configuration. They are available within about one minute and stored for 93 days. Use them for dashboards, autoscale triggers, and threshold alerts.
Logs are structured records capturing events, operations, and errors: an HTTP 500 error with its stack trace, a failed authentication with username and IP address, an NSG rule allowing or blocking a connection. The critical difference from metrics is that resource logs are NOT collected automatically for most resources. You must configure Diagnostic Settings on each resource to route its logs somewhere. Without Diagnostic Settings, those logs are generated and immediately discarded.
This causes real-world incidents constantly. An organisation is compromised, the investigation begins, and it discovers that Key Vault access logs were never enabled, so all evidence of what data was accessed is gone. Enable Diagnostic Settings on critical resources before you need the logs, not after.
A Log Analytics workspace is where logs go. It is a queryable store that uses Kusto Query Language (KQL). A simple query might be "show me all failed sign-ins in the last hour"; a more complex one, "which VMs had CPU above 80% for more than 10 minutes in the last 24 hours".
Alerts fire when conditions you define are met. An alert rule has three parts: the scope (which resource to watch), the condition (which signal and threshold trigger it), and the action group (what to do when it fires). Action groups can send email and SMS, call webhooks, trigger Logic Apps or Azure Functions, or create ITSM tickets. Action groups are reusable: one action group can be attached to 20 different alert rules, and whichever rule fires, the same team is notified.
The Activity Log captures all control-plane operations (who created, modified, or deleted which resources) and is retained for 90 days automatically.
Remember: metrics are automatic, numerical, and near-real-time, with 93-day retention. Logs need Diagnostic Settings, are stored in Log Analytics, and are queried with KQL. Alerts need a scope, a condition, and an action group. Action groups are reusable.
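The three-part alert rule and the reusable action group can be modelled in a few lines of Python. This is a hypothetical sketch, not the Azure SDK. The `AlertRule` and `ActionGroup` classes, their fields, and `evaluate` are invented for illustration. The sketch shows one action group shared by two rules. It also shows that a rule with no action group still fires but notifies nobody.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ActionGroup:
    """Hypothetical stand-in for an Azure Monitor action group."""
    name: str
    receivers: list[str]                           # e.g. email addresses, SMS numbers
    sent: list[str] = field(default_factory=list)  # record of notifications "sent"

    def notify(self, message: str) -> None:
        for receiver in self.receivers:
            self.sent.append(f"{receiver}: {message}")


@dataclass
class AlertRule:
    """Hypothetical alert rule: scope + condition + optional action group."""
    scope: str                                   # which resource to watch
    condition: Callable[[float], bool]           # signal + threshold
    action_group: Optional[ActionGroup] = None   # what to do when it fires

    def evaluate(self, signal_value: float) -> bool:
        fired = self.condition(signal_value)
        if fired and self.action_group is not None:
            self.action_group.notify(f"alert fired on {self.scope} (value={signal_value})")
        # With no action group the rule still reports "fired", but nobody is told.
        return fired


oncall = ActionGroup("ag-kube-oncall", ["ops@example.com"])
cpu_rule = AlertRule("vm-web-01", lambda cpu: cpu > 80, oncall)
disk_rule = AlertRule("vm-db-01", lambda free_pct: free_pct < 10, oncall)  # same group reused
silent_rule = AlertRule("vm-test-01", lambda cpu: cpu > 80)                  # no action group

cpu_rule.evaluate(92.5)
disk_rule.evaluate(4.0)
print(silent_rule.evaluate(95.0))  # True, but no notification is sent
print(len(oncall.sent))            # 2
```

The silent rule is the same failure mode as an alert that shows "Fired" in its history while no one receives anything.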
📖 Azure Monitor — Complete Explanation
Azure Monitor is the unified observability platform for everything running in Azure (and on-premises via Azure Arc). Before Azure Monitor, different services had separate monitoring tools — VMs had Log Analytics, App Service had Application Insights, Activity Log was separate. Azure Monitor is the umbrella that consolidates everything.
Two types of data — Metrics and Logs: Metrics are lightweight numeric measurements collected every minute automatically for every Azure resource. They answer "what is the current state?" (CPU is 75%, 3 requests/second). Logs are rich structured records capturing events, errors, and audit trails. They answer "what happened and why?" (Error 500 at 14:32:01, triggered by request from 203.x.x.x).
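Why Metrics aren't queried with KQL by default: platform metrics go into a dedicated time-series database optimised for fast numeric retrieval and graphing. Logs go into a Log Analytics workspace, where Kusto Query Language (KQL) runs complex aggregations and joins. Different stores serve different purposes. If you do need KQL over metrics, a Diagnostic Setting can also send a resource's platform metrics to a workspace, where they land in the AzureMetrics table.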
The diagnostic settings gap: Azure resources generate platform metrics automatically — you always have CPU, memory, disk IOPS without any configuration. But resource-specific logs (what keys were accessed in Key Vault, which requests failed in App Service, which NSG rules matched traffic) are NOT collected by default. You must explicitly configure Diagnostic Settings on each resource to route these logs to a Log Analytics workspace, Storage Account, or Event Hub. This is a very common real-world mistake — organisations assume everything is being logged, then discover after a security incident that resource logs were never enabled.
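A toy model makes the gap concrete. A resource emits log records whether or not anyone is listening, and only a Diagnostic Setting routes them to a destination. The classes below are hypothetical. `AuditEvent` is used as the example category label because it matches Key Vault's audit log category. The three destination types mirror the ones listed above.

```python
from dataclasses import dataclass, field

DESTINATION_TYPES = {"log_analytics", "storage_account", "event_hub"}


@dataclass
class DiagnosticSetting:
    """Hypothetical Diagnostic Setting: which log categories go to which destination."""
    categories: set[str]
    destination_type: str
    destination: list[dict] = field(default_factory=list)  # stands in for workspace/storage/hub

    def __post_init__(self) -> None:
        if self.destination_type not in DESTINATION_TYPES:
            raise ValueError(f"unsupported destination: {self.destination_type}")


@dataclass
class Resource:
    name: str
    diagnostic_settings: list[DiagnosticSetting] = field(default_factory=list)
    discarded: int = 0  # records emitted but routed nowhere

    def emit(self, category: str, record: dict) -> None:
        routed = False
        for setting in self.diagnostic_settings:
            if category in setting.categories:
                setting.destination.append(record)
                routed = True
        if not routed:
            self.discarded += 1  # generated, then gone for good


vault = Resource("kv-prod")
vault.emit("AuditEvent", {"op": "SecretGet", "caller": "app-1"})  # no setting yet: lost

workspace_setting = DiagnosticSetting({"AuditEvent"}, "log_analytics")
vault.diagnostic_settings.append(workspace_setting)
vault.emit("AuditEvent", {"op": "SecretGet", "caller": "app-2"})  # now captured

print(vault.discarded, len(workspace_setting.destination))  # 1 1
```

The first record cannot be recovered later. Adding the setting only affects records emitted afterwards, which is why logging must be enabled before an incident.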
Action Groups are reusable: One Action Group (e.g., "ag-kube-oncall") can be attached to 50 different alert rules. When any of those alerts fire, the same team gets notified. This is the correct design — not creating separate notification configs per alert.
🏥
The Metaphor
Azure Monitor is like a hospital monitoring system. Every patient (resource) has sensors collecting real-time vitals (Metrics) and a medical notes file (Logs). Doctors (alerts) watch for danger thresholds. When something goes wrong, the alarm sounds (alert fires) and the response team (Action Group) is automatically paged.
Diagram: Azure Monitor data flow architecture
Metrics vs Logs — Key Difference
Metrics
Numerical Time-Series — Real-Time
Numeric values collected at regular intervals (CPU %, memory, requests/sec).
Stored in a dedicated time-series database for 93 days.
Fast — available within 1 minute. Use for dashboards, autoscale triggers, threshold alerts.
Logs
Stored in Log Analytics Workspace. Retention: 30 days free, up to 2 years paid.
Queried with KQL (Kusto Query Language). Slower than metrics — seconds to minutes.
Resource logs (diagnostic logs) require Diagnostic Settings to route to workspace.
Log Analytics Workspace
Central Log Store
One Workspace — Many Sources
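Log Analytics Workspace (LAW) = central store for Azure logs. VMs, App Services, NSG Flow Logs (via Traffic Analytics), Activity Logs, and Entra ID sign-in logs can all be sent here. Most of these sources must be explicitly configured first, through Diagnostic Settings or an agent.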
KQL example, find failed sign-ins: SigninLogs | where ResultType != "0" | summarize count() by UserPrincipalName. ResultType is a string column, so compare it against "0", not 0.
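Agents: Azure Monitor Agent (AMA) is the current standard. It replaces the legacy MMA/OMS agent, which was retired in August 2024. AMA must be installed on VMs to collect guest OS metrics and logs. Data Collection Rules (DCRs) define what it collects and where it sends the data.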
KQL — Useful Queries for AZ-104
```kusto
// VMs with high CPU in the last hour
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated > ago(1h)
| summarize avg(CounterValue) by Computer
| where avg_CounterValue > 80
| order by avg_CounterValue desc

// Failed sign-ins by user
SigninLogs
| where ResultType != "0"
| summarize FailCount = count() by UserPrincipalName
| order by FailCount desc

// All activity in a resource group today
AzureActivity
| where ResourceGroup =~ "RG-Kube-Prod"   // =~ is case-insensitive; stored names may differ in case
| where TimeGenerated > startofday(now())
| project TimeGenerated, Caller, OperationName, ActivityStatus
| order by TimeGenerated desc
```
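Here is the first query's logic replayed in plain Python over a few made-up `Perf` rows, to show what each pipeline step does. The rows, computer names, and timestamps are invented; real data comes from the workspace. Each step maps to one KQL operator. `where` filters rows, `summarize avg() by` groups and averages, the second `where` filters the aggregates, and `order by ... desc` sorts.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # fixed "now" so the example is repeatable

# Invented sample rows shaped like the Perf columns used in the query.
perf = [
    {"TimeGenerated": now - timedelta(minutes=10), "Computer": "vm-web-01",
     "ObjectName": "Processor", "CounterName": "% Processor Time", "CounterValue": 91.0},
    {"TimeGenerated": now - timedelta(minutes=40), "Computer": "vm-web-01",
     "ObjectName": "Processor", "CounterName": "% Processor Time", "CounterValue": 85.0},
    {"TimeGenerated": now - timedelta(minutes=5), "Computer": "vm-db-01",
     "ObjectName": "Processor", "CounterName": "% Processor Time", "CounterValue": 40.0},
    {"TimeGenerated": now - timedelta(hours=3), "Computer": "vm-db-01",    # outside ago(1h)
     "ObjectName": "Processor", "CounterName": "% Processor Time", "CounterValue": 99.0},
    {"TimeGenerated": now - timedelta(minutes=2), "Computer": "vm-web-01",  # different counter
     "ObjectName": "Memory", "CounterName": "% Committed Bytes In Use", "CounterValue": 97.0},
]


def high_cpu(rows, now, threshold=80.0):
    # where ObjectName == "Processor" and CounterName == "% Processor Time"
    # | where TimeGenerated > ago(1h)
    recent = [r for r in rows
              if r["ObjectName"] == "Processor"
              and r["CounterName"] == "% Processor Time"
              and r["TimeGenerated"] > now - timedelta(hours=1)]
    # | summarize avg(CounterValue) by Computer
    values = defaultdict(list)
    for r in recent:
        values[r["Computer"]].append(r["CounterValue"])
    averages = {computer: sum(v) / len(v) for computer, v in values.items()}
    # | where avg_CounterValue > 80 | order by avg_CounterValue desc
    hot = [(computer, avg) for computer, avg in averages.items() if avg > threshold]
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


print(high_cpu(perf, now))  # [('vm-web-01', 88.0)]
```

vm-db-01 drops out because its 99% sample is older than one hour, which leaves an average of 40.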
Alerts & Action Groups
Alert Rule Components
Diagram: alert firing (CPU spike → email + auto-remediation)
📋
Alert Signal Types
Metric alert: fires when a metric value crosses a threshold. Near-real-time (can be evaluated every minute).
Log search alert: runs a KQL query on a schedule and fires when the result crosses a threshold. The result can be the number of rows or an aggregated value.
Activity log alert: fires on Azure management operations (VM deleted, RBAC changed).
Smart detection: Application Insights ML-based anomaly detection.
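A log search alert boils down to one check per evaluation: run the query over a lookback window and compare the result with a threshold. This is a minimal sketch. The `evaluate_log_alert` helper and the sign-in records are hypothetical. Only the `ResultType` and `TimeGenerated` names echo the `SigninLogs` columns used earlier, and the string comparison mirrors the `!= "0"` filter.

```python
from datetime import datetime, timedelta, timezone


def evaluate_log_alert(records, now, lookback, threshold):
    """Hypothetical log-alert check: count failed sign-ins in the lookback window."""
    window_start = now - lookback
    failures = [r for r in records
                if r["TimeGenerated"] > window_start and r["ResultType"] != "0"]
    return len(failures) > threshold, len(failures)


now = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
signins = [
    {"TimeGenerated": now - timedelta(minutes=m), "ResultType": code}
    for m, code in [(1, "50126"), (2, "50126"), (3, "0"), (4, "50126"),
                    (5, "50126"), (7, "50126"), (30, "50126")]
]

fired, count = evaluate_log_alert(signins, now, lookback=timedelta(minutes=15), threshold=4)
print(fired, count)  # True 5
```

A metric alert works the same way in principle, but it compares an aggregated metric value rather than a query result.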
Diagnostic Settings
Route Resource Logs
Resource Logs Don't Flow Automatically — Must Configure
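By default, resource-specific logs are NOT sent anywhere. This includes App Service logs, Key Vault access logs, and NSG event and rule-counter logs. You must create Diagnostic Settings per resource to route them. NSG flow logs are the exception: they are configured through Network Watcher, not Diagnostic Settings.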
Azure Backup provides cloud-based backup for VMs, disks, files, SQL, SAP HANA, and Blob storage. Zero infrastructure to manage — backup data stored in a Recovery Services Vault or Backup Vault.
Azure Backup protects you from data loss, not from infrastructure failure. If someone accidentally deletes a file, ransomware encrypts data, or a developer drops the wrong database table, backup saves you. Backup is not your disaster recovery strategy for infrastructure outages.
Everything starts with a vault. For VMs and SQL databases, use a Recovery Services Vault. For newer workloads such as managed disks, blobs, and AKS, use a Backup Vault. The vault must be in the same region as the resources it protects: you cannot back up an East US VM to a West US vault.
VM backup is simple. Azure VMs need no agent installation; the backup extension installs automatically. Azure snapshots all of the VM's managed disks, transfers the incremental changes to the vault, and creates a recovery point. On Windows the snapshot is application-consistent (via VSS). On Linux it is file-system-consistent unless you configure pre/post scripts.
Backup policies define schedule and retention. A single policy can combine several tiers, for example:
- daily backups retained for 30 days
- weekly backups retained for a year
- monthly backups retained for five years
- yearly backups retained for ten years
Restore options are flexible. You can restore the entire VM, either replacing the existing VM or creating a new one alongside it. You can also restore individual files: a recovery script mounts the recovery point's disks as local drives on a machine, and you copy only what you need.
Soft delete is a safety feature enabled by default. When you delete a VM's backup data, Azure does not purge it immediately; it keeps it in a soft-deleted state for 14 days. During that window you can undo the deletion and recover everything. After 14 days the data is permanently gone. This protects against accidental deletion and against ransomware attackers who try to delete your backups. Never disable soft delete on production vaults.
Vault redundancy: GRS replicates backup data to the paired region, so it survives a complete regional outage. GRS is the default and recommended option. LRS keeps backup data in one region. It is cheaper, but offers no regional redundancy.
Remember: backup protects against data loss, not infrastructure failure. Use a Recovery Services Vault for VMs and SQL. The vault must be in the same region as the protected resources. Soft delete retains deleted backup data for 14 days. Never disable it.
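The example policy above decides, for any recovery point, whether it is still retained: daily points are kept 30 days, weekly a year, monthly five years, and yearly ten years. The sketch below is a simplification with its own assumptions, labelled in the code. It treats Sunday's backup as the weekly point, the 1st of the month as the monthly point, and 1 January as the yearly point. It also approximates years as 365 days. Real policies let you choose these days and use calendar arithmetic.

```python
from datetime import date

# Hypothetical simplification of the example policy; retention expressed in days.
DAILY_DAYS = 30
WEEKLY_DAYS = 52 * 7      # "a year" of weekly points
MONTHLY_DAYS = 5 * 365    # approx. five years
YEARLY_DAYS = 10 * 365    # approx. ten years


def is_retained(point: date, today: date) -> bool:
    age = (today - point).days
    if age < 0:
        raise ValueError("recovery point is in the future")
    # A point survives if ANY tier it belongs to still retains it.
    if age <= DAILY_DAYS:
        return True
    if point.weekday() == 6 and age <= WEEKLY_DAYS:                  # assumed: Sunday = weekly
        return True
    if point.day == 1 and age <= MONTHLY_DAYS:                       # assumed: 1st = monthly
        return True
    if point.month == 1 and point.day == 1 and age <= YEARLY_DAYS:   # assumed: 1 Jan = yearly
        return True
    return False


today = date(2025, 6, 15)
print(is_retained(date(2025, 6, 1), today))   # True  (14 days old: daily tier)
print(is_retained(date(2025, 4, 8), today))   # False (a Tuesday, daily tier expired)
print(is_retained(date(2025, 4, 6), today))   # True  (a Sunday: weekly tier)
print(is_retained(date(2022, 3, 1), today))   # True  (1st of month: monthly tier)
print(is_retained(date(2019, 1, 1), today))   # True  (1 January: yearly tier)
print(is_retained(date(2019, 2, 1), today))   # False (monthly tier expired)
```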
🏦
The Metaphor
Azure Backup is like a bank safety deposit box for your data. The vault (Recovery Services Vault) stores copies of everything important. The policy (Backup Policy) defines how often you make copies and how long you keep each version. When disaster strikes, you go to the vault and retrieve exactly what you need — the whole box, a specific folder, or a single file.
Recovery Services Vault vs Backup Vault
| Feature | Recovery Services Vault | Backup Vault |
|---|---|---|
| VM backup | ✓ | ✗ |
| SQL Server backup | ✓ | ✗ |
| Azure Site Recovery | ✓ | ✗ |
| Blob backup (operational) | ✗ | ✓ |
| Disk backup | ✗ | ✓ |
| AKS backup | ✗ | ✓ |
| Status | Established, use for VMs | Newer service, growing |
📋
Vault Must Be in Same Region as Resource
Recovery Services Vault must be in the same region as the VMs you want to back up. You cannot back up an East US VM to a West US vault. Plan vault placement before deployment.
VM Backup — How It Works
Diagram: VM backup flow (snapshot → vault → restore)
⚠️
Soft Delete — Accidentally Deleted Backup Data
Azure Backup has soft delete enabled by default (14-day retention of deleted backup data). If you delete a VM's backup, the data isn't immediately purged — it sits in soft-deleted state for 14 days. You can recover it. To permanently delete: explicitly stop backup AND delete data, then wait 14 days OR disable soft delete first (not recommended for production).
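The 14-day window is simple date arithmetic. Below is a sketch with hypothetical helpers. It assumes the window is counted from the moment the backup data was deleted.

```python
from datetime import datetime, timedelta, timezone

SOFT_DELETE_RETENTION = timedelta(days=14)


def soft_delete_remaining(deleted_at: datetime, now: datetime) -> timedelta:
    """Hypothetical helper: time left to undelete; timedelta(0) once the data is purged."""
    purge_at = deleted_at + SOFT_DELETE_RETENTION
    return max(purge_at - now, timedelta(0))


def can_undelete(deleted_at: datetime, now: datetime) -> bool:
    return soft_delete_remaining(deleted_at, now) > timedelta(0)


deleted = datetime(2025, 3, 1, 10, 0, tzinfo=timezone.utc)
print(soft_delete_remaining(deleted, deleted + timedelta(days=3)))   # 11 days, 0:00:00
print(can_undelete(deleted, deleted + timedelta(days=15)))           # False
```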
Backup Policies & Retention
| Tier | Frequency | Retention | Use case |
|---|---|---|---|
| Daily snapshots | Once per day | 7–9999 days | Standard VM backup |
| Weekly | 1 per week | Up to 5163 weeks | Weekly restore point |
| Monthly | 1 per month | Up to 1188 months | Monthly compliance |
| Yearly | 1 per year | Up to 99 years | Long-term compliance |
| Enhanced policy | Multiple per day (every 4, 6, 8, or 12 hours) | Same as above | Low RPO requirements |
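The limits in the table translate directly into a validation check. The function below is a hypothetical pre-deployment sanity check, not an Azure API. The upper bounds and the 7-day daily minimum come from the table. A minimum of 1 for the weekly, monthly, and yearly tiers is an assumption.

```python
# Limits from the table above; the minimum of 1 for weekly/monthly/yearly is an assumption.
LIMITS = {
    "daily_days": (7, 9999),
    "weekly_weeks": (1, 5163),
    "monthly_months": (1, 1188),
    "yearly_years": (1, 99),
}


def validate_retention(policy: dict) -> list:
    """Return a list of problems; an empty list means every tier is within range."""
    problems = []
    for tier, value in policy.items():
        if tier not in LIMITS:
            problems.append(f"unknown tier: {tier}")
            continue
        low, high = LIMITS[tier]
        if not low <= value <= high:
            problems.append(f"{tier}={value} outside {low}-{high}")
    return problems


print(validate_retention({"daily_days": 30, "weekly_weeks": 52,
                          "monthly_months": 60, "yearly_years": 10}))  # []
print(validate_retention({"daily_days": 3, "yearly_years": 120}))
# ['daily_days=3 outside 7-9999', 'yearly_years=120 outside 1-99']
```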
ℹ️
Vault Redundancy Options
GRS (Geo-Redundant Storage): the default. Backup data is replicated to the paired region. Cross Region Restore is available on GRS vaults as an opt-in setting. Higher cost.
LRS (Locally-Redundant Storage): cheaper; data stays in one region. No cross-region restore.
ZRS (Zone-Redundant Storage): protects against a zone failure within the same region.
ASR continuously replicates VM disk writes to a target region. If primary region fails, you can fail over in minutes. Backup = restore from past. ASR = stay running with near-zero downtime.
Azure Site Recovery is your disaster recovery service, and the distinction from backup is fundamental. Backup takes snapshots at scheduled intervals. If your last backup ran at midnight and the primary region fails at 11:55 PM, you have lost almost 24 hours of data, and restoring from backup takes hours.
Azure Site Recovery instead continuously replicates your VM's disk writes to a secondary region. Every write is replicated asynchronously, typically within seconds to a minute. If the primary region fails, you fail over to the secondary region and are running again in minutes. Data loss is measured in seconds, not hours.
RPO and RTO define disaster recovery quality. Recovery Point Objective (RPO) is how much data you can afford to lose; ASR typically achieves an RPO under one minute. Recovery Time Objective (RTO) is how long it takes to restore service; with ASR and well-tested recovery plans, RTO is typically 15 to 30 minutes.
Architecture: in the primary region the VM is running, and the Site Recovery Mobility service extension on the VM captures disk writes. These writes are replicated asynchronously to the secondary region. There, ASR keeps replica managed disks and recovery points up to date. The target VM itself is created from those disks only when you fail over. When you trigger failover, Azure creates and starts the VM in the secondary region. DNS or Traffic Manager then redirects traffic to it, and service resumes.
There are three failover types:
- Test Failover is for DR drills. Azure creates a test VM from a recovery point in an isolated network, completely separate from production. You verify everything works, then clean up. Production is unaffected, so run test failovers regularly.
- Planned Failover is for scheduled migrations or maintenance. Primary and secondary are synchronised before a clean handoff, with minimal or zero data loss.
- Unplanned Failover is for actual disasters when the primary region is down. Data loss is possible up to the RPO.
After failing over you will eventually want to fail back to the primary region. ASR handles this by re-protecting the VM, which reverses the replication direction, and then failing back.
Remember: ASR continuously replicates; backup takes periodic snapshots. ASR's RPO is typically under one minute. Test Failover uses an isolated network and does not affect production. Always test your DR. Re-protect after failover to enable failback.
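Here is the midnight example as arithmetic. Data loss equals the gap between the last recoverable point and the failure. The sketch uses the times from the example, an arbitrary date, and an assumed one-minute replication lag standing in for ASR's typical RPO.

```python
from datetime import datetime, timedelta


def data_loss(last_recoverable: datetime, failure: datetime) -> timedelta:
    """Everything written after the last recoverable point is lost."""
    return failure - last_recoverable


failure = datetime(2025, 5, 1, 23, 55)      # region fails at 11:55 PM
last_backup = datetime(2025, 5, 1, 0, 0)    # nightly backup ran at midnight
asr_lag = timedelta(minutes=1)              # assumed replication lag (RPO)

print(data_loss(last_backup, failure))        # 23:55:00
print(data_loss(failure - asr_lag, failure))  # 0:01:00
```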
🔄
The Metaphor
Backup = taking a photo of your house every night. If it burns down, you can rebuild it from photos — but it takes days and you get yesterday's version.
ASR = having an identical second house always ready with a live mirror of everything happening in the first house. Your family can move in within minutes. You lose at most minutes of data.
Diagram: ASR continuous replication (primary region → secondary region)
Diagram: ASR failover (primary fails → secondary activated)
| Failover type | Use when | Impact |
|---|---|---|
| Test Failover | DR drill: validate recovery works | None; uses an isolated VNet, production unaffected |
| Planned Failover | Scheduled maintenance, region migration | Minimal; both sites sync before the switch |
| Unplanned Failover | Primary region disaster | Possible data loss up to the RPO (typically seconds to minutes) |
⚠️
Backup ≠ Disaster Recovery
Backup: protects against accidental deletion, corruption, ransomware. Restore takes hours/days. RPO = last backup. ASR: protects against regional failure. Failover in minutes. RPO = seconds/minutes.
You need BOTH. Backup covers user error. ASR covers infrastructure failure. They're complementary, not alternatives.
Update Manager assesses, deploys, and reports on OS patches across Azure VMs, on-premises servers, and Arc-enabled servers — from a single pane.
Azure Update Manager is the centrally managed patch management service for Azure VMs and Arc-enabled servers. The old approach combined a Log Analytics workspace with an Azure Automation account. You created both, linked them, onboarded each VM, and managed everything through the Automation account. It worked but was complex, and it was retired in August 2024.
Azure Update Manager is the replacement. It is built natively into the Azure platform and needs no Log Analytics workspace and no Automation account. You manage patches directly from the VM resource or from the Update Manager hub in the Azure portal. If an exam scenario describes update management without Log Analytics, the answer is Azure Update Manager.
Patch orchestration modes:
- AutomaticByOS (Windows only): the VM's operating system handles updates through Windows Update. Azure has no control over when patches are applied or when reboots happen, so the VM might reboot unexpectedly during business hours. On Linux, the comparable hands-off mode is ImageDefault, which keeps the image's own update configuration.
- AutomaticByPlatform: Azure orchestrates patching. Combine it with a schedule in Update Manager, such as a maintenance window on Sunday nights between 2am and 6am. Azure then applies pending updates during that window, with no surprise reboots during business hours.
- Manual: nothing happens automatically. You trigger patching on demand, with full visibility into exactly which patches are applied.
Assessment is a key feature. Update Manager scans your VMs and shows which patches are missing, what CVEs apply, and their severity ratings, all without applying anything. This gives you the current vulnerability posture across your entire fleet.
Cross-platform support: Update Manager works for Azure VMs and also for on-premises servers and VMs in other clouds connected via Azure Arc. That gives one pane of glass for patch compliance across your hybrid environment.
Remember: Update Manager replaces Automation Update Management and needs no workspace or Automation account. AutomaticByOS means the OS handles updates without Azure control. AutomaticByPlatform means Azure applies patches in your maintenance window. It works for Azure VMs and Arc-enabled servers.
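The Sunday 2am to 6am example can be expressed as a simple check. This is a hypothetical helper, not how Update Manager stores schedules; it uses maintenance configurations. The sketch assumes times are already in the window's time zone and treats the end time as exclusive.

```python
from datetime import datetime, time

SUNDAY = 6  # datetime.weekday(): Monday = 0 ... Sunday = 6


def in_maintenance_window(moment: datetime,
                          weekday: int = SUNDAY,
                          start: time = time(2, 0),
                          end: time = time(6, 0)) -> bool:
    """True if `moment` falls inside the weekly window [start, end) on `weekday`."""
    return moment.weekday() == weekday and start <= moment.time() < end


print(in_maintenance_window(datetime(2025, 6, 15, 3, 30)))  # True  (Sunday 03:30)
print(in_maintenance_window(datetime(2025, 6, 16, 3, 30)))  # False (Monday)
print(in_maintenance_window(datetime(2025, 6, 15, 6, 0)))   # False (window has closed)
```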
Key Features
Centralised Patch Management
Assessment: shows missing updates, CVEs, severity across all VMs.
Scheduled patching: define maintenance windows — patch automatically during approved hours.
On-demand patching: patch now without waiting for schedule.
Patch orchestration: control order of patches across multiple VMs (update rings).
Works with: Azure VMs, Azure Arc-enabled servers (on-prem, other clouds).
Patch Modes
How Updates Are Applied
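AutomaticByOS: VM's OS handles updates automatically (Windows Update). Azure has no control. Windows only; on Linux the hands-off default is ImageDefault.
AutomaticByPlatform: Azure orchestrates patching. Paired with an Update Manager schedule (maintenance configuration), patches are applied inside your maintenance windows with no manual intervention.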
Manual: you control everything — use Update Manager to trigger when needed.
For production: AutomaticByPlatform + maintenance window = patches applied in controlled slots, minimising unplanned reboots.
📋
Update Manager vs Log Analytics (Legacy)
Old approach: Update Management via Log Analytics workspace + Automation Account, retired in August 2024. New approach: Azure Update Manager, with no Log Analytics or Automation Account needed. It is a native Azure service, available directly on the VM resource. AZ-104 tests the new approach.
Azure Cost Management provides visibility into spend, budgets, alerts, and cost optimisation recommendations. It does NOT automatically stop resources — it only alerts.
The single most important fact about Azure Cost Management: budgets send alerts only. They cannot and do not automatically stop, pause, restrict, or delete Azure resources. A budget alert fires and Azure sends you an email. That is all. Your VMs keep running, your storage keeps accumulating, and your costs keep climbing. The budget itself has zero ability to touch your resources.
The exam scenario is always the same: a budget alert fires at 90% of the monthly budget; what happens to the running VMs? The correct answer is nothing. They keep running. For automated action when a budget threshold is hit, the budget alert must trigger an Action Group. That Action Group calls an Azure Function or Logic App, which programmatically stops or deallocates resources. The budget triggers the alert, the Action Group calls the automation, and the automation takes the action. Of those three separate steps, the budget does only the first.
Azure Cost Management does many useful things. The Cost Analysis view gives flexible, filterable breakdowns of your spending by subscription, resource group, resource type, service, tag, or time period.
Tags are essential for cost allocation. If every resource has CostCenter, Environment, and Project tags, your finance team can see exactly what each workload costs. Without tags, everything is one undifferentiated blob of Azure spend.
Azure Advisor analyses resource utilisation and makes specific recommendations. Examples include underused VMs, unused public IP addresses, and orphaned disks that are attached to no VM but still incur storage charges.
Reserved Instances are the most impactful cost reduction available. Commit to a 1-year or 3-year reservation for a VM and save up to 72% compared to pay-as-you-go. They are ideal for stable production workloads that run 24/7.
Azure Hybrid Benefit lets you apply existing Windows Server or SQL Server licences to Azure VMs. You then stop paying for the licence portion of the VM price, which can cut cost by roughly 40%.
Spot VMs use spare Azure compute capacity at up to 90% discount. The massive catch: Azure can evict Spot VMs with only 30 seconds' notice. Use them only for interruptible batch workloads, never for production.
Remember: budgets send alerts only; they cannot stop resources. Tags enable cost allocation. Reserved Instances save up to 72% for committed workloads. Spot VMs save up to 90% but can be evicted with 30 seconds' notice.
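Tag-based cost allocation is a group-by over cost records. The sketch runs over invented cost rows; the resource names and amounts are made up, and `CostCenter` is the tag from the example above. It shows how untagged spend ends up in one undifferentiated bucket.

```python
from collections import defaultdict

# Invented daily cost rows: (resource, tags, cost in USD).
costs = [
    ("vm-web-01", {"CostCenter": "CC-100", "Environment": "Prod"}, 42.50),
    ("sql-app-01", {"CostCenter": "CC-100", "Environment": "Prod"}, 30.00),
    ("vm-batch-07", {"CostCenter": "CC-200", "Environment": "Dev"}, 12.25),
    ("disk-orphan-3", {}, 5.75),  # no tags at all
]


def cost_by_tag(rows, tag):
    totals = defaultdict(float)
    for _resource, tags, cost in rows:
        totals[tags.get(tag, "(untagged)")] += cost
    return dict(totals)


print(cost_by_tag(costs, "CostCenter"))
# {'CC-100': 72.5, 'CC-200': 12.25, '(untagged)': 5.75}
```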
💰
The Metaphor
Cost Management is like your bank account dashboard with spending alerts. It shows what you've spent, predicts what you'll spend, and sends you a text when you hit 80% of your monthly limit. But it won't freeze your card automatically — that's your job to act on the alert.
Budgets
Alerts Only — No Auto-Stop
Budgets define a spending threshold and trigger alerts when reached.
Alert types:
Cost alert: fires when actual spend hits a % of the budget.
Forecast alert: fires when projected spend will exceed the budget.
What budgets CANNOT do: stop VMs, delete resources, deny new deployments. They ONLY alert. To take action you need Action Groups connected to automation.
Scope: Management Group, Subscription, Resource Group, or tagged resources.
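Actual and forecast alerts differ only in which number they compare against the budget. This minimal sketch assumes a naive straight-line forecast: spend so far divided by days elapsed, times days in the month. Azure's own forecasting is not described here and will differ. The function only returns alert names; like a real budget, it touches no resources.

```python
def budget_alerts(spend_to_date, day_of_month, days_in_month, budget,
                  thresholds_pct=(80, 90, 100)):
    """Return the alerts that would fire; nothing here stops or changes a resource."""
    forecast = spend_to_date * days_in_month / day_of_month  # naive linear projection
    alerts = []
    for pct in thresholds_pct:
        limit = budget * pct / 100
        if spend_to_date >= limit:
            alerts.append(f"actual {pct}%")
        if forecast >= limit:
            alerts.append(f"forecast {pct}%")
    return alerts


# $500 spent by day 12 of a 30-day month against a $1000 budget:
# actual spend is only 50%, but the projection is $1250.
print(budget_alerts(500.0, 12, 30, 1000.0))
# ['forecast 80%', 'forecast 90%', 'forecast 100%']
```

Forecast alerts give earlier warning. Here every threshold is already flagged on day 12, long before actual spend reaches 80%.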
Cost Optimisation
Reserved Instances (RI): commit to 1 or 3 years → up to 72% savings vs pay-as-you-go.
Azure Hybrid Benefit: use existing Windows Server / SQL Server licences on Azure VMs.
Spot VMs: up to 90% cheaper — but can be evicted with 30-second notice. For batch jobs, dev/test only.
Dev/Test pricing: cheaper rates for non-production subscriptions.
⚠️
Budget = Alert Only — Most Missed Exam Point
This is tested every exam sitting. "A budget alert fires at 80% spend — what happens to the VMs?" Answer: nothing automatically. VMs keep running. Only emails/SMS are sent. To stop VMs automatically when budget is exceeded: attach an Action Group that triggers an Azure Function or Logic App that stops/deallocates VMs. The budget itself has zero power to affect resources.
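The chain described here (budget alert, then Action Group, then Function or Logic App) can be sketched as separate pieces. Splitting it up makes it obvious that the budget on its own changes nothing. Everything here is hypothetical. `Vm`, `deallocate_all`, and `budget_check` stand in for the real VMs, the automation, and the budget; they are not Azure SDK calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Vm:
    name: str
    state: str = "running"


def deallocate_all(vms: list) -> None:
    """Stand-in for the Function / Logic App that actually stops the VMs."""
    for vm in vms:
        vm.state = "deallocated"


def budget_check(spend: float, budget: float, pct: int,
                 action_group: Optional[Callable[[], None]] = None) -> bool:
    """Step 1: the budget only decides whether the alert fires."""
    fired = spend >= budget * pct / 100
    if fired and action_group is not None:
        action_group()  # Step 2: the action group invokes the automation (step 3)
    return fired


vms = [Vm("vm-web-01"), Vm("vm-web-02")]

# Budget with alert-only notification: fires at 90%, VMs keep running.
print(budget_check(900.0, 1000.0, 90), [vm.state for vm in vms])
# True ['running', 'running']

# Same budget wired to an action group that calls the automation.
print(budget_check(900.0, 1000.0, 90, action_group=lambda: deallocate_all(vms)),
      [vm.state for vm in vms])
# True ['deallocated', 'deallocated']
```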
| Cost tool | What it does | Action? |
|---|---|---|
| Cost Analysis | Visualise and analyse spend by resource, tag, service | View only |
| Budgets | Set thresholds and send alerts | Alert only, no auto-action |
| Advisor | Recommendations to reduce cost and improve reliability | Manual action needed |
| Reservations | Pre-pay for 1 to 3 years for big discounts | Commitment purchase |
| Exports | Schedule cost data export to a Storage Account for BI tools | |
Monitor & Maintain questions often test subtle distinctions.
Q: A budget alert fires at 90% of a $1000 monthly budget. What happens to the running VMs?
ANSWER
Nothing. VMs continue running. Only alerts (email/SMS) are sent.
Budgets are alert mechanisms only. They have zero ability to stop, pause, or modify resources. To take automated action when a budget threshold is hit, configure the budget alert to trigger an Action Group, which runs an Azure Function or Logic App to stop VMs. ⚠️ This is the single most repeated exam question in Phase 5.
Q: A VM's NSG flow logs are not appearing in Log Analytics. Diagnostic Settings are configured. What is the most likely cause?
ANSWER
NSG Flow Logs are configured separately via Network Watcher, not Diagnostic Settings.
NSG Flow Logs require Network Watcher enabled in the region, plus a flow log configured for the NSG. The flow log must point to a Storage Account. It can optionally also reach Log Analytics through Traffic Analytics. Diagnostic Settings on the NSG itself only log NSG event and rule-counter data, not flow data. These are two separate logging mechanisms.
Q: What is the difference between Backup and ASR? When would you use each?
ANSWER
Backup: protects against data loss (deletion, corruption). Restore takes hours. Use for accidental deletion, ransomware, data corruption.
ASR: protects against infrastructure failure (region outage). Failover takes minutes. Use for business continuity when entire region fails.
They are complementary. Production environments need both: Backup for data protection + ASR for DR. RPO for ASR = seconds. RPO for Backup = last backup (hours).
Q: A VM backup was accidentally deleted. How can it be recovered?
ANSWER
If soft delete is enabled (default): the backup data is retained for 14 days in a soft-deleted state. Undelete from the Recovery Services Vault → Backup Items → find the soft-deleted item → Undelete.
If soft delete was disabled: permanently deleted, no recovery possible.
Soft delete is ON by default since 2020. Best practice: never disable it for production vaults.
Q: How do Metrics and Logs differ in Azure Monitor?
ANSWER
Metrics: numeric time-series, collected automatically, near real-time (1 min), stored 93 days, fast queries.
Logs: structured records, NOT collected automatically (need Diagnostic Settings), queried with KQL, stored in Log Analytics Workspace, 30 days free retention.
Use metrics for: dashboards, autoscale triggers, threshold alerts. Use logs for: audit trails, complex queries, security investigations.
Q: An alert rule fires but no notification is received. Alert history shows "Fired" status. What is missing?
ANSWER
The alert rule has no Action Group attached, or the Action Group has no notification configured.
An alert rule without an Action Group fires silently — the condition is evaluated and recorded but nothing is sent. Action Group must be attached to the alert rule AND configured with at least one action (email, SMS, etc.). Check: Alert Rule → Actions → Action Groups.
Phase 5 — Cheat Sheet
Budget alert fires, what happens to VMs? → Nothing: alerts only, no auto-action
Metrics retention → 93 days, automatic, no config needed
Log Analytics default retention → 30 days free (up to 2 years paid)