5.1 — Observability

Azure Monitor

Azure Monitor is the single umbrella service for all monitoring in Azure — metrics, logs, alerts, dashboards, and insights. Everything flows through it.

📖 Azure Monitor — Complete Explanation

Azure Monitor is the unified observability platform for everything running in Azure (and on-premises via Azure Arc). Before Azure Monitor, different services had separate monitoring tools — VMs had Log Analytics, App Service had Application Insights, Activity Log was separate. Azure Monitor is the umbrella that consolidates everything.

Two types of data — Metrics and Logs: Metrics are lightweight numeric measurements collected every minute automatically for every Azure resource. They answer "what is the current state?" (CPU is 75%, 3 requests/second). Logs are rich structured records capturing events, errors, and audit trails. They answer "what happened and why?" (Error 500 at 14:32:01, triggered by request from 203.x.x.x).

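Why Metrics and Logs are queried differently: Metrics go into a dedicated time-series database optimised for fast numeric retrieval and graphing, and you explore them with Metrics Explorer rather than KQL. Logs go into a Log Analytics workspace, where Kusto Query Language (KQL) runs complex aggregations and joins. Different stores for different purposes. (Platform metrics can also be routed to a workspace through Diagnostic Settings, at which point they become queryable with KQL.)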

The diagnostic settings gap: Azure resources generate platform metrics automatically — you always have CPU, memory, disk IOPS without any configuration. But resource-specific logs (what keys were accessed in Key Vault, which requests failed in App Service, which NSG rules matched traffic) are NOT collected by default. You must explicitly configure Diagnostic Settings on each resource to route these logs to a Log Analytics workspace, Storage Account, or Event Hub. This is a very common real-world mistake — organisations assume everything is being logged, then discover after a security incident that resource logs were never enabled.

Action Groups are reusable: One Action Group (e.g., "ag-kube-oncall") can be attached to 50 different alert rules. When any of those alerts fire, the same team gets notified. This is the correct design — not creating separate notification configs per alert.

🏥
The Metaphor

Azure Monitor is like a hospital monitoring system. Every patient (resource) has sensors collecting real-time vitals (Metrics) and a medical notes file (Logs). Doctors (alerts) watch for danger thresholds. When something goes wrong, the alarm sounds (alert fires) and the response team (Action Group) is automatically paged.

Azure Monitor — Data Flow Architecture
Data sources: Azure Resources, VMs (Agent), Applications, Activity Log, Guest OS (AMA)
Azure Monitor: Metrics (real-time) and Logs (Log Analytics)
Outputs & actions: Alerts → Action Group · Dashboards (Azure / Grafana) · Workbooks (interactive reports) · Insights (VM / Container / App)
Action Group targets: Email / SMS · Logic App / Webhook · Azure Function · ITSM / PagerDuty
Metrics: real-time, stored 93 days. Logs: query with KQL in Log Analytics, stored 30 days default (up to 2 years).

Metrics vs Logs — Key Difference

Metrics
Numerical Time-Series — Real-Time
Numeric values collected at regular intervals (CPU %, memory, requests/sec).

Stored in a dedicated time-series database for 93 days.

Fast — available within 1 minute. Use for dashboards, autoscale triggers, threshold alerts.

Platform metrics = free, automatic. Custom metrics = app code sends them.
Logs
Structured Records — Queryable with KQL
Text/structured records: events, errors, audit trails, performance counters.

Stored in Log Analytics Workspace. Retention: 30 days free, up to 2 years paid.

Queried with KQL (Kusto Query Language). Slower than metrics — seconds to minutes.

Resource logs (diagnostic logs) require Diagnostic Settings to route to workspace.

Log Analytics Workspace

Central Log Store
One Workspace — Many Sources
Log Analytics Workspace (LAW) = central store for all Azure logs. VMs, App Services, NSG Flow Logs, Activity Logs, Entra ID Sign-In logs — all sent here.

KQL example — find failed sign-ins:
SigninLogs | where ResultType != "0" | summarize count() by UserPrincipalName

Agents: Azure Monitor Agent (AMA) — current standard. Replaces legacy MMA/OMS agent. Must be installed on VMs to collect guest OS metrics and logs.
KQL — Useful Queries for AZ-104
// VMs with high CPU in last hour
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated > ago(1h)
| summarize avg(CounterValue) by Computer
| where avg_CounterValue > 80
| order by avg_CounterValue desc

// Failed sign-ins by user (ResultType is a string; "0" means success)
SigninLogs
| where ResultType != "0"
| summarize FailCount=count() by UserPrincipalName
| order by FailCount desc

// All activity in a resource group today (=~ compares case-insensitively)
AzureActivity
| where ResourceGroup =~ "RG-Kube-Prod"
| where TimeGenerated > startofday(now())
| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue
| order by TimeGenerated desc
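
To make the pipeline in the first query concrete, here is a minimal Python sketch of what `summarize avg(CounterValue) by Computer`, the `> 80` filter, and the descending sort compute. The sample rows and the `high_cpu` name are made up for illustration; in Azure the data comes from the Perf table and the work is done by KQL, not by code like this.

```python
from collections import defaultdict

def high_cpu(rows, threshold=80.0):
    """Average CounterValue per Computer, keep averages above threshold, sort descending."""
    totals = defaultdict(lambda: [0.0, 0])  # Computer -> [sum, count]
    for computer, value in rows:
        totals[computer][0] += value
        totals[computer][1] += 1
    averages = {c: s / n for c, (s, n) in totals.items()}
    hot = [(c, avg) for c, avg in averages.items() if avg > threshold]
    return sorted(hot, key=lambda item: item[1], reverse=True)

# Hypothetical samples: (Computer, CounterValue) pairs from the last hour
sample_rows = [("vm-a", 95.0), ("vm-a", 85.0), ("vm-b", 40.0),
               ("vm-c", 82.0), ("vm-c", 84.0)]
print(high_cpu(sample_rows))
```

Note that the filter runs after the aggregation, just as `where avg_CounterValue > 80` comes after `summarize` in the query: a single sample above 80 does not put a VM on the list unless its hourly average is above 80 too.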

Alerts & Action Groups

Alert Rule Components
Alert Rule = Scope + Condition + Action Group
Scope (what to monitor): Subscription / RG, resource type, or a specific resource
Condition (signal): Metric: CPU % > 90 for 5 min · Log: failed login count > 10 · Activity: VM deleted (any) · Health: resource goes down
Action Group: Email: kiran@kube.com · SMS: +91 98xxx · Logic App: create ITSM ticket · Function: auto-restart service
Alert Firing: CPU spike → email + auto-remediation
VM CPU 92% 🔴 → exceeds Alert Rule (CPU > 90% for 5 min) → FIRES ⚡ → Action Group ag-kube-prod notifies all: Email (Kiran), SMS alert, Function (scale out)
Action Groups are reusable: one AG can be attached to multiple alert rules.
📋
Alert Signal Types
Metric alert: fires when metric value crosses threshold. Near-real-time (1 min).
Log alert: runs KQL query on schedule, fires if result count exceeds threshold.
Activity log alert: fires on Azure management operations (VM deleted, RBAC changed).
Smart detection: Application Insights ML-based anomaly detection.
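
The "CPU > 90% for 5 min" condition and the reusable Action Group can be sketched in a few lines of Python. This is a simplified model, not Azure's implementation: it assumes the rule uses an average aggregation over one-minute samples, and the function names, the action group contents, and the e-mail address are all hypothetical.

```python
def metric_alert_fires(samples, threshold=90.0, window=5):
    """True when the average of the last `window` one-minute samples exceeds threshold."""
    if len(samples) < window:
        return False  # not enough data to evaluate the window yet
    recent = samples[-window:]
    return sum(recent) / window > threshold

def notify(action_group, alert_name):
    """Fan one fired alert out to every action in the (reusable) action group."""
    return [f"{kind} -> {target}: {alert_name}" for kind, target in action_group]

# Hypothetical action group; any number of alert rules could point at it
ag_kube_prod = [("email", "oncall@example.com"), ("function", "scale-out")]

cpu_samples = [70, 88, 91, 93, 92, 95, 94]  # made-up per-minute CPU %
if metric_alert_fires(cpu_samples):
    for line in notify(ag_kube_prod, "cpu-over-90"):
        print(line)
```

The design point is that `notify` knows nothing about the condition: the rule decides whether to fire, and the action group decides who hears about it, which is why one group can serve many rules.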

Diagnostic Settings

Route Resource Logs
Resource Logs Don't Flow Automatically — Must Configure
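By default, resource-specific logs (App Service logs, Key Vault access logs, NSG rule event and counter logs) are NOT sent anywhere. You must create Diagnostic Settings per resource to route them. NSG flow logs are a separate mechanism, configured through Network Watcher rather than Diagnostic Settings.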

Destinations:
Log Analytics Workspace — query with KQL, alerts, dashboards
Storage Account — long-term archival, compliance
Event Hub — stream to SIEM (Splunk, Sentinel, external tools)

Diagnostic Settings can send to multiple destinations simultaneously.
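
As a rough illustration of one setting fanning out to several destinations, the sketch below builds the kind of JSON payload a diagnostic setting carries. The property names (`workspaceId`, `storageAccountId`, `logs`, `category`, `enabled`) follow the diagnostic settings resource schema as I understand it; the resource IDs are placeholders, the helper name is hypothetical, and `AuditEvent` is used as the Key Vault audit category. Treat it as a sketch to check against the current schema, not a deployable template.

```python
import json

def diagnostic_setting(workspace_id=None, storage_account_id=None, log_categories=()):
    """Build a diagnostic-setting payload with one or more destinations."""
    if not (workspace_id or storage_account_id):
        raise ValueError("at least one destination is required")
    props = {"logs": [{"category": c, "enabled": True} for c in log_categories]}
    if workspace_id:
        props["workspaceId"] = workspace_id            # query with KQL
    if storage_account_id:
        props["storageAccountId"] = storage_account_id  # long-term archive
    return {"properties": props}

# Placeholder IDs: substitute real resource IDs from your subscription
setting = diagnostic_setting(
    workspace_id="<log-analytics-workspace-resource-id>",
    storage_account_id="<storage-account-resource-id>",
    log_categories=["AuditEvent"],
)
print(json.dumps(setting, indent=2))
```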
⚠️
Activity Log vs Resource Logs — Different Things
Activity Log: who did what to Azure resources (control plane). Automatically retained 90 days. Captures: VM created, RG deleted, RBAC assigned.

Resource Logs (Diagnostic Logs): what happened inside the resource (data plane). NOT retained automatically. Must configure Diagnostic Settings. Captures: Key Vault secret accessed, App Service 500 errors, NSG rule matches.
5.2 — Data Protection

Azure Backup

Azure Backup provides cloud-based backup for VMs, disks, files, SQL, SAP HANA, and Blob storage. Zero infrastructure to manage — backup data stored in a Recovery Services Vault or Backup Vault.

🏦
The Metaphor

Azure Backup is like a bank safety deposit box for your data. The vault (Recovery Services Vault) stores copies of everything important. The policy (Backup Policy) defines how often you make copies and how long you keep each version. When disaster strikes, you go to the vault and retrieve exactly what you need — the whole box, a specific folder, or a single file.

Recovery Services Vault vs Backup Vault

Feature | Recovery Services Vault | Backup Vault
VM backup | ✓ | ✗
SQL Server backup | ✓ | ✗
Azure Site Recovery | ✓ | ✗
Blob backup (operational) | ✗ | ✓
Disk backup | ✗ | ✓
AKS backup | ✗ | ✓
Status | Established, use for VMs | Newer service, growing
📋
Vault Must Be in Same Region as Resource
Recovery Services Vault must be in the same region as the VMs you want to back up. You cannot back up an East US VM to a West US vault. Plan vault placement before deployment.

VM Backup — How It Works

VM Backup Flow — Snapshot → Vault → Restore
Azure VM (OS + data disks, no agent needed) → snapshot → Recovery Point (incremental snapshot of managed disks) → transfer → Recovery Services Vault (GRS by default, multiple restore points)
Restore options: Full VM Restore (replace or new VM) · File Recovery (mount disk, grab files)
⚠️
Soft Delete — Accidentally Deleted Backup Data
Azure Backup has soft delete enabled by default (14-day retention of deleted backup data). If you delete a VM's backup, the data isn't immediately purged — it sits in soft-deleted state for 14 days. You can recover it. To permanently delete: explicitly stop backup AND delete data, then wait 14 days OR disable soft delete first (not recommended for production).
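
The 14-day window is simple date arithmetic, sketched below. The deletion date is made up, the function names are hypothetical, and treating the purge date itself as already unrecoverable is an assumption of this sketch rather than documented behaviour.

```python
from datetime import date, timedelta

SOFT_DELETE_DAYS = 14  # default soft-delete retention described above

def purge_date(deleted_on):
    """Date after which soft-deleted backup data is gone for good."""
    return deleted_on + timedelta(days=SOFT_DELETE_DAYS)

def can_undelete(deleted_on, today):
    # Assumption: the item counts as unrecoverable from the purge date onward.
    return today < purge_date(deleted_on)

deleted = date(2026, 3, 1)  # hypothetical deletion date
print(purge_date(deleted))
print(can_undelete(deleted, date(2026, 3, 10)))
print(can_undelete(deleted, date(2026, 3, 20)))
```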

Backup Policies & Retention

Tier | Frequency | Retention | Use Case
Daily snapshots | Once per day | 7–9999 days | Standard VM backup
Weekly | 1 per week | Up to 5163 weeks | Weekly restore point
Monthly | 1 per month | Up to 1188 months | Monthly compliance
Yearly | 1 per year | Up to 99 years | Long-term compliance
Enhanced policy | Multiple per day (hourly) | Same as above | Low RPO requirements
ℹ️
Vault Redundancy Options
GRS (Geo-Redundant Storage): default. Backup data is replicated to the paired region. Cross-region restore is available as an option you enable on the vault. Higher cost.
LRS (Locally-Redundant): cheaper, data stays in one region. No cross-region restore.
ZRS (Zone-Redundant): protects against zone failure within same region.
Azure CLI — Enable VM Backup
# Create Recovery Services Vault
az backup vault create \
  --resource-group RG-Prod \
  --name rsv-kube-prod \
  --location eastus

# Enable backup for a VM using the vault's built-in DefaultPolicy
az backup protection enable-for-vm \
  --resource-group RG-Prod \
  --vault-name rsv-kube-prod \
  --vm kube-vm-prod-01 \
  --policy-name DefaultPolicy

# Trigger on-demand backup
az backup protection backup-now \
  --resource-group RG-Prod \
  --vault-name rsv-kube-prod \
  --container-name IaasVMContainer;iaasvmcontainerv2;RG-Prod;kube-vm-prod-01 \
  --item-name VM;iaasvmcontainerv2;RG-Prod;kube-vm-prod-01 \
  --retain-until 15-12-2026
5.3 — Disaster Recovery

Azure Site Recovery (ASR)

ASR continuously replicates VM disk writes to a target region. If primary region fails, you can fail over in minutes. Backup = restore from past. ASR = stay running with near-zero downtime.

🔄
The Metaphor

Backup = taking a photo of your house every night. If it burns down, you can rebuild it from photos — but it takes days and you get yesterday's version.

ASR = having an identical second house always ready with a live mirror of everything happening in the first house. Your family can move in within minutes. You lose at most minutes of data.

ASR — Continuous Replication: Primary → Secondary Region
Primary Region (East US): VM running, writes every second, Production ✓
→ continuous async replication →
Secondary Region (West US): replica VM, not running (standby), ready to start. Recovery Services Vault (West US) stores recovery points.
RPO: typically < 1 minute. RTO: minutes (via recovery plan).
Failover types: Test Failover (non-disruptive) · Planned Failover · Unplanned Failover
ASR Failover: Primary fails → Secondary activated
East US VM 💥 DOWN (region failure detected) → Trigger Failover (unplanned or auto) → Recovery Plan runs steps in order → West US VM ✓ RUNNING from latest replicated state → Production restored
Typical timeline: RPO < 1 minute, RTO 15–30 minutes.
Always run Test Failover first: it uses an isolated network, has no impact on production, and validates that DR works.
Failover Type | Use When | Impact
Test Failover | DR drill: validate recovery works | None. Uses isolated VNet, production unaffected
Planned Failover | Scheduled maintenance, region migration | Minimal. Both sites sync before switch
Unplanned Failover | Primary region disaster | Possible data loss up to RPO (minutes)
⚠️
Backup ≠ Disaster Recovery
Backup: protects against accidental deletion, corruption, ransomware. Restore takes hours/days. RPO = last backup.
ASR: protects against regional failure. Failover in minutes. RPO = seconds/minutes.

You need BOTH. Backup covers user error. ASR covers infrastructure failure. They're complementary, not alternatives.
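
The RPO difference comes down to how old the newest recovery point is when the failure hits. The sketch below works that out for a nightly backup and for a continuously replicated ASR point; all timestamps are invented for illustration and `data_loss_window` is a hypothetical helper.

```python
from datetime import datetime

def data_loss_window(failure_time, last_recovery_point):
    """Writes made after the last recovery point are lost; this gap is the achieved RPO."""
    return failure_time - last_recovery_point

failure = datetime(2026, 1, 10, 14, 32, 0)
last_nightly_backup = datetime(2026, 1, 10, 2, 0, 0)   # hypothetical nightly backup
last_asr_point = datetime(2026, 1, 10, 14, 31, 30)     # hypothetical replicated point

print("Backup data loss:", data_loss_window(failure, last_nightly_backup))
print("ASR data loss:   ", data_loss_window(failure, last_asr_point))
```

With these numbers the backup loses about twelve and a half hours of writes while ASR loses thirty seconds, which is why neither tool replaces the other: the backup still holds older versions that replication has long since overwritten.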
5.4 — Patch Management

Azure Update Manager

Update Manager assesses, deploys, and reports on OS patches across Azure VMs, on-premises servers, and Arc-enabled servers — from a single pane.

Key Features
Centralised Patch Management
Assessment: shows missing updates, CVEs, severity across all VMs.

Scheduled patching: define maintenance windows — patch automatically during approved hours.

On-demand patching: patch now without waiting for schedule.

Patch orchestration: control order of patches across multiple VMs (update rings).

Works with: Azure VMs, Azure Arc-enabled servers (on-prem, other clouds).
Patch Modes
How Updates Are Applied
AutomaticByOS: VM's OS handles updates automatically (Windows Update). Azure has no control.

AutomaticByPlatform: Azure orchestrates patching during maintenance windows. No manual intervention needed.

Manual: you control everything — use Update Manager to trigger when needed.

For production: AutomaticByPlatform + maintenance window = patches applied in controlled slots, minimising unplanned reboots.
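
A toy model of the three patch modes helps show where the maintenance window matters. This is only a decision sketch under stated assumptions: the 01:00 to 04:00 window is hypothetical, the helper names are invented, and nothing here calls the Update Manager API.

```python
from datetime import datetime, time

def in_maintenance_window(now, start=time(1, 0), end=time(4, 0)):
    """True when `now` falls inside the approved daily window [start, end)."""
    return start <= now.time() < end

def platform_should_patch(patch_mode, now):
    """Would the platform start patching right now? (hypothetical helper)"""
    if patch_mode == "AutomaticByPlatform":
        return in_maintenance_window(now)
    # AutomaticByOS: the guest OS decides. Manual: an operator decides.
    return False

print(platform_should_patch("AutomaticByPlatform", datetime(2026, 1, 10, 2, 30)))
print(platform_should_patch("AutomaticByPlatform", datetime(2026, 1, 10, 12, 0)))
print(platform_should_patch("Manual", datetime(2026, 1, 10, 2, 30)))
```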
📋
Update Manager vs Log Analytics (Legacy)
Old approach: Update Management via Log Analytics workspace + Automation Account. This legacy approach has been retired. New approach: Azure Update Manager, which needs no Log Analytics workspace or Automation Account. It is a native Azure service that works directly on the VM resource. AZ-104 tests the new approach.
5.5 — Cost Governance

Cost Management

Azure Cost Management provides visibility into spend, budgets, alerts, and cost optimisation recommendations. It does NOT automatically stop resources — it only alerts.

💰
The Metaphor

Cost Management is like your bank account dashboard with spending alerts. It shows what you've spent, predicts what you'll spend, and sends you a text when you hit 80% of your monthly limit. But it won't freeze your card automatically — that's your job to act on the alert.

Budgets
Alerts Only — No Auto-Stop
Budgets define a spending threshold and trigger alerts when reached.

Alert types:
Cost alert: fires when actual spend hits % of budget
Forecast alert: fires when projected spend will exceed budget

What budgets CANNOT do: stop VMs, delete resources, deny new deployments. They ONLY alert. To take action you need Action Groups connected to automation.

Scope: Management Group, Subscription, or Resource Group. Filters (for example, by tag) can narrow a budget to specific resources.
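
The "alerts only" behaviour is easy to see in a sketch: the evaluation produces messages and nothing else. The thresholds, amounts, and the `budget_alerts` name are illustrative assumptions, not the Cost Management API.

```python
def budget_alerts(budget, actual, forecast, thresholds=(0.8, 1.0)):
    """Return alert messages only; nothing here stops or changes any resource."""
    alerts = []
    for t in thresholds:
        if actual >= budget * t:
            alerts.append(f"actual spend reached {t:.0%} of budget")
    if forecast > budget:
        alerts.append("forecast spend will exceed budget")
    return alerts

# Hypothetical month: $1000 budget, $850 spent so far, $1100 projected
print(budget_alerts(budget=1000, actual=850, forecast=1100))
```

Anything beyond the returned messages, such as deallocating VMs, would have to be done by automation that an Action Group triggers.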
Cost Optimisation
Reduce Azure Spend
Advisor recommendations: right-size underused VMs, delete unused resources.

Reserved Instances (RI): commit to 1 or 3 years → up to 72% savings vs pay-as-you-go.

Azure Hybrid Benefit: use existing Windows Server / SQL Server licences on Azure VMs.

Spot VMs: up to 90% cheaper — but can be evicted with 30-second notice. For batch jobs, dev/test only.

Dev/Test pricing: cheaper rates for non-production subscriptions.
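
To put the "up to" percentages in perspective, here is a small worked calculation. The $0.10/hour rate is invented, 730 hours per month is a common approximation, and the discounts are the best-case figures quoted above, so real savings will usually be smaller.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12

def monthly_cost(hourly_rate, discount=0.0):
    """Monthly cost of one always-on VM after applying a fractional discount."""
    return round(hourly_rate * HOURS_PER_MONTH * (1 - discount), 2)

payg = monthly_cost(0.10)                      # hypothetical pay-as-you-go rate
reserved_best_case = monthly_cost(0.10, 0.72)  # "up to 72%" with a reservation
spot_best_case = monthly_cost(0.10, 0.90)      # "up to 90%" with Spot

print(f"Pay-as-you-go: ${payg:.2f}")
print(f"Reserved (best case): ${reserved_best_case:.2f}")
print(f"Spot (best case): ${spot_best_case:.2f}")
```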
⚠️
Budget = Alert Only — Most Missed Exam Point
This is tested every exam sitting. "A budget alert fires at 80% spend — what happens to the VMs?" Answer: nothing automatically. VMs keep running. Only emails/SMS are sent. To stop VMs automatically when budget is exceeded: attach an Action Group that triggers an Azure Function or Logic App that stops/deallocates VMs. The budget itself has zero power to affect resources.
Cost Tool | What It Does | Action?
Cost Analysis | Visualise and analyse spend by resource, tag, service | View only
Budgets | Set thresholds and send alerts | Alert only, no auto-action
Advisor | Recommendations to reduce cost and improve reliability | Manual action needed
Reservations | Pre-pay for 1-3 years for big discounts | Commitment purchase
Exports | Schedule cost data export to Storage Account for BI tools | Automated data export
Exam Prep

Phase 5 — Exam Q&A

Monitor & Maintain questions often test subtle distinctions.

Q: A budget alert fires at 90% of a $1000 monthly budget. What happens to the running VMs?
ANSWER
Nothing. VMs continue running. Only alerts (email/SMS) are sent.

Budgets are alert mechanisms only. They have zero ability to stop, pause, or modify resources. To take automated action when budget threshold is hit: configure the budget alert to trigger an Action Group, which runs an Azure Function or Logic App to stop VMs. ⚠️ This is the single most repeated exam question in Phase 5.
Q: A VM's NSG flow logs are not appearing in Log Analytics. Diagnostic Settings are configured. What is the most likely cause?
ANSWER
NSG Flow Logs are configured separately via Network Watcher, not Diagnostic Settings.

NSG Flow Logs require: Network Watcher enabled in the region → NSG Flow Logs configured pointing to a Storage Account (required) and optionally to Log Analytics via Traffic Analytics. Diagnostic Settings on the NSG itself logs NSG audit events — not packet flow data. These are two separate logging mechanisms.
Q: What is the difference between Backup and ASR? When would you use each?
ANSWER
Backup: protects against data loss (deletion, corruption). Restore takes hours. Use for accidental deletion, ransomware, data corruption.

ASR: protects against infrastructure failure (region outage). Failover takes minutes. Use for business continuity when entire region fails.

They are complementary. Production environments need both: Backup for data protection + ASR for DR. RPO for ASR = seconds to minutes. RPO for Backup = time since the last backup (hours).
Q: A VM backup was accidentally deleted. How can it be recovered?
ANSWER
If soft delete is enabled (default): the backup data is retained for 14 days in a soft-deleted state. Undelete from the Recovery Services Vault → Backup Items → find the soft-deleted item → Undelete.

If soft delete was disabled: permanently deleted, no recovery possible.

Soft delete is ON by default since 2020. Best practice: never disable it for production vaults.
Q: How do Metrics and Logs differ in Azure Monitor?
ANSWER
Metrics: numeric time-series, collected automatically, near real-time (1 min), stored 93 days, fast queries.

Logs: structured records, queried with KQL, stored in Log Analytics Workspace, 30 days free retention. Resource logs are NOT collected automatically (they need Diagnostic Settings); the Activity Log is.

Use metrics for: dashboards, autoscale triggers, threshold alerts. Use logs for: audit trails, complex queries, security investigations.
Q: An alert rule fires but no notification is received. Alert history shows "Fired" status. What is missing?
ANSWER
The alert rule has no Action Group attached, or the Action Group has no notification configured.

An alert rule without an Action Group fires silently — the condition is evaluated and recorded but nothing is sent. Action Group must be attached to the alert rule AND configured with at least one action (email, SMS, etc.). Check: Alert Rule → Actions → Action Groups.
Phase 5 — Cheat Sheet
Budget alert fires: what happens to VMs? | Nothing. Alerts only, no auto-action
Metrics retention | 93 days, automatic, no config needed
Log Analytics default retention | 30 days free (up to 2 years paid)
Resource logs (Diagnostic Logs): automatic? | No, must configure Diagnostic Settings
Activity Log retention | 90 days automatic, then needs archival
Alert fires but no notification | No Action Group attached to alert rule
Alert rule components | Scope + Condition + Action Group
Backup vault must be in | Same region as VMs being backed up
Deleted backup: 14 day recovery | Soft delete (on by default), undelete from vault
Backup vs ASR | Backup = data loss protection. ASR = DR / region failure
ASR failover with no impact to prod | Test Failover, uses isolated VNet
ASR RPO typical | < 1 minute
Patch without Log Analytics/Automation | Azure Update Manager, new native service
Biggest VM cost savings commitment | Reserved Instances, up to 72% for 1-3 year
Spot VMs use case | Batch / dev-test only, can be evicted anytime