Azure Disaster Recovery Architecture: Active-Active vs Active-Passive for Malaysian Enterprises
Disaster recovery on Azure requires a fundamental architectural decision: should you run your workload in one region with a standby (active-passive) or in multiple regions simultaneously (active-active)? For Malaysian enterprises, this choice is complicated by data residency requirements for data hosted in Malaysia West, the absence of a formal Azure paired region for Malaysia West, and the cost sensitivity typical of the local market.
This article examines both architectures in detail, provides concrete implementation guidance, and helps you decide which is right for your specific RTO, RPO, and budget constraints.
Understanding the Regional Design Constraint
Azure offers Malaysia West as a primary region, but Microsoft lists Malaysia West with no Azure paired region. Southeast Asia (Singapore) is still a common nearby DR candidate, but it should not be described as Malaysia West's formal paired region.
This distinction matters because:
- Do not assume paired-region behavior — Microsoft documents benefits for formal Azure region pairs, such as staggered platform updates and service-specific geo-redundancy patterns, but those assumptions should be validated per service when the selected secondary is not a listed pair.
- Data residency is not automatic — Malaysia West to Southeast Asia is a cross-border DR design. Personal data replicated to Singapore may trigger Malaysia PDPA cross-border transfer obligations. You must classify the data, check contractual controls, and confirm whether consent or another permitted transfer basis applies.
- Latency must be measured for the workload — Malaysia-to-Singapore latency is often acceptable for web failover, but it should be benchmarked for each application. Avoid synchronous database write patterns across regions unless the application has been explicitly designed and tested for that latency.
In short: treat Southeast Asia as a practical secondary region candidate, not as an automatic paired-region target. The design decision must be based on service support, compliance, latency, and recovery objectives.
Active-Passive (Cold/Warm Standby) Architecture
In an active-passive architecture, one Azure region handles all production traffic. The DR region has infrastructure deployed and ready, but it remains idle — or running at minimal capacity — until failover is triggered.
How It Works
  Malaysia West             Southeast Asia (Singapore)
┌─────────────────┐       ┌──────────────────────┐
│ Active (100%)   │       │ Passive (Standby)    │
│                 │       │                      │
│ App Gateway     │       │ App Gateway (off)    │
│ App Service     │──────>│ App Service (cold)   │
│ SQL Database    │──────>│ SQL Database (DR)    │
│ Blob Storage    │──────>│ Blob Storage (GRS)   │
└─────────────────┘       └──────────────────────┘
Traffic flows to Malaysia West normally. Data is replicated to Southeast Asia asynchronously. During a disaster, you initiate failover: traffic is redirected, the standby resources are activated, and the application resumes from the DR region.
RTO and RPO Expectations
| Component | RTO | RPO |
|---|---|---|
| Azure SQL Database (Geo-restore) | Minutes to hours | Around 1 hour, depending on backup timing |
| Azure SQL Database (Active Geo-replication / failover groups) | Typically less than 60 seconds for database failover, excluding dependent application components | Asynchronous replication; data loss can be greater than zero if recent changes have not replicated |
| Azure VMs (Azure Site Recovery) | Depends on recovery plan, VM boot time, DNS/routing, and validation steps | Depends on configured replication and latest available recovery point |
| Azure Blob/Storage (GRS or RA-GRS) | Depends on account failover and application design | Geo-replication is asynchronous; Microsoft documents Block Blob geo priority replication RPO as less than or equal to 15 minutes where that feature applies |
| App Service (pre-provisioned DR deployment) | Depends on Traffic Manager/Front Door configuration and app warm-up | Varies by application state and deployment model |
The ranges are wide because RTO and RPO depend on the specific replication technologies you choose. Azure SQL failover groups and active geo-replication can provide fast database failover, but they still use asynchronous replication and do not remove the need to design the application, identity, storage, networking, and DNS layers for recovery.
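To make the point about dependent layers concrete, here is a minimal Python sketch that estimates end-to-end RTO as a sum of sequential recovery stages, where each stage takes as long as its slowest parallel step. The step names and durations are hypothetical placeholders for illustration, not measured Azure figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryStep:
    name: str
    minutes: float


def estimate_rto_minutes(stages: list[list[RecoveryStep]]) -> float:
    """Sum sequential stages; steps inside one stage run in parallel,
    so each stage costs as much as its slowest step."""
    return sum(max(step.minutes for step in stage) for stage in stages if stage)


# Hypothetical durations for illustration only; replace with failover test results.
stages = [
    [RecoveryStep("detect outage and decide to fail over", 10)],
    [
        RecoveryStep("database failover", 1),
        RecoveryStep("scale up standby App Service", 5),
    ],
    [RecoveryStep("switch routing and warm up the app", 3)],
    [RecoveryStep("smoke tests and sign-off", 6)],
]

estimated = estimate_rto_minutes(stages)
print(f"Estimated end-to-end RTO: {estimated:.0f} minutes")
```

In this made-up example the database failover takes one minute, yet the estimate is 24 minutes, because detection, compute scale-up, routing, and validation dominate the total.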
Cost Modeling: Active-Passive for a Web Application
A typical three-tier web application (web tier, API tier, database) in an active-passive DR setup:
| Resource | Primary Region | DR Region (Passive) |
|---|---|---|
| App Service / API compute | Production sizing | Stopped, scaled down, or pre-provisioned depending on RTO |
| SQL Database | Production sizing | Geo-secondary or failover group secondary sized to recovery/read needs |
| Azure Front Door / routing | Global service | Same global instance, with secondary origin configured |
| Azure Site Recovery | Protected VM scope | Replication, storage, and recovery-plan configuration |
| Storage | Primary redundancy choice | GRS/RA-GRS/GZRS or explicit replication pattern, subject to data residency |
Illustrative active-passive model: the DR footprint is usually materially cheaper than a fully active second region because some compute can be stopped, scaled down, or kept as infrastructure-as-code until failover. Treat any percentage as workload-specific rather than universal.
For pricing, validate the actual SKU, region, licensing benefit, backup retention, data transfer, and support plan in Azure Pricing Calculator before using the figures in a proposal or budget.
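As a rough illustration of how the DR footprint can be expressed relative to the primary region, the following Python sketch divides the DR monthly total by the primary monthly total. Every figure is an arbitrary placeholder in unspecified currency units, not an Azure price; the line items are assumptions chosen only to show the arithmetic.

```python
def dr_cost_ratio(primary_monthly: dict[str, float], dr_monthly: dict[str, float]) -> float:
    """Return the DR region's monthly cost as a fraction of the primary region's."""
    primary_total = sum(primary_monthly.values())
    if primary_total <= 0:
        raise ValueError("primary monthly cost must be positive")
    return sum(dr_monthly.values()) / primary_total


# Placeholder figures (arbitrary currency units), NOT Azure prices.
primary = {"compute": 1000.0, "sql": 800.0, "storage": 200.0}
warm_standby = {"compute": 200.0, "sql": 400.0, "storage": 200.0, "asr": 100.0}

ratio = dr_cost_ratio(primary, warm_standby)
print(f"DR footprint is {ratio:.0%} of the primary region's monthly cost")
```

Feeding the function with line items exported from the Azure Pricing Calculator keeps the ratio tied to the real SKUs rather than to a rule of thumb.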
Implementation: Active-Passive with Azure Site Recovery
Azure Site Recovery (ASR) orchestrates the replication, failover, and failback of Azure VMs between regions.
# Implementation note:
# Azure Site Recovery for Azure-to-Azure VM replication is usually configured
# from the Azure portal or automation generated after the Recovery Services
# vault, fabric, protection container, policy, network mapping, and protected
# item relationships exist. Do not treat the following as a copy-paste script.
# Verify the Site Recovery extension is available in the installed Azure CLI.
az extension add --name site-recovery
az site-recovery --help
# Recommended operational pattern:
# 1. Create or select a Recovery Services vault in the DR region.
# 2. Enable replication for each VM and confirm target network, subnet,
# VM size, disk type, and availability-zone settings.
# 3. Build a recovery plan grouping app, API, and database tiers.
# 4. Run test failover into an isolated VNet at least quarterly.
# 5. Record actual RTO/RPO from the test, not just design targets.
Implementation: Active-Passive with Azure SQL Geo-Replication
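ASR protects VMs, not Azure SQL Database, which is a PaaS service. For the database tier, configure active geo-replication on Azure SQL Database:
-- Run in the master database on the primary server (Malaysia West)
ALTER DATABASE [app-db]
ADD SECONDARY ON SERVER [sql-srv-southeastasia]
WITH (ALLOW_CONNECTIONS = ALL, SECONDARY_TYPE = GEO,
SERVICE_OBJECTIVE = 'GP_Gen5_2');
-- Planned failover: promote the secondary to primary
-- (run in the master database on the secondary server).
-- This synchronizes first, so the primary must still be reachable.
ALTER DATABASE [app-db] FAILOVER;
-- If the primary is unavailable, force the failover and accept possible data loss:
-- ALTER DATABASE [app-db] FORCE_FAILOVER_ALLOW_DATA_LOSS;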
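With active geo-replication alone, the application must be pointed at the new primary server after failover; the SQL driver does not redirect connections on its own. A failover group adds listener endpoints that stay stable across failover. To offload read-only queries, connect them to the readable secondary and set ApplicationIntent=ReadOnly in that connection string.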
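A minimal Python sketch of how an application might keep separate read-write and read-only connection strings follows. The helper name, the primary server host name, and the choice of ODBC Driver 18 are assumptions for illustration; authentication settings are deliberately omitted and must be added for a real connection.

```python
def build_odbc_connection_string(server_host: str, database: str, read_only: bool = False) -> str:
    """Build an ODBC connection string; adds ApplicationIntent=ReadOnly for read workloads.

    Authentication keywords are intentionally left out of this sketch.
    """
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{server_host},1433",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


# Hypothetical host names: only the secondary server name appears in the SQL above.
transactional = build_odbc_connection_string(
    "sql-srv-malaysiawest.database.windows.net", "app-db"
)
reporting = build_odbc_connection_string(
    "sql-srv-southeastasia.database.windows.net", "app-db", read_only=True
)
print(transactional)
print(reporting)
```

Keeping the two strings in configuration rather than in code means a failover only requires changing the host names, not redeploying the application.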
Active-Active (Geo-Redundant) Architecture
In an active-active architecture, both regions handle traffic simultaneously. Traffic is load-balanced across regions using Azure Traffic Manager or Azure Front Door. Each region runs identical infrastructure and serves user requests.
How It Works
                ┌───────────────────────┐
                │   Azure Front Door    │
                │ (Global Load Balancer)│
                └───────────┬───────────┘
                            │
              ┌─────────────┴─────────────┐
              │                           │
        Malaysia West               Southeast Asia
┌──────────────────┐       ┌──────────────────────┐
│ App Service      │       │ App Service          │
│ (50% traffic)    │       │ (50% traffic)        │
│                  │       │                      │
│ SQL (Read-Write) │──────>│ SQL (Read-Only)      │
│ Blob (LRS)       │       │ Blob (LRS)           │
└──────────────────┘       └──────────────────────┘
For active-active to work, the application must be stateless (session state must go to Redis or Cosmos DB), and the database must support multi-region writes or accept that writes happen in one region and are replicated to the other.
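The statelessness requirement can be sketched in Python as a request handler that reads and writes session data only through an external store. The `SessionStore` interface, the `InMemoryStore` stand-in, and the `add_to_cart` handler are all hypothetical; in a real deployment the store would be a service such as Redis or Cosmos DB that both regions can reach, and that store needs its own cross-region replication design.

```python
from typing import Protocol


class SessionStore(Protocol):
    def get(self, session_id: str) -> dict: ...
    def put(self, session_id: str, data: dict) -> None: ...


class InMemoryStore:
    """Stand-in for a shared external store such as Redis or Cosmos DB."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        return dict(self._data.get(session_id, {}))

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)


def add_to_cart(store: SessionStore, session_id: str, item: str) -> list[str]:
    """Handle a request without keeping any state in the web process."""
    session = store.get(session_id)
    cart = session.get("cart", []) + [item]
    store.put(session_id, {**session, "cart": cart})
    return cart


shared_store = InMemoryStore()
# The two calls stand in for requests served by different regions.
add_to_cart(shared_store, "user-1", "book")
cart = add_to_cart(shared_store, "user-1", "pen")
print(cart)
```

Because the handler holds nothing between calls, it does not matter which region serves the second request, provided both regions see the same store.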
RTO and RPO Expectations
| Component | RTO | RPO |
|---|---|---|
| Azure Front Door routing | Often fast when origins and health probes are already configured | N/A |
| Azure SQL active geo-replication / failover group | Microsoft documents database disaster-recovery RTO as typically less than 60 seconds for customer-managed failover, excluding dependent components | Asynchronous replication; possible data loss for unreplicated recent writes |
| Application state | Near-zero only if the application is stateless or uses a region-aware state store | Depends on state-store replication model |
| DNS propagation | Avoided if Azure Front Door or another global routing layer remains the stable entry point | N/A |
Active-active can reduce infrastructure recovery time because compute is already online in the secondary region. However, database failover, replication lag, identity, secrets, dependent services, and application consistency still determine the real business RTO/RPO.
Cost Modeling: Active-Active for the Same Web Application
| Resource | Primary Region | DR Region (Active) |
|---|---|---|
| App Service / API compute | Production capacity | Production or near-production capacity |
| SQL Database | Primary database | Secondary database / failover group replica sized for recovery and read traffic |
| Azure Front Door / routing | Global service | Same global instance with both origins active or priority-routed |
| Redis / application state | Primary state tier | Region-aware state design or replicated/cache-warm strategy |
| Storage | Primary storage pattern | Secondary storage pattern selected for consistency, latency, and compliance |
Illustrative active-active model: expect a materially higher monthly cost than active-passive because compute, observability, security, and operational testing run in both regions. The premium is workload-specific and should be priced in Azure Pricing Calculator before committing to a budget.
The extra cost comes from running more resources continuously, not from a fixed universal percentage.
Implementation: Active-Active with Azure Front Door and SQL Auto-Failover
# Azure Front Door with priority routing (primary, then secondary)
az afd profile create \
--resource-group rg-global \
--profile-name afd-wenfeng \
--sku Premium_AzureFrontDoor \
--location global
az afd endpoint create \
--resource-group rg-global \
--profile-name afd-wenfeng \
--endpoint-name api-global \
--enabled-state Enabled
az afd origin-group create \
--resource-group rg-global \
--profile-name afd-wenfeng \
--origin-group-name origins-southeastasia \
--probe-request-type GET \
--probe-protocol Http \
--probe-interval-in-seconds 30 \
--probe-path /health
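# Add Malaysia West as primary origin.
# --origin-host-header is set so App Service receives its own host name.
az afd origin create \
--resource-group rg-global \
--profile-name afd-wenfeng \
--origin-group-name origins-southeastasia \
--origin-name app-malaysiawest \
--host-name app-wenfeng-malaysiawest.azurewebsites.net \
--origin-host-header app-wenfeng-malaysiawest.azurewebsites.net \
--priority 1
# Add Southeast Asia as secondary origin.
# With --priority 2 this origin only receives traffic when the priority-1
# origin is unhealthy (hot standby). To split traffic across both regions
# (active-active), give both origins --priority 1 and balance them with --weight.
az afd origin create \
--resource-group rg-global \
--profile-name afd-wenfeng \
--origin-group-name origins-southeastasia \
--origin-name app-southeastasia \
--host-name app-wenfeng-southeastasia.azurewebsites.net \
--origin-host-header app-wenfeng-southeastasia.azurewebsites.net \
--priority 2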
# SQL failover group for database-level failover (managed via Azure CLI)
az sql failover-group create \
--resource-group rg-db \
--server sql-malaysiawest \
--partner-server sql-southeastasia \
--name fg-app-db \
--failover-policy Automatic \
--grace-period 1 \
--add-db app-db
# Note: --grace-period is in hours and the minimum is 1. With the Automatic
# policy, failover is not initiated until the outage has lasted at least that
# long, and it can still lose unreplicated writes. For faster recovery,
# initiate a forced failover yourself (possible data loss):
#   az sql failover-group set-primary --resource-group rg-db \
#     --server sql-southeastasia --name fg-app-db --allow-data-loss
# The failover group provides a single read-write listener endpoint
# and a read-only listener endpoint:
# Read-Write: fg-app-db.database.windows.net
# Read-Only: fg-app-db.secondary.database.windows.net
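Front Door can only fail over correctly if the /health path used by the probe reflects the real state of the region. A minimal Python sketch of such an endpoint follows, assuming a probe that treats HTTP 200 as healthy; the check names and the always-true lambdas are hypothetical placeholders for real dependency checks.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def evaluate_health(checks: dict[str, Callable[[], bool]]) -> tuple[int, str]:
    """Return (HTTP status, JSON body); 503 if any dependency check fails or raises."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps({"healthy": status == 200, "checks": results})


# Hypothetical checks; real ones would query the database, cache, and so on.
CHECKS: dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the endpoint; not called here so the sketch has no side effects."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Returning 503 when a dependency is down lets the probe mark the origin unhealthy instead of routing users to a region whose app is up but whose database is not.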
Malaysian Enterprise Considerations
1. Data Residency and PDPA
Malaysia's PDPA 2010 restricts transfers of personal data to places outside Malaysia unless an allowed transfer basis applies, such as consent, contractual necessity, or taking reasonable precautions and due diligence to ensure the data will not be processed in a way that would contravene the PDPA. This has direct implications:
- Active-passive with ASR — if data is replicated to Southeast Asia (Singapore) for DR purposes, you may be transferring personal data out of Malaysia. Mitigate by: (a) identifying which data is personal data and excluding it from replication where feasible, (b) confirming the lawful transfer basis with legal counsel, and (c) using appropriate Microsoft contractual, security, and organizational controls.
- Active-active with geo-replication — the same concern applies but is magnified because data is actively written in both regions. A stricter data classification exercise is required.
- Mitigation strategies:
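- Use Azure Policy to require a data-classification tag on resources, and to audit or deny geo-redundant replication settings on resources tagged as containing personal data. Azure Policy enforces the tagging and configuration rules; it does not discover personal data for you.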
- Implement a data classification framework before designing your DR architecture — separate personal data from non-personal data at the application level.
- Consider keeping personal data in a separate database in Malaysia West only, while replicating non-personal application data for DR.
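The classification-first approach above can be sketched in Python as a simple replication plan: only data stores explicitly labeled non-personal are eligible for cross-border replication. The store names and the two classification labels are hypothetical, and the sketch is not a substitute for legal review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    name: str
    classification: str  # hypothetical labels: "personal" or "non-personal"


def plan_replication(stores: list[DataStore]) -> dict[str, list[str]]:
    """Split data stores into those replicated to the DR region and those kept local."""
    plan: dict[str, list[str]] = {"replicate_to_dr": [], "keep_in_malaysia_west": []}
    for store in stores:
        if store.classification == "non-personal":
            plan["replicate_to_dr"].append(store.name)
        else:
            # Unknown or missing labels are treated like personal data (fail closed).
            plan["keep_in_malaysia_west"].append(store.name)
    return plan


# Hypothetical inventory for illustration.
inventory = [
    DataStore("product-catalog-db", "non-personal"),
    DataStore("customer-profile-db", "personal"),
    DataStore("legacy-export-storage", "unclassified"),
]

plan = plan_replication(inventory)
print(plan)
```

Failing closed on unlabeled stores means a missed classification results in a local-only copy rather than an unintended cross-border transfer.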
2. Network Latency
Malaysia West to Southeast Asia (Singapore) latency should be measured from the actual application network path. For architecture planning, assume it can materially affect synchronous write paths and validate with Azure Network Watcher, application telemetry, or direct synthetic tests.
| Workload Type | Cross-region latency impact |
|---|---|
| Static web page serving | Usually low if content is cached or served from the nearest edge |
| API requests with multiple back-end round trips | Can become noticeable; measure end-to-end transaction latency |
| Real-time collaboration | Potentially problematic without region-aware design |
| Database synchronous replication | Avoid unless the platform and application explicitly support the latency profile |
| Video/voice conferencing | Usually better handled with edge/media services rather than cross-region application round trips |
For active-active architectures, replication lag is a critical metric. Azure SQL active geo-replication is asynchronous, so design for possible lag and conflict/consistency behavior rather than assuming near-synchronous writes.
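One way to turn replication lag into a pass/fail signal is to compare a high percentile of observed lag with the RPO target. The Python sketch below uses a nearest-rank percentile; the sample values are invented, and in practice they might come from a source such as the replication_lag_sec column of sys.dm_geo_replication_link_status, which is an assumption about your monitoring setup.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct between 0 and 100)."""
    if not samples:
        raise ValueError("no samples provided")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def lag_within_rpo(lag_seconds: list[float], rpo_seconds: float, pct: float = 95) -> bool:
    """True if the chosen percentile of replication lag is within the RPO target."""
    return percentile(lag_seconds, pct) <= rpo_seconds


# Invented lag samples in seconds; one spike dominates the tail.
lag_samples = [2, 3, 2, 4, 3, 5, 2, 3, 40, 3]

print("median lag:", percentile(lag_samples, 50), "seconds")
print("p95 lag:", percentile(lag_samples, 95), "seconds")
print("meets 30-second RPO:", lag_within_rpo(lag_samples, rpo_seconds=30))
```

Judging by the tail rather than the average matters here: the median in this sample is 3 seconds, but the spike to 40 seconds is what determines how much data could be lost.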
3. Cost Sensitivity
Many Malaysian SMEs are cost-sensitive, so the total cost delta between active-passive and active-active is significant and should be modeled explicitly:
- Active-passive (warm standby): typically lower because some DR compute can be scaled down or kept inactive
- Active-active: typically higher because both regions run production-grade capacity, monitoring, security, and operational processes
- Annual difference: calculate from the actual workload, region, licensing, support, network egress, backup, and operational staffing assumptions
For many SMEs, the active-active premium is only justified when the business impact of downtime clearly exceeds the additional run cost and operational complexity.
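The justification test above reduces to a break-even calculation: how many hours of downtime per year would active-active need to avoid to pay for itself? A minimal Python sketch follows; both input figures are placeholders in arbitrary currency units, not estimates for any real workload.

```python
def break_even_downtime_hours(annual_premium: float, downtime_cost_per_hour: float) -> float:
    """Hours of avoided downtime per year needed to offset the active-active premium."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime cost per hour must be positive")
    return annual_premium / downtime_cost_per_hour


# Placeholder figures (arbitrary currency units), for illustration only.
annual_premium = 120_000.0          # extra yearly cost of active-active over active-passive
downtime_cost_per_hour = 20_000.0   # estimated business impact of one hour of outage

hours = break_even_downtime_hours(annual_premium, downtime_cost_per_hour)
print(f"Active-active breaks even at {hours:.1f} hours of avoided downtime per year")
```

If the realistic difference in yearly downtime between the two architectures is well below the break-even figure, the premium is hard to justify on cost alone.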
4. Operational Capability
Active-active requires more operational maturity. You must:
- Maintain identical infrastructure in two regions.
- Test and validate traffic routing regularly.
- Manage application state to be region-agnostic.
- Have staff available to respond to regional incidents.
- Monitor replication lag and DNS propagation.
Few Malaysian SMEs have dedicated DR teams. Azure Site Recovery's one-click failover test capability is a significant advantage for teams with limited bandwidth.
Decision Framework: Which Architecture for Which Scenario?
Active-Passive Is the Right Choice When:
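- Your RTO requirement is 15 minutes or more. Many business applications can tolerate 30-60 minutes of downtime, and a tested warm standby can typically meet that; confirm it with a failover test rather than assuming it.
- Your budget is constrained. A passive DR footprint usually costs a fraction of the primary region rather than a full second copy, but the exact ratio is workload-specific and should be priced for your own SKUs.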
- Your IT team is small (1-3 people). Active-passive is simpler to operate. You can test failover quarterly without deep architectural knowledge.
- Your workload is not latency-sensitive across regions. If users are primarily in Malaysia, having the DR region in Singapore is fine for failover scenarios.
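- You want to limit PDPA cross-border exposure. You can keep personal data in Malaysia West only and replicate only non-personal data to the DR region, accepting that data excluded from replication cannot be recovered from the DR region.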
Active-Active Is the Right Choice When:
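- Your RTO requirement is measured in minutes rather than tens of minutes. Because compute is already serving traffic in both regions, recovery is bounded mainly by database failover. A forced failover that you initiate typically completes in under a minute but can lose unreplicated writes. Automatic failover does not start until the grace period (minimum 1 hour) has elapsed, and it can also lose data.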
- Your workload has users in multiple regions. If you serve customers in Malaysia, Singapore, and Indonesia simultaneously, active-active serves them from the nearest region.
- Your SLA requires very high availability across regional failures. Active-active can support stronger regional resilience targets, but the actual SLA depends on the complete architecture, service SKUs, application design, and operational process.
- Your budget supports the additional run cost. If the business impact of downtime is high enough, the added active-active cost may be justified by avoided outage impact.
- Your team has DevOps capability. You need the operational maturity to maintain symmetric deployments across two regions.
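The two checklists can be condensed into a small Python sketch. The 15-minute RTO and the team size of three come from the rules of thumb in this section; the function and its field names are hypothetical and encode only a first-pass recommendation, not a complete assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    rto_minutes: float
    team_size: int
    users_in_multiple_regions: bool
    budget_supports_second_region: bool


def recommend_architecture(req: Requirements) -> str:
    """First-pass recommendation based on this section's rules of thumb."""
    wants_active_active = req.rto_minutes < 15 or req.users_in_multiple_regions
    can_operate = req.team_size > 3 and req.budget_supports_second_region
    if wants_active_active and can_operate:
        return "active-active"
    if wants_active_active:
        return "active-passive (review: requirements exceed budget or team capacity)"
    return "active-passive"


sme = Requirements(rto_minutes=60, team_size=2,
                   users_in_multiple_regions=False,
                   budget_supports_second_region=False)
regional = Requirements(rto_minutes=5, team_size=8,
                        users_in_multiple_regions=True,
                        budget_supports_second_region=True)

print(recommend_architecture(sme))
print(recommend_architecture(regional))
```

The middle outcome is deliberate: when the requirements point to active-active but the budget or team cannot sustain it, the sketch flags the mismatch instead of recommending an architecture the organization cannot operate.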
The Compromise: Active-Passive with Readable DR
A hybrid approach that many enterprises overlook: configure active-passive but keep the DR database readable. Users who explicitly opt in (via a "read-only mode" link or during maintenance windows) can query the DR region without impacting production. This gives you:
- The cost structure of active-passive, subject to the actual secondary database and read workload sizing
- The ability to run read-heavy reporting queries against the DR database without impacting production
- Warmer standby — if failover is needed, the database is already online and can be promoted in minutes
Failover Testing Playbook
Regardless of which architecture you choose, test your failover at least quarterly. Here is a practical playbook:
Pre-Test (Week Before)
- Notify stakeholders: "We are conducting a scheduled DR test on [date]. Expected [n] hours of read-only mode."
- Take a full backup of critical databases (as a safety net).
- Review the last test's findings and confirm all remediation items are complete.
Test Day
- Active-passive test:
# Initiate planned failover via ASR: trigger it through the ARM REST API
# (no direct CLI command for this operation)
az rest --method POST \
--uri "https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/rg-dr/providers/Microsoft.RecoveryServices/vaults/asr-vault-southeastasia/replicationRecoveryPlans/dr-plan-web-app/plannedFailover?api-version=2024-10-01" \
--body '{ "properties": { "failoverDirection": "PrimaryToRecovery" } }'
# Verify application is accessible from DR region
curl -f https://dr-app-wenfeng.azurewebsites.net/health
# Promote the SQL secondary to primary (failover group; planned failover without data loss)
az sql failover-group set-primary \
--resource-group rg-db \
--server sql-southeastasia \
--name fg-app-db
- Run application smoke tests against the DR endpoint.
- Measure actual RTO and RPO against your SLAs.
Post-Test
- Fail back to the primary region.
- Document actual RTO/RPO achieved vs. targets.
- Identify gaps: "Restored VM had outdated SSL certificate," "DNS TTL caused 10-minute propagation delay."
- Fix gaps before the next test.
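Recording the achieved figures can be as simple as subtracting timestamps captured during the test. The Python sketch below shows one way to do it; the timestamps and the 30-minute and 5-minute targets are hypothetical examples, not recommendations.

```python
from datetime import datetime, timedelta


def achieved_rto(failover_started: datetime, service_restored: datetime) -> timedelta:
    """Time from the start of the (simulated) outage until users are served again."""
    return service_restored - failover_started


def achieved_rpo(failover_started: datetime, last_replicated_write: datetime) -> timedelta:
    """Window of writes that would be lost: outage start minus last replicated write."""
    return failover_started - last_replicated_write


# Hypothetical timestamps captured during a test.
failover_started = datetime(2025, 1, 15, 10, 0, 0)
last_replicated_write = datetime(2025, 1, 15, 9, 59, 30)
service_restored = datetime(2025, 1, 15, 10, 22, 0)

rto = achieved_rto(failover_started, service_restored)
rpo = achieved_rpo(failover_started, last_replicated_write)

# Hypothetical targets for illustration.
rto_target = timedelta(minutes=30)
rpo_target = timedelta(minutes=5)

meets_targets = rto <= rto_target and rpo <= rpo_target
print(f"RTO achieved: {rto}, RPO achieved: {rpo}, targets met: {meets_targets}")
```

Storing these values after every test gives a trend line, which makes a regression between quarterly tests visible before a real incident does.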
Key Takeaways
- Active-passive is the default for many Malaysian enterprises — it can provide practical recovery objectives at materially lower cost than fully active-active, provided failover is tested.
- Active-active is for high-availability requirements — it can reduce recovery time, but the cost premium and operational burden are workload-specific.
- Data residency complicates active-active — Malaysia's PDPA requires careful evaluation of what data crosses the border to Singapore.
- Test failover quarterly without exception — a DR plan never tested is not a DR plan. Use Azure Site Recovery's test failover in an isolated VNet.
- Azure Site Recovery simplifies active-passive — automated replication, one-click failover testing, and built-in recovery plans make it accessible to teams without dedicated DR expertise.
- The hybrid approach (active-passive with readable DR database) is a practical middle ground — warm standby cost structure with operational flexibility.
DR architecture is not a one-size-fits-all decision. I review and design disaster recovery strategies for Malaysian enterprises — covering Azure Site Recovery, active-passive, active-active, and region-pair failover patterns. Message me on LinkedIn.