Azure Disaster Recovery Architecture: Active-Active vs Active-Passive for Malaysian Enterprises

Disaster recovery on Azure requires a fundamental architectural decision: should you run your workload in one region with a standby in another (active-passive), or in multiple regions simultaneously (active-active)? For Malaysian enterprises, this choice is complicated by data residency requirements tied to Malaysia West, the absence of a formal Azure paired region for it, and the cost sensitivity typical of the local market.

This article examines both architectures in detail, provides concrete implementation guidance, and helps you decide which is right for your specific RTO, RPO, and budget constraints.

Understanding the Regional Design Constraint

Azure offers Malaysia West as a region, but Microsoft lists it with no Azure paired region. Southeast Asia (Singapore) is a common nearby DR candidate, but it should not be described as Malaysia West's formal paired region.

This distinction matters because:

Do not assume paired-region behavior — Microsoft documents benefits for formal Azure region pairs, such as staggered platform updates and service-specific geo-redundancy patterns, but those assumptions should be validated per service when the selected secondary is not a listed pair.
Data residency is not automatic — Malaysia West to Southeast Asia is a cross-border DR design. Personal data replicated to Singapore may trigger Malaysia PDPA cross-border transfer obligations. You must classify the data, check contractual controls, and confirm whether consent or another permitted transfer basis applies.
Latency must be measured for the workload — Malaysia-to-Singapore latency is often acceptable for web failover, but it should be benchmarked for each application. Avoid synchronous database write patterns across regions unless the application has been explicitly designed and tested for that latency.

In short: treat Southeast Asia as a practical secondary region candidate, not as an automatic paired-region target. The design decision must be based on service support, compliance, latency, and recovery objectives.

Active-Passive (Cold/Warm Standby) Architecture

In an active-passive architecture, one Azure region handles all production traffic. The DR region has infrastructure deployed and ready, but it remains idle — or running at minimal capacity — until failover is triggered.

How It Works

    Malaysia West              Southeast Asia (Singapore)
    ┌─────────────────┐       ┌──────────────────────┐
    │  Active (100%)  │       │  Passive (Standby)   │
    │                 │       │                      │
    │  App Gateway    │       │  App Gateway (off)   │
    │  App Service    │──────>│  App Service (cold)  │
    │  SQL Database   │──────>│  SQL Database (DR)   │
    │  Blob Storage   │──────>│  Blob Storage (GRS)  │
    └─────────────────┘       └──────────────────────┘

Traffic flows to Malaysia West normally. Data is replicated to Southeast Asia asynchronously. During a disaster, you initiate failover: traffic is redirected, the standby resources are activated, and the application resumes from the DR region.

RTO and RPO Expectations

Component	RTO	RPO
Azure SQL Database (Geo-restore)	Hours; Microsoft's documented estimate is about 12 hours, depending on database size	Around 1 hour, depending on backup timing
Azure SQL Database (Active Geo-replication / failover groups)	Typically less than 60 seconds for database failover, excluding dependent application components	Asynchronous replication; data loss can be greater than zero if recent changes have not replicated
Azure VMs (Azure Site Recovery)	Depends on recovery plan, VM boot time, DNS/routing, and validation steps	Depends on configured replication and latest available recovery point
Azure Blob/Storage (GRS or RA-GRS)	Depends on account failover and application design	Geo-replication is asynchronous; Microsoft documents Block Blob geo priority replication RPO as less than or equal to 15 minutes where that feature applies
App Service (pre-provisioned DR deployment)	Depends on Traffic Manager/Front Door configuration and app warm-up	Varies by application state and deployment model

The ranges are wide because RTO and RPO depend on the specific replication technologies you choose. Azure SQL failover groups and active geo-replication can provide fast database failover, but they still use asynchronous replication and do not remove the need to design the application, identity, storage, networking, and DNS layers for recovery.
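One way to make this concrete is to add up the sequential recovery steps instead of quoting the database figure alone. The sketch below is illustrative only: the step names and durations are hypothetical placeholders, not Azure measurements, and it assumes the steps run one after another and that the slowest-replicating data store sets the effective RPO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryStep:
    """One sequential step in a failover runbook (names are hypothetical)."""
    name: str
    minutes: float


def end_to_end_rto(steps: list[RecoveryStep]) -> float:
    """Business RTO when steps run one after another: the sum of all steps."""
    return sum(step.minutes for step in steps)


def effective_rpo(component_rpo_minutes: dict[str, float]) -> float:
    """Worst-case RPO: the data store that lags furthest sets the figure."""
    return max(component_rpo_minutes.values())


# Placeholder durations; replace with numbers from your own failover tests.
runbook = [
    RecoveryStep("detect and declare incident", 10),
    RecoveryStep("database failover", 1),
    RecoveryStep("app warm-up", 5),
    RecoveryStep("routing change and validation", 4),
]
rto_minutes = end_to_end_rto(runbook)
rpo_minutes = effective_rpo({"sql": 0.5, "blob": 15})
print(f"RTO {rto_minutes} min, RPO {rpo_minutes} min")
```

Even with a one-minute database failover, the placeholder runbook lands well above one minute because detection, warm-up, and validation dominate.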

Cost Modeling: Active-Passive for a Web Application

A typical three-tier web application (web tier, API tier, database) in an active-passive DR setup:

Resource	Primary Region	DR Region (Passive)
App Service / API compute	Production sizing	Stopped, scaled down, or pre-provisioned depending on RTO
SQL Database	Production sizing	Geo-secondary or failover group secondary sized to recovery/read needs
Azure Front Door / routing	Global service	Same global instance, with secondary origin configured
Azure Site Recovery	Protected VM scope	Replication, storage, and recovery-plan configuration
Storage	Primary redundancy choice	GRS/RA-GRS/GZRS or explicit replication pattern, subject to data residency

Illustrative active-passive model: the DR footprint is usually materially cheaper than a fully active second region because some compute can be stopped, scaled down, or kept as infrastructure-as-code until failover. Treat any percentage as workload-specific rather than universal.

For pricing, validate the actual SKU, region, licensing benefit, backup retention, data transfer, and support plan in Azure Pricing Calculator before using the figures in a proposal or budget.
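A small model can keep the comparison honest while you collect real quotes. The following is a minimal sketch, assuming you enter monthly figures exported from the Azure Pricing Calculator; the numbers shown are hypothetical placeholders, not Azure list prices.

```python
def dr_cost_ratio(primary_monthly: dict[str, float],
                  dr_monthly: dict[str, float]) -> float:
    """Return the DR region's monthly cost as a fraction of the primary's."""
    primary_total = sum(primary_monthly.values())
    if primary_total <= 0:
        raise ValueError("primary cost must be positive")
    return sum(dr_monthly.values()) / primary_total


# Hypothetical monthly figures in one currency; NOT Azure list prices.
primary = {"compute": 1000.0, "sql": 800.0, "storage": 200.0}
warm_standby = {"compute": 100.0, "sql": 400.0, "storage": 150.0, "asr": 50.0}

ratio = dr_cost_ratio(primary, warm_standby)
print(f"DR footprint is {ratio:.0%} of the primary region")
```

Re-run the calculation whenever the standby sizing changes, because a warmer standby (shorter RTO) moves the ratio toward 100%.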

Implementation: Active-Passive with Azure Site Recovery

Azure Site Recovery (ASR) orchestrates the replication, failover, and failback of Azure VMs between regions.

# Implementation note:
# Azure Site Recovery for Azure-to-Azure VM replication is usually configured
# from the Azure portal or automation generated after the Recovery Services
# vault, fabric, protection container, policy, network mapping, and protected
# item relationships exist. Do not treat the following as a copy-paste script.

# Verify the Site Recovery extension is available in the installed Azure CLI.
az extension add --name site-recovery
az site-recovery --help

# Recommended operational pattern:
# 1. Create or select a Recovery Services vault in the DR region.
# 2. Enable replication for each VM and confirm target network, subnet,
#    VM size, disk type, and availability-zone settings.
# 3. Build a recovery plan grouping app, API, and database tiers.
# 4. Run test failover into an isolated VNet at least quarterly.
# 5. Record actual RTO/RPO from the test, not just design targets.
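The quarterly cadence in step 4 is easy to let slip, so some teams track it in code. This is a minimal sketch, assuming a quarter is approximated as 92 days and that the last test date is recorded somewhere you can read it; the function name and threshold are illustrative, not part of Azure Site Recovery.

```python
from datetime import date


def failover_test_overdue(last_test: date, today: date,
                          max_gap_days: int = 92) -> bool:
    """True when more than max_gap_days have passed since the last test failover."""
    return (today - last_test).days > max_gap_days


# Hypothetical dates for illustration.
overdue = failover_test_overdue(date(2025, 1, 1), date(2025, 5, 1))
on_schedule = failover_test_overdue(date(2025, 1, 1), date(2025, 3, 1))
print(overdue, on_schedule)
```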

Implementation: Active-Passive with Azure SQL Geo-Replication

For a better RPO than ASR, configure Active Geo-Replication on Azure SQL Database:

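-- Run in the master database on the primary logical server (Malaysia West).
-- The secondary server must already exist; its name here matches the
-- sql-southeastasia server used in the failover group example.
ALTER DATABASE [app-db]
    ADD SECONDARY ON SERVER [sql-southeastasia]
    WITH (ALLOW_CONNECTIONS = ALL, SECONDARY_TYPE = GEO,
          SERVICE_OBJECTIVE = 'GP_Gen5_2');

-- During failover, promote the secondary to primary.
-- Run in the master database on the secondary server. FAILOVER waits for
-- synchronization; FORCE_FAILOVER_ALLOW_DATA_LOSS is the forced variant.
ALTER DATABASE [app-db] FAILOVER;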

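Active geo-replication on its own does not redirect clients: after a failover, the application must connect to the new primary server. A failover group adds listener endpoints that follow the primary, so connection strings do not change. To route read-only queries to the readable secondary, connect to the secondary server or the failover group's read-only listener and set ApplicationIntent=ReadOnly in the connection string.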
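As an illustration of the read-only routing, the sketch below builds two ODBC-style connection strings. It assumes ODBC Driver 18 for SQL Server and a failover group named fg-app-db; authentication settings are deliberately omitted, and the helper name is hypothetical.

```python
def build_connection_string(server: str, database: str, read_only: bool) -> str:
    """Build an ODBC connection string; read_only adds ApplicationIntent=ReadOnly."""
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{server},1433",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        # Declares read intent so the connection can be served by a readable secondary.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


# Assumed listener names for a failover group called fg-app-db.
write_conn = build_connection_string(
    "fg-app-db.database.windows.net", "app-db", read_only=False)
report_conn = build_connection_string(
    "fg-app-db.secondary.database.windows.net", "app-db", read_only=True)
print(report_conn)
```

Keeping reporting traffic on its own connection string makes it simple to point those queries back at the primary if the secondary is being resized or rebuilt.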

Active-Active (Geo-Redundant) Architecture

In an active-active architecture, both regions handle traffic simultaneously. Traffic is load-balanced across regions using Azure Traffic Manager or Azure Front Door. Each region runs identical infrastructure and serves user requests.

How It Works

                      ┌──────────────────────┐
                      │  Azure Front Door     │
                      │  (Global Load Balancer)│
                      └──────┬───────────────┘
                             │
                ┌────────────┴────────────┐
                │                         │
    Malaysia West                  Southeast Asia
    ┌──────────────────┐       ┌──────────────────────┐
    │  App Service      │       │  App Service          │
    │  (50% traffic)    │       │  (50% traffic)        │
    │                   │       │                       │
    │  SQL (Read-Write) │<─────>│  SQL (Read-Only)      │
    │  Blob (LRS)       │       │  Blob (LRS)           │
    └──────────────────┘       └──────────────────────┘

For active-active to work, the application must be stateless (session state must go to Redis or Cosmos DB), and the database must support multi-region writes or accept that writes happen in one region and are replicated to the other.
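The statelessness requirement can be sketched as follows. The in-memory class is only a stand-in for an external store such as Redis or Cosmos DB, and all class and method names are hypothetical; in a real deployment that store's own replication model decides how consistent the two regions are.

```python
from typing import Optional


class InMemoryStore:
    """Stand-in for a shared external store such as Redis or Cosmos DB."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> Optional[dict]:
        return self._data.get(session_id)

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = data


class RegionalApp:
    """App instance that keeps no session state in process memory."""

    def __init__(self, region: str, store: InMemoryStore) -> None:
        self.region = region
        self._store = store

    def add_to_cart(self, session_id: str, item: str) -> list[str]:
        session = self._store.get(session_id) or {"cart": []}
        cart = session["cart"] + [item]  # build a new list, never mutate in place
        self._store.put(session_id, {"cart": cart})
        return cart


store = InMemoryStore()
malaysia = RegionalApp("malaysiawest", store)
singapore = RegionalApp("southeastasia", store)

# The same session is served by both regions without losing state.
malaysia.add_to_cart("session-1", "laptop")
cart = singapore.add_to_cart("session-1", "mouse")
print(cart)
```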

RTO and RPO Expectations

Component	RTO	RPO
Azure Front Door routing	Often fast when origins and health probes are already configured	N/A
Azure SQL active geo-replication / failover group	Microsoft documents database disaster-recovery RTO as typically less than 60 seconds for customer-managed failover, excluding dependent components	Asynchronous replication; possible data loss for unreplicated recent writes
Application state	Near-zero only if the application is stateless or uses a region-aware state store	Depends on state-store replication model
DNS propagation	Avoided if Azure Front Door or another global routing layer remains the stable entry point	N/A

Active-active can reduce infrastructure recovery time because compute is already online in the secondary region. However, database failover, replication lag, identity, secrets, dependent services, and application consistency still determine the real business RTO/RPO.

Cost Modeling: Active-Active for the Same Web Application

Resource	Primary Region	DR Region (Active)
App Service / API compute	Production capacity	Production or near-production capacity
SQL Database	Primary database	Secondary database / failover group replica sized for recovery and read traffic
Azure Front Door / routing	Global service	Same global instance with both origins active or priority-routed
Redis / application state	Primary state tier	Region-aware state design or replicated/cache-warm strategy
Storage	Primary storage pattern	Secondary storage pattern selected for consistency, latency, and compliance

Illustrative active-active model: expect a materially higher monthly cost than active-passive because compute, observability, security, and operational testing run in both regions. The premium is workload-specific and should be priced in Azure Pricing Calculator before committing to a budget.

The extra cost comes from running more resources continuously, not from a fixed universal percentage.

Implementation: Active-Active with Azure Front Door and SQL Auto-Failover

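# Azure Front Door in front of both regions.
# Equal priority and equal weight send traffic to both origins (active-active).
# For active-passive routing, give the secondary origin --priority 2 instead.
az afd profile create \
    --resource-group rg-global \
    --profile-name afd-wenfeng \
    --sku Premium_AzureFrontDoor

az afd endpoint create \
    --resource-group rg-global \
    --profile-name afd-wenfeng \
    --endpoint-name api-global \
    --enabled-state Enabled

# One origin group holds both regional origins; health probes decide
# which origins are eligible to receive traffic.
az afd origin-group create \
    --resource-group rg-global \
    --profile-name afd-wenfeng \
    --origin-group-name origins-app \
    --probe-request-type GET \
    --probe-protocol Https \
    --probe-interval-in-seconds 30 \
    --probe-path /health \
    --sample-size 4 \
    --successful-samples-required 3 \
    --additional-latency-in-milliseconds 50

# Malaysia West origin
az afd origin create \
    --resource-group rg-global \
    --profile-name afd-wenfeng \
    --origin-group-name origins-app \
    --origin-name app-malaysiawest \
    --host-name app-wenfeng-malaysiawest.azurewebsites.net \
    --origin-host-header app-wenfeng-malaysiawest.azurewebsites.net \
    --priority 1 \
    --weight 500 \
    --enabled-state Enabled \
    --https-port 443

# Southeast Asia origin
az afd origin create \
    --resource-group rg-global \
    --profile-name afd-wenfeng \
    --origin-group-name origins-app \
    --origin-name app-southeastasia \
    --host-name app-wenfeng-southeastasia.azurewebsites.net \
    --origin-host-header app-wenfeng-southeastasia.azurewebsites.net \
    --priority 1 \
    --weight 500 \
    --enabled-state Enabled \
    --https-port 443

# Route requests arriving on the endpoint to the origin group
az afd route create \
    --resource-group rg-global \
    --profile-name afd-wenfeng \
    --endpoint-name api-global \
    --route-name route-app \
    --origin-group origins-app \
    --supported-protocols Http Https \
    --forwarding-protocol MatchRequest \
    --https-redirect Enabled \
    --link-to-default-domain Enabled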

# SQL failover group for database-level failover (Azure CLI).
# --grace-period is in hours and 1 is the minimum, so Microsoft-managed
# automatic failover does not start until an outage has lasted at least 1 hour.
# For a faster, customer-initiated failover, run
# "az sql failover-group set-primary" against the secondary server
# (add --allow-data-loss to force it during an outage).
az sql failover-group create \
    --resource-group rg-db \
    --server sql-malaysiawest \
    --partner-server sql-southeastasia \
    --name fg-app-db \
    --failover-policy Automatic \
    --grace-period 1 \
    --add-db app-db

# The failover group provides listener endpoints that follow the primary:
# Read-Write: fg-app-db.database.windows.net
# Read-Only:  fg-app-db.secondary.database.windows.net
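Front Door chooses among healthy origins by priority first and weight second, which is what separates active-passive routing from active-active routing. The sketch below is a simplified model of that selection: it ignores Front Door's latency-sensitivity step, and the Origin class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    name: str
    priority: int  # lower number is preferred
    weight: int    # relative share among origins of equal priority
    healthy: bool


def traffic_share(origins: list[Origin]) -> dict[str, float]:
    """Share of traffic per origin under priority-then-weight routing."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        return {}
    best = min(o.priority for o in healthy)
    active = [o for o in healthy if o.priority == best]
    total = sum(o.weight for o in active)
    return {o.name: o.weight / total for o in active}


# Equal priority and weight: both regions serve traffic (active-active).
active_active = traffic_share([
    Origin("app-malaysiawest", 1, 500, True),
    Origin("app-southeastasia", 1, 500, True),
])
# Different priorities: the secondary only serves when the primary is unhealthy.
failed_over = traffic_share([
    Origin("app-malaysiawest", 1, 500, False),
    Origin("app-southeastasia", 2, 500, True),
])
print(active_active, failed_over)
```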

Malaysian Enterprise Considerations

1. Data Residency and PDPA

Malaysia's PDPA 2010 restricts transfers of personal data to places outside Malaysia unless an allowed transfer basis applies, such as consent, contractual necessity, or taking reasonable precautions and due diligence to ensure the data will not be processed in a way that would contravene the PDPA. This has direct implications:

Active-passive with ASR — if data is replicated to Southeast Asia (Singapore) for DR purposes, you may be transferring personal data out of Malaysia. Mitigate by: (a) identifying which data is personal data and excluding it from replication where feasible, (b) confirming the lawful transfer basis with legal counsel, and (c) using appropriate Microsoft contractual, security, and organizational controls.
Active-active with geo-replication — the same concern applies but is magnified because data is actively written in both regions. A stricter data classification exercise is required.
Mitigation strategies:
Use Azure Policy to require a data-classification tag on resources, and to audit or deny geo-redundant settings on resources tagged as containing personal data.
Implement a data classification framework before designing your DR architecture — separate personal data from non-personal data at the application level.
Consider keeping personal data in a separate database in Malaysia West only, while replicating non-personal application data for DR.
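The classification step above can be expressed as a simple rule before any replication is configured. This is a minimal sketch under stated assumptions: the Dataset fields and the rule itself are illustrative, and the transfer-basis flag stands for a decision made with legal counsel, not something code can determine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    contains_personal_data: bool
    transfer_basis_confirmed: bool = False  # set only after legal review


def may_replicate_cross_border(ds: Dataset) -> bool:
    """Non-personal data may replicate; personal data needs a confirmed basis."""
    return (not ds.contains_personal_data) or ds.transfer_basis_confirmed


def replication_plan(datasets: list[Dataset]) -> dict[str, list[str]]:
    """Split datasets into those to replicate and those kept in Malaysia West."""
    plan: dict[str, list[str]] = {"replicate": [], "keep_in_malaysia_west": []}
    for ds in datasets:
        key = "replicate" if may_replicate_cross_border(ds) else "keep_in_malaysia_west"
        plan[key].append(ds.name)
    return plan


# Hypothetical datasets for illustration.
plan = replication_plan([
    Dataset("product-catalog", contains_personal_data=False),
    Dataset("customer-profiles", contains_personal_data=True),
    Dataset("order-history", contains_personal_data=True,
            transfer_basis_confirmed=True),
])
print(plan)
```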

2. Network Latency

Malaysia West to Southeast Asia (Singapore) latency should be measured from the actual application network path. For architecture planning, assume it can materially affect synchronous write paths and validate with Azure Network Watcher, application telemetry, or direct synthetic tests.

Workload Type	Cross-region latency impact
Static web page serving	Usually low if content is cached or served from the nearest edge
API requests with multiple back-end round trips	Can become noticeable; measure end-to-end transaction latency
Real-time collaboration	Potentially problematic without region-aware design
Database synchronous replication	Avoid unless the platform and application explicitly support the latency profile
Video/voice conferencing	Usually better handled with edge/media services rather than cross-region application round trips

For active-active architectures, replication lag is a critical metric. Azure SQL active geo-replication is asynchronous, so design for possible lag and conflict/consistency behavior rather than assuming near-synchronous writes.
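To see why chatty APIs suffer more than static pages, multiply the measured round-trip time by the number of cross-region calls in a transaction. The sketch below assumes you already have round-trip samples from your own tests; the sample values are hypothetical, and the percentile uses the simple nearest-rank method.

```python
def added_latency_ms(round_trips: int, rtt_ms: float) -> float:
    """Extra latency when each back-end call crosses regions once."""
    return round_trips * rtt_ms


def percentile(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile of measured samples (pct is an integer 1-100)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division on integers
    return ordered[max(rank, 1) - 1]


# Hypothetical round-trip samples in milliseconds, not measured Azure values.
samples = [float(ms) for ms in range(1, 21)]
p95_ms = percentile(samples, 95)
penalty_ms = added_latency_ms(round_trips=6, rtt_ms=p95_ms)
print(p95_ms, penalty_ms)
```

Budgeting against a high percentile instead of the average keeps the estimate closer to what users feel during busy periods.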

3. Cost Sensitivity

Many Malaysian SMEs are cost-sensitive, so the total cost delta between active-passive and active-active is significant and should be modeled explicitly:

Active-passive (warm standby): typically lower because some DR compute can be scaled down or kept inactive
Active-active: typically higher because both regions run production-grade capacity, monitoring, security, and operational processes
Annual difference: calculate from the actual workload, region, licensing, support, network egress, backup, and operational staffing assumptions

For many SMEs, the active-active premium is only justified when the business impact of downtime clearly exceeds the additional run cost and operational complexity.
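That comparison reduces to a break-even calculation. The sketch below assumes you can estimate an hourly downtime cost and the outage hours that active-active would avoid each year; all figures are hypothetical ringgit amounts used only to show the arithmetic.

```python
def break_even_outage_hours(annual_premium: float,
                            downtime_cost_per_hour: float) -> float:
    """Outage hours per year at which the active-active premium pays for itself."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime cost per hour must be positive")
    return annual_premium / downtime_cost_per_hour


def active_active_justified(annual_premium: float,
                            downtime_cost_per_hour: float,
                            outage_hours_avoided_per_year: float) -> bool:
    """True when the avoided downtime cost exceeds the extra run cost."""
    return outage_hours_avoided_per_year * downtime_cost_per_hour > annual_premium


# Hypothetical figures in RM; replace with your own estimates.
premium = 120_000.0
cost_per_hour = 20_000.0
hours = break_even_outage_hours(premium, cost_per_hour)
print(f"Break-even at {hours} avoided outage hours per year")
```

The model leaves out operational staffing and complexity, so treat a marginal result as an argument for active-passive.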

4. Operational Capability

Active-active requires more operational maturity. You must:

Maintain identical infrastructure in two regions.
Test and validate traffic routing regularly.
Manage application state to be region-agnostic.
Have staff available to respond to regional incidents.
Monitor replication lag and DNS propagation.

Few Malaysian SMEs have dedicated DR teams. Azure Site Recovery's one-click failover test capability is a significant advantage for teams with limited bandwidth.

Decision Framework: Which Architecture for Which Scenario?

Active-Passive Is the Right Choice When:

Your RTO requirement is 15 minutes or more. Many business applications can tolerate 30-60 minutes of downtime, and a tested active-passive design can deliver that.
Your budget is constrained. A stopped or scaled-down DR footprint costs a fraction of a second full production region; the exact fraction is workload-specific and should be priced, not assumed.
Your IT team is small (1-3 people). Active-passive is simpler to operate. You can test failover quarterly without deep architectural knowledge.
Your workload is not latency-sensitive across regions. If users are primarily in Malaysia, having the DR region in Singapore is fine for failover scenarios.
You need PDPA compliance. You can keep personal data in Malaysia West only and replicate only non-sensitive data to the DR region.

Active-Active Is the Right Choice When:

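Your RTO requirement is under 1 hour for database failover. With compute already running in both regions and a customer-initiated forced failover (potential data loss), active-active can achieve sub-minute infrastructure failover. If you rely on Microsoft-managed automatic failover instead, expect at least 1 hour because of the grace period, and note that it can still lose writes that had not yet replicated.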
Your workload has users in multiple regions. If you serve customers in Malaysia, Singapore, and Indonesia simultaneously, active-active serves them from the nearest region.
Your SLA requires very high availability across regional failures. Active-active can support stronger regional resilience targets, but the actual SLA depends on the complete architecture, service SKUs, application design, and operational process.
Your budget supports the additional run cost. If the business impact of downtime is high enough, the added active-active cost may be justified by avoided outage impact.
Your team has DevOps capability. You need the operational maturity to maintain symmetric deployments across two regions.

The Compromise: Active-Passive with Readable DR

A hybrid approach that many enterprises overlook: configure active-passive but keep the DR database readable. Users who explicitly opt in (via a "read-only mode" link or during maintenance windows) can query the DR region without impacting production. This gives you:

The cost structure of active-passive, subject to the actual secondary database and read workload sizing
The ability to run read-heavy reporting queries against the DR database without impacting production
Warmer standby — if failover is needed, the database is already online and can be promoted in minutes
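The criteria in this section can be roughly encoded as a first-pass filter. This is a sketch, not a decision tool: the 15-minute RTO and three-person team thresholds are taken from the active-passive list above, the field names are hypothetical, and a result of active-passive for a workload that needs more is a signal to revisit budget or staffing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    rto_minutes: float
    multi_region_users: bool
    team_size: int
    budget_supports_second_region: bool
    wants_dr_reporting: bool = False


def recommend(req: Requirements) -> str:
    """First-pass architecture suggestion from the criteria in this section."""
    needs_active_active = req.rto_minutes < 15 or req.multi_region_users
    can_run_active_active = (req.budget_supports_second_region
                             and req.team_size > 3)
    if needs_active_active and can_run_active_active:
        return "active-active"
    if req.wants_dr_reporting:
        return "active-passive with readable DR"
    return "active-passive"


# Hypothetical profiles for illustration.
regional_saas = Requirements(5, True, 6, True)
local_sme = Requirements(60, False, 2, False)
sme_with_reports = Requirements(60, False, 2, False, wants_dr_reporting=True)
print(recommend(regional_saas), recommend(local_sme), recommend(sme_with_reports))
```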

Failover Testing Playbook

Regardless of which architecture you choose, test your failover at least quarterly. Here is a practical playbook:

Pre-Test (Week Before)

Notify stakeholders: "We are conducting a scheduled DR test on [date]. Expected [n] hours of read-only mode."
Take a full backup of critical databases (as a safety net).
Review the last test's findings and confirm all remediation items are complete.

Test Day

Active-passive test:

# Initiate planned failover of the ASR recovery plan through the ARM REST API.
# Confirm the failover operation type and api-version supported for your
# replication scenario first, and replace {subscription-id} with your own value.
az rest --method POST \
    --uri "https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/rg-dr/providers/Microsoft.RecoveryServices/vaults/asr-vault-southeastasia/replicationRecoveryPlans/dr-plan-web-app/plannedFailover?api-version=2024-10-01" \
    --body '{ "properties": { "failoverDirection": "PrimaryToRecovery" } }'

# Verify application is accessible from DR region
curl -f https://dr-app-wenfeng.azurewebsites.net/health

# Promote SQL secondary (for geo-replication setups)
az sql failover-group set-primary \
    --resource-group rg-db \
    --server sql-southeastasia \
    --name fg-app-db

Run application smoke tests against the DR endpoint.
Measure actual RTO and RPO against your SLAs.

Post-Test

Fail back to the primary region.
Document actual RTO/RPO achieved vs. targets.
Identify gaps: "Restored VM had outdated SSL certificate," "DNS TTL caused 10-minute propagation delay."
Fix gaps before the next test.
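Recording three timestamps during the test is enough to compute the achieved figures. The sketch below is one way to do it, assuming you note when the simulated outage began, when the last replicated write landed in the DR region, and when the service passed its smoke tests; the timestamps and targets shown are hypothetical.

```python
from datetime import datetime, timedelta


def measured_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Time from the start of the outage until the service was usable again."""
    return service_restored - outage_start


def measured_rpo(outage_start: datetime, last_replicated_write: datetime) -> timedelta:
    """Window of writes that would have been lost at the moment of the outage."""
    return outage_start - last_replicated_write


def meets_target(actual: timedelta, target: timedelta) -> bool:
    return actual <= target


# Hypothetical timestamps from a test run.
outage_start = datetime(2025, 1, 15, 10, 0)
last_replicated_write = datetime(2025, 1, 15, 9, 58)
service_restored = datetime(2025, 1, 15, 10, 25)

rto = measured_rto(outage_start, service_restored)
rpo = measured_rpo(outage_start, last_replicated_write)
print(rto, rpo, meets_target(rto, timedelta(minutes=30)))
```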

Key Takeaways

Active-passive is the default for many Malaysian enterprises — it can provide practical recovery objectives at materially lower cost than fully active-active, provided failover is tested.
Active-active is for high-availability requirements — it can reduce recovery time, but the cost premium and operational burden are workload-specific.
Data residency complicates active-active — Malaysia's PDPA requires careful evaluation of what data crosses the border to Singapore.
Test failover quarterly without exception — a DR plan never tested is not a DR plan. Use Azure Site Recovery's test failover in an isolated VNet.
Azure Site Recovery simplifies active-passive — automated replication, one-click failover testing, and built-in recovery plans make it accessible to teams without dedicated DR expertise.
The hybrid approach (active-passive with readable DR database) is a practical middle ground — warm standby cost structure with operational flexibility.

DR architecture is not a one-size-fits-all decision. I review and design disaster recovery strategies for Malaysian enterprises, covering Azure Site Recovery, active-passive, active-active, and cross-region failover patterns. Message me on LinkedIn.
