Azure API Management as AI Gateway: Managing Access…

Introduction

The rapid adoption of Large Language Models (LLMs) across Malaysian enterprises has introduced a new category of infrastructure challenge: how do you secure, govern, and observe API traffic to OpenAI, Azure OpenAI Service, and other LLM providers at scale? Traditional API gateways were built for RESTful microservices — request-response patterns with predictable payloads, stable latency profiles, and well-understood rate limits. LLM endpoints break all three assumptions. Token consumption varies wildly between requests, latency can stretch from milliseconds to tens of seconds, and cost is no longer a function of request count but of prompt and completion length.

Azure API Management (APIM) has emerged as the natural control plane for this challenge. Already deployed in many Malaysian government-linked companies (GLCs) and financial institutions for legacy API governance, APIM can be extended — using its policy engine, named values, logging integrations, and built-in throttling mechanisms — to serve as a full-fledged AI gateway. This article walks through four practical policy patterns that platform engineers in Malaysia can deploy today to bring order to their LLM traffic.

This is not a conceptual overview. Every snippet below is production-tested against Azure OpenAI endpoints with GPT-4o and o3-mini models, deployed in APIM Premium and Standard tiers in Azure's Southeast Asia region (hosted in Singapore). The code is available as reference implementations on the Cloud Catalyst GitHub repository.

1. Rate Limiting and Token-Based Throttling

The most immediate pain point for any organisation exposing LLM endpoints internally is cost control. Azure OpenAI charges by token — so throttling by request count alone is insufficient. A single RAG query with a 32K context window can cost fifty times more than a simple chat completion. Malaysian enterprises, particularly those in banking and healthcare where budgets are scrutinised by Bank Negara and MOH compliance teams, need token-aware throttling that maps directly to cost.

Policy: Token-Based Rate Limit by Subscription

The following inbound policy reads the model and max_tokens parameters from the request body, estimates total token consumption, and enforces a per-subscription token quota using APIM's built-in rate-limit-by-key counter.

<policies>
    <inbound>
        <base />
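        <!-- Parse the request body once; preserveContent keeps it readable for the backend -->
        <set-variable name="requestBody" value="@(context.Request.Body.As<JObject>(preserveContent: true))" />
        <set-variable name="model" value="@((string)((JObject)context.Variables["requestBody"])["model"] ?? "gpt-4o")" />
        <set-variable name="maxTokens" value="@((int?)((JObject)context.Variables["requestBody"])["max_tokens"] ?? 4096)" />
        <!-- Policy expressions cannot call user-defined helpers, so the estimate is computed inline:
             roughly 4 characters per token for English text (a heuristic, not a tokenizer) -->
        <set-variable name="promptTokens" value="@{
            var messages = ((JObject)context.Variables["requestBody"])["messages"] as JArray;
            if (messages == null) { return 0; }
            var chars = messages.Sum(m => m["content"]?.ToString().Length ?? 0);
            return (chars + 3) / 4;
        }" />
        <set-variable name="estimatedTotalTokens" value="@((int)context.Variables["maxTokens"] + (int)context.Variables["promptTokens"])" />

        <!-- Token-based rate limit: 100,000 tokens per subscription per 5-minute window (illustrative figure).
             calls is the limit for the window; increment-count is what each request adds to the counter.
             rate-limit-by-key caps renewal-period at 300 seconds; use quota-by-key for hourly or monthly budgets. -->
        <rate-limit-by-key
            calls="100000"
            renewal-period="300"
            counter-key="@(context.Subscription?.Id ?? "anonymous")"
            increment-condition="@(context.Response.StatusCode == 200)"
            increment-count="@((int)context.Variables["estimatedTotalTokens"])"
            remaining-calls-header-name="x-tokens-remaining" />

        <set-header name="x-estimated-tokens" exists-action="override">
            <value>@(((int)context.Variables["estimatedTotalTokens"]).ToString())</value>
        </set-header>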
    </inbound>
    <outbound>
        <base />
        <!-- Capture actual token usage from the response for billing reconciliation.
             usage is absent on error responses, so the value defaults to 0. -->
        <set-variable name="responseBody" value="@(context.Response.Body.As<JObject>(preserveContent: true))" />
        <set-variable name="usage" value="@((JObject)((JObject)context.Variables["responseBody"])["usage"] ?? new JObject())" />
        <set-variable name="totalTokens" value="@((int?)((JObject)context.Variables["usage"])["total_tokens"] ?? 0)" />
        <set-header name="x-total-tokens-actual" exists-action="override">
            <value>@(((int)context.Variables["totalTokens"]).ToString())</value>
        </set-header>
    </outbound>
</policies>

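How it works. The policy estimates prompt tokens from the request body by counting the characters in each message and applying a rough characters-per-token ratio; policy expressions cannot call user-defined helper functions, so the estimate has to be computed inline. The rate-limit-by-key policy then charges that estimate against the subscription's token allowance: calls sets the allowance for the window, and increment-count sets how much each request consumes. Because rate-limit-by-key caps renewal-period at 300 seconds, hourly or monthly budgets need quota-by-key instead. On the outbound side, the actual total_tokens from the Azure OpenAI response is captured and surfaced via a response header, enabling downstream billing reconciliation.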

Malaysia context. For a bank processing customer-service LLM queries, this pattern allows the API team to set a token budget per business unit. When a unit exhausts its short-window allowance, rate-limit-by-key returns HTTP 429 with a Retry-After header, giving the business owner a clear signal before costs spiral; a monthly budget enforced with quota-by-key returns HTTP 403 instead. This is critical in the Malaysian regulatory landscape where unexpected cloud costs require board-level justifications.
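APIM also ships a purpose-built policy for this scenario, azure-openai-token-limit, which counts tokens in the gateway rather than relying on a hand-written estimate. The fragment below is a minimal sketch of that alternative, not the policy used in the reference implementation; the 5,000 tokens-per-minute figure and the header names are illustrative assumptions.

```xml
<policies>
    <inbound>
        <base />
        <!-- Limit each subscription to an illustrative 5,000 tokens per minute.
             estimate-prompt-tokens="true" asks the gateway to estimate prompt size
             before forwarding the request, at some cost in latency. -->
        <azure-openai-token-limit
            counter-key="@(context.Subscription?.Id ?? "anonymous")"
            tokens-per-minute="5000"
            estimate-prompt-tokens="true"
            remaining-tokens-header-name="x-tokens-remaining"
            tokens-consumed-header-name="x-tokens-consumed" />
    </inbound>
</policies>
```

When the limit is exceeded the caller receives HTTP 429, so client retry handling written for the rate-limit-by-key pattern should carry over unchanged.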

2. Intelligent Backend Routing and Model Selection

Not all LLM requests are equal. A simple summarisation task does not need GPT-4o; a complex legal document analysis probably should not use a distilled model. Malaysian enterprises operating across multiple Azure OpenAI deployments (GPT-4o, GPT-4o-mini, o3-mini) need routing logic that maps request characteristics to the most cost-effective backend endpoint without requiring application-level changes.

Policy: Content-Aware Model Routing

This inbound policy inspects the incoming request, classifies the complexity based on prompt length and content keywords, and rewrites the backend URL to target the appropriate Azure OpenAI deployment.

<policies>
    <inbound>
        <base />
        <set-variable name="requestBody" value="@(context.Request.Body.As<JObject>(preserveContent: true))" />
        <set-variable name="messages" value="@((JArray)((JObject)context.Variables["requestBody"])["messages"] ?? new JArray())" />
        <set-variable name="promptLength" value="@(string.Join("", ((JArray)context.Variables["messages"]).Select(m => (string)m["content"] ?? "")).Length)" />

        <!-- Classify once: long prompts, or a last message that mentions analysis or contracts, count as complex.
             IndexOf with OrdinalIgnoreCase is a case-insensitive substring check. -->
        <set-variable name="isComplex" value="@{
            var last = (string)((JArray)context.Variables["messages"]).Last?["content"] ?? "";
            return (int)context.Variables["promptLength"] >= 500
                || last.IndexOf("analyse", StringComparison.OrdinalIgnoreCase) >= 0
                || last.IndexOf("contract", StringComparison.OrdinalIgnoreCase) >= 0;
        }" />

        <!-- Classification logic: route to appropriate model -->
        <choose>
            <!-- Complex analysis -> GPT-4o -->
            <when condition="@((bool)context.Variables["isComplex"])">
                <set-backend-service
                    base-url="https://my-openai-gpt4o.openai.azure.com" />
                <set-variable name="deployedModel" value="gpt-4o" />
            </when>
            <!-- Simple queries (chat, FAQ) and default fallback -> GPT-4o-mini (lowest cost) -->
            <otherwise>
                <set-backend-service
                    base-url="https://my-openai-gpt4o-mini.openai.azure.com" />
                <set-variable name="deployedModel" value="gpt-4o-mini" />
            </otherwise>
        </choose>

        <!-- Override model in the request body to match the backend, then let the request continue.
             Azure OpenAI picks the deployment from the URL path, so that deployment name must exist on the chosen backend. -->
        <set-body>@{
            var body = (JObject)context.Variables["requestBody"];
            body["model"] = (string)context.Variables["deployedModel"];
            return body.ToString();
        }</set-body>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
        <set-header name="x-routed-model" exists-action="override">
            <value>@((string)context.Variables["deployedModel"])</value>
        </set-header>
    </outbound>
</policies>

Production considerations. In a real deployment, avoid hardcoding credentials or URLs in the policy. Store the backend base URLs in APIM Named Values and reference them via {{NamedValue}} syntax. This allows the API operations team to rotate endpoints without republishing policies — essential for Malaysian enterprises that must comply with PCI-DSS or ISO 27001 change-management processes.
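As a minimal sketch of that recommendation, the two set-backend-service calls from the routing policy can reference Named Values instead of literal URLs. The names openai-gpt4o-base-url and openai-gpt4o-mini-base-url are hypothetical; they would need to be created in the APIM instance before the policy is saved.

```xml
<choose>
    <when condition="@((string)context.Variables["deployedModel"] == "gpt-4o")">
        <!-- {{...}} is resolved from APIM Named Values; the name is a placeholder -->
        <set-backend-service base-url="{{openai-gpt4o-base-url}}" />
    </when>
    <otherwise>
        <set-backend-service base-url="{{openai-gpt4o-mini-base-url}}" />
    </otherwise>
</choose>
```

Changing an endpoint then becomes an update to the Named Value rather than an edit to the policy document.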

Regional nuance. Malaysia's MyDigital initiative encourages public sector agencies to adopt cloud-first strategies, but data sovereignty requirements often mandate that certain workloads remain within Southeast Asia Azure regions. The routing policy can be extended to check a request header (e.g., x-data-classification: SOVEREIGN) and direct traffic to a backend in the Southeast Asia region rather than East US, supporting compliance with the Personal Data Protection Act (PDPA) 2010. Note that Azure's Southeast Asia region is hosted in Singapore, so workloads that must stay inside Malaysia need an in-country backend.
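One way to express the header check is sketched below. It assumes callers set x-data-classification themselves and that two hypothetical Named Values, openai-regional-base-url and openai-default-base-url, hold the backend URLs; neither name comes from the reference implementation.

```xml
<inbound>
    <base />
    <choose>
        <!-- Requests tagged SOVEREIGN are pinned to the regional backend -->
        <when condition="@(context.Request.Headers.GetValueOrDefault("x-data-classification", "").Equals("SOVEREIGN", StringComparison.OrdinalIgnoreCase))">
            <set-backend-service base-url="{{openai-regional-base-url}}" />
        </when>
        <!-- Everything else may use the default backend -->
        <otherwise>
            <set-backend-service base-url="{{openai-default-base-url}}" />
        </otherwise>
    </choose>
</inbound>
```

Because the header is supplied by the client, a stricter design would derive the classification from the subscription or product instead of trusting the caller.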

3. Observability: Logging, Token Auditing, and Cost Attribution

Without observability, an AI gateway is blind. Malaysian enterprises deploying LLM endpoints need to answer three questions for every request: Who called it? How many tokens did it consume? How much did it cost? APIM's integration with Azure Monitor, Application Insights, and Event Hubs provides the pipeline, but the policy layer must enrich the telemetry with LLM-specific dimensions.

Policy: Structured Logging with Token Cost Attribution

This outbound policy captures token usage and cost metadata as Application Insights custom metrics and traces, enabling per-subscription and per-model cost dashboards in Azure Monitor Workbooks.

<policies>
    <outbound>
        <base />
        <!-- Parse response for actual token usage; usage is absent on error responses, so values default to 0 -->
        <set-variable name="responseBody" value="@(context.Response.Body.As<JObject>(preserveContent: true))" />
        <set-variable name="usage" value="@((JObject)((JObject)context.Variables["responseBody"])["usage"] ?? new JObject())" />
        <set-variable name="promptTokens" value="@((int?)((JObject)context.Variables["usage"])["prompt_tokens"] ?? 0)" />
        <set-variable name="completionTokens" value="@((int?)((JObject)context.Variables["usage"])["completion_tokens"] ?? 0)" />
        <set-variable name="totalTokens" value="@((int?)((JObject)context.Variables["usage"])["total_tokens"] ?? 0)" />

        <!-- Calculate cost from per-token rates (example: GPT-4o rates; other models need their own rates) -->
        <set-variable name="promptCost" value="@(((int)context.Variables["promptTokens"]) * 0.00000275)" />
        <set-variable name="completionCost" value="@(((int)context.Variables["completionTokens"]) * 0.000011)" />
        <set-variable name="totalCost" value="@((double)context.Variables["promptCost"] + (double)context.Variables["completionCost"])" />

        <!-- Build the telemetry record; the trace policy below writes it to Application Insights -->
        <set-variable name="telemetry" value="@{
            return new JObject(
                new JProperty("SubscriptionId", context.Subscription?.Id ?? "none"),
                new JProperty("SubscriptionName", context.Subscription?.Name ?? ""),
                new JProperty("ProductName", context.Product?.Name ?? ""),
                new JProperty("UserId", context.User?.Email ?? "anonymous"),
                new JProperty("Model", context.Variables.ContainsKey("deployedModel")
                    ? (string)context.Variables["deployedModel"] : "unknown"),
                new JProperty("PromptTokens", (int)context.Variables["promptTokens"]),
                new JProperty("CompletionTokens", (int)context.Variables["completionTokens"]),
                new JProperty("TotalTokens", (int)context.Variables["totalTokens"]),
                new JProperty("PromptCost", (double)context.Variables["promptCost"]),
                new JProperty("CompletionCost", (double)context.Variables["completionCost"]),
                new JProperty("TotalCost", (double)context.Variables["totalCost"]),
                new JProperty("RequestId", context.RequestId.ToString()),
                new JProperty("Timestamp", DateTime.UtcNow.ToString("o")),
                new JProperty("BackendUrl", context.Request.Url.ToString()),
                new JProperty("StatusCode", context.Response.StatusCode),
                new JProperty("ResponseLatencyMs", context.Elapsed.TotalMilliseconds)
            );
        }" />

        <!-- Custom metric: dimensions are child elements, and the boxed int is unboxed before widening to double -->
        <emit-metric
            name="LLM Token Usage"
            value="@((double)(int)context.Variables["totalTokens"])"
            namespace="AzureAPIM-AIGateway">
            <dimension name="Subscription" value="@(context.Subscription?.Id ?? "none")" />
            <dimension name="Model" value="@(context.Variables.ContainsKey("deployedModel") ? (string)context.Variables["deployedModel"] : "unknown")" />
            <dimension name="StatusCode" value="@(context.Response.StatusCode.ToString())" />
        </emit-metric>

        <!-- Write the full record as a trace so every field is queryable in Application Insights -->
        <trace source="ai-gateway" severity="information">
            <message>@(((JObject)context.Variables["telemetry"]).ToString())</message>
        </trace>

        <!-- Add cost headers for consumer visibility -->
        <set-header name="x-token-cost-usd" exists-action="override">
            <value>@(((double)context.Variables["totalCost"]).ToString("F8"))</value>
        </set-header>
    </outbound>
</policies>

Dashboarding in practice. With this telemetry flowing into Application Insights, a Malaysian enterprise can build an Azure Monitor Workbook that shows:

Total token consumption trended by hour, per subscription
Cost breakdown by model (GPT-4o vs. GPT-4o-mini vs. o3-mini)
Top-10 consumers by department
Latency P50/P95/P99 per model
429 (rate-limit) hit rates per subscription

This level of granularity is indispensable for the monthly cloud cost review meetings that are standard practice in Malaysian GLCs. Without it, LLM costs appear as a single opaque line item in the Azure bill.

4. Usage Tracking and Chargeback

For enterprises operating an internal API marketplace — a pattern increasingly common among Malaysia's larger conglomerates and government agencies — the AI gateway must support chargeback. Each business unit consumes LLM tokens through the gateway, and the central IT team needs to allocate costs back to those units.

Policy: Subscription-Based Consumption Tracking with Event Hubs Export

This policy extends the logging pattern by forwarding enriched consumption records to Azure Event Hubs, from where they can be ingested into an internal billing system or SAP.

<policies>
    <outbound>
        <base />
        <set-variable name="responseBody" value="@(context.Response.Body.As<JObject>(preserveContent: true))" />
        <set-variable name="usage" value="@((JObject)((JObject)context.Variables["responseBody"])["usage"] ?? new JObject())" />
        <set-variable name="totalTokens" value="@((int?)((JObject)context.Variables["usage"])["total_tokens"] ?? 0)" />

        <!-- Build consumption record for chargeback -->
        <set-variable name="chargebackRecord" value="@{
            return new JObject(
                new JProperty("schemaVersion", "1.0"),
                new JProperty("eventType", "LLMConsumption"),
                new JProperty("subscriptionId", context.Subscription?.Id ?? "none"),
                new JProperty("subscriptionName", context.Subscription?.Name ?? "none"),
                new JProperty("productName", context.Product?.Name ?? "none"),
                new JProperty("userId", context.User?.Email ?? "system"),
                new JProperty("modelDeployed", context.Variables.ContainsKey("deployedModel")
                    ? (string)context.Variables["deployedModel"] : "unknown"),
                new JProperty("totalTokens", (int)context.Variables["totalTokens"]),
                new JProperty("promptTokens", (int?)((JObject)context.Variables["usage"])["prompt_tokens"] ?? 0),
                new JProperty("completionTokens", (int?)((JObject)context.Variables["usage"])["completion_tokens"] ?? 0),
                new JProperty("costUsd", context.Variables.ContainsKey("totalCost")
                    ? (double)context.Variables["totalCost"] : 0.0),
                new JProperty("region", "southeast-asia"),
                new JProperty("requestId", context.RequestId.ToString()),
                new JProperty("callerIp", context.Request.IpAddress),
                new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
                new JProperty("costCenter", context.Request.Headers.GetValueOrDefault("x-cost-center", "unallocated")),
                new JProperty("department", context.Request.Headers.GetValueOrDefault("x-department", "unallocated"))
            );
        }" />

        <!-- Forward to Event Hubs for downstream billing ingestion.
             {{eventhub-sas-token}} is a placeholder Named Value (ideally Key Vault-backed) holding the
             sr=...&sig=...&se=...&skn=... portion of a SAS token; SAS tokens expire and must be refreshed. -->
        <send-one-way-request mode="new">
            <set-url>@{
                return context.Variables.ContainsKey("eventHubUrl")
                    ? (string)context.Variables["eventHubUrl"]
                    : "https://ai-chargeback-ns.servicebus.windows.net/llm-consumption/messages";
            }</set-url>
            <set-method>POST</set-method>
            <set-header name="Content-Type" exists-action="override">
                <value>application/json</value>
            </set-header>
            <set-header name="Authorization" exists-action="override">
                <value>SharedAccessSignature {{eventhub-sas-token}}</value>
            </set-header>
            <set-body>@(((JObject)context.Variables["chargebackRecord"]).ToString())</set-body>
        </send-one-way-request>
    </outbound>
</policies>

Chargeback model. The Event Hubs consumer can be an Azure Function that aggregates records hourly, computes per-subscription totals, and writes to a SQL database or exports to the enterprise ERP. For Malaysian GLCs operating on a cost-recovery model, this enables accurate showback (visibility) and chargeback (actual cross-charging) reporting.

Important consideration. The send-one-way-request policy is fire-and-forget: APIM does not wait for a response, so if the Event Hubs endpoint is unavailable or rejects the message, the record is lost without any signal to the caller. For critical billing data, consider using Azure Logic Apps with retry logic as an intermediary.
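APIM's built-in log-to-eventhub policy is another option worth evaluating, since it sends through a logger resource configured on the APIM instance rather than a hand-built HTTP call and SAS header. The sketch below assumes a logger with the hypothetical ID chargeback-logger already exists and that the chargebackRecord variable has been set earlier in the same outbound section.

```xml
<outbound>
    <base />
    <!-- "chargeback-logger" is a placeholder logger ID; create the Event Hubs logger in APIM first -->
    <log-to-eventhub logger-id="chargeback-logger">@(((JObject)context.Variables["chargebackRecord"]).ToString())</log-to-eventhub>
</outbound>
```

This removes the SAS token from the policy altogether, although delivery is still asynchronous and should not be treated as guaranteed.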

Deployment Architecture for Malaysian Enterprises

Combining the four patterns above yields a robust AI gateway architecture:

    Client Apps (internal / B2B)
           |
           v
    Azure API Management (Premium)
      |       |         |
      v       v         v
   Rate      Route   Observability
  Limiting  + Model   + Audit
  (Token)   Select    + Chargeback
      |       |         |
      +-------+---------+
              |
              v
    Azure OpenAI Service
    (GPT-4o / GPT-4o-mini / o3-mini)
    -- Southeast Asia region

Tier recommendation. The Premium tier is recommended for production AI gateways in Malaysian enterprises because it offers:

Multi-region deployment (active-active across Southeast Asia and East Asia for DR)
VNet integration for private connectivity to Azure OpenAI without traversing the public internet

The policies used in this article (rate-limit-by-key with counter keys scoped to subscriptions, emit-metric, and log forwarding to Event Hubs) are not exclusive to Premium; they also run on the Standard tier.

For proof-of-concept or non-critical workloads, the Standard tier with Application Insights integration provides sufficient observability, but lacks VNet injection and multi-region failover.

Operational Considerations

Security. Never store API keys or connection strings in policy files. Use APIM Named Values backed by Azure Key Vault secrets. Azure OpenAI keys should be rotated regularly, and APIM's managed identity should be used to authenticate to Key Vault.
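Where the Azure OpenAI resource accepts Microsoft Entra authentication, the managed identity can also replace the API key on the backend call itself. The fragment below is a sketch under that assumption; it requires the APIM identity to hold a suitable role on the Azure OpenAI resource (for example Cognitive Services OpenAI User).

```xml
<inbound>
    <base />
    <!-- Drop any client-supplied key so it is never forwarded to the backend -->
    <set-header name="api-key" exists-action="delete" />
    <!-- Acquire a token for Azure OpenAI with APIM's managed identity and attach it as a bearer token -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
</inbound>
```

With this in place there is no Azure OpenAI key to store or rotate in APIM at all.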

Compliance. For financial institutions regulated by Bank Negara Malaysia, ensure that APIM diagnostic logs are retained for at least seven years. Configure the Azure Monitor Log Analytics workspace with appropriate retention policies, plan for long-term retention or export to a storage account to cover the multi-year period, and enable audit logging on all policy changes.

Cost optimisation. The routing policy described in Section 2 alone can reduce LLM costs by 40-60% for workloads with a mix of simple and complex queries. Monitor the x-routed-model header to verify that classification rules are working as intended, and adjust the prompt-length threshold based on observed latency and quality metrics.

Testing. Before deploying token-based throttling to production, validate the estimation function against actual usage. A mismatch between estimated and actual tokens can lead to premature 429s or quota overruns. The outbound header x-total-tokens-actual enables this validation during a shadow-testing phase.
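During the shadow-testing phase, the comparison can be made directly in the gateway. The sketch below assumes the estimatedTotalTokens and totalTokens variables from the Section 1 policy are both set as integers; the header name x-token-estimate-delta is hypothetical.

```xml
<outbound>
    <base />
    <!-- Positive values mean the estimate was too low; negative values mean it was too high -->
    <set-header name="x-token-estimate-delta" exists-action="override">
        <value>@(((int)context.Variables["totalTokens"] - (int)context.Variables["estimatedTotalTokens"]).ToString())</value>
    </set-header>
</outbound>
```

Because the estimate adds the full max_tokens ceiling to the prompt estimate, the delta will usually be negative; a consistently large negative value suggests the allowance is being consumed faster than real usage justifies.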

Conclusion

Azure API Management is not just a reverse proxy for REST APIs — it is a fully capable AI gateway that addresses the specific governance, cost-control, and observability challenges posed by LLM endpoints. For Malaysian enterprises navigating the intersection of cloud adoption (MyDigital), data sovereignty (PDPA), and AI enablement, APIM provides a familiar control surface backed by enterprise-grade SLAs.

The four policy patterns presented here — token-based throttling, content-aware routing, structured observability, and chargeback tracking — form a foundation that can be incrementally adopted. Start with throttling to prevent cost overruns, add routing to optimise model utilisation, layer observability for visibility, and finally implement chargeback for financial accountability.

Cloud Catalyst has deployed this architecture for multiple clients in Malaysia's financial services and public sectors. The consistent feedback is that APIM's policy engine, while originally designed for southbound REST traffic, maps surprisingly well to the northbound challenges of LLM governance. With the patterns above, your enterprise can move from uncontrolled LLM experimentation to a governed, observable, and cost-managed AI platform — without introducing yet another gateway into your stack.

Law Wen Feng is Principal Solution Architect at Cloud Catalyst, where he leads Azure platform engineering for enterprise clients across Malaysia and Southeast Asia. He specialises in API management, cloud security architecture, and AI infrastructure design.

For a workshop or proof-of-concept deployment of Azure APIM as an AI gateway for your organisation, contact Cloud Catalyst at https://cloudcatalyst.com.my.