Migrate from Pod Identity to Workload Identity on AKS

I’ve been using AAD Pod Identity and managed identity for some of my workloads. Since AAD Workload Identity now supports user assigned managed identity, it’s time to migrate my workloads from Pod Identity to Workload Identity. So here is how I did it.

Enable Workload Identity on an existing AKS cluster

To use Workload Identity, the AKS version needs to be 1.24 or higher. Workload Identity is still a preview feature, so it needs to be enabled on the cluster. To do so, the latest version of the aks-preview CLI extension is needed. We can add it with az extension add --name aks-preview, or update it with az extension update --name aks-preview.

We need to register the EnableWorkloadIdentityPreview and EnableOIDCIssuerPreview feature flags first.

az feature register --namespace "Microsoft.ContainerService" --name "EnableWorkloadIdentityPreview"
az feature register --namespace "Microsoft.ContainerService" --name "EnableOIDCIssuerPreview"

It will take some time for these feature flags to be registered. The following command can be used to monitor the status.

az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/EnableOIDCIssuerPreview') || contains(name, 'Microsoft.ContainerService/EnableWorkloadIdentityPreview')].{Name:name,State:properties.state}"

When the feature flags are registered, refresh the registration of the resource provider.

az provider register --namespace Microsoft.ContainerService

Now we can enable these two features on the cluster.

az aks update -g <resource-group> -n <cluster-name> --enable-workload-identity --enable-oidc-issuer

When the features are enabled successfully, we need to get the URL of the OIDC issuer from the cluster. We can save it in an environment variable.

export AKS_OIDC_ISSUER="$(az aks show -n myAKSCluster -g myResourceGroup --query "oidcIssuerProfile.issuerUrl" -otsv)"

Create Kubernetes service account

We can use the following yaml to create a Kubernetes service account.

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    azure.workload.identity/client-id: <USER_ASSIGNED_CLIENT_ID>
  labels:
    azure.workload.identity/use: "true"
  name: <service-account-name>

The USER_ASSIGNED_CLIENT_ID is the client ID of the user assigned managed identity that we will use. We can reuse the one that was used with the Pod Identity. However, we need to enable the federated identity credential on this managed identity later, and some regions don’t support federated identity credentials on user assigned managed identities yet. If our managed identity happens to be in one of those regions, we will have to create a new one in a supported region.

Enable federated identity credential on the managed identity

We can use the following command.

az identity federated-credential create --name <fid-name> --identity-name <user-assigned-mi-name> --resource-group <rg-name> --issuer ${AKS_OIDC_ISSUER} --subject system:serviceaccount:<service-account-namespace>:<service-account-name>
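
The --subject value has to match the service account exactly, in the form system:serviceaccount:<namespace>:<name>. As a minimal sketch, the helpers below build that string and the argument list for the command above, so the values can be checked before the command is run. The function names federated_subject and federated_credential_args are hypothetical and not part of any Azure tooling, and the values in the demo are placeholders.

```python
import shlex
from typing import List


def federated_subject(namespace: str, service_account: str) -> str:
    """Build the subject string that the federated credential must match."""
    for part in (namespace, service_account):
        # An empty value or a colon would produce an ambiguous subject.
        if not part or ":" in part:
            raise ValueError(f"invalid name: {part!r}")
    return f"system:serviceaccount:{namespace}:{service_account}"


def federated_credential_args(
    fid_name: str,
    identity_name: str,
    resource_group: str,
    issuer: str,
    namespace: str,
    service_account: str,
) -> List[str]:
    """Return the az command as an argument list (only flags shown above)."""
    return [
        "az", "identity", "federated-credential", "create",
        "--name", fid_name,
        "--identity-name", identity_name,
        "--resource-group", resource_group,
        "--issuer", issuer,
        "--subject", federated_subject(namespace, service_account),
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    args = federated_credential_args(
        "my-fid", "my-identity", "my-rg",
        "https://example.invalid/oidc/", "default", "workload-sa",
    )
    print(shlex.join(args))
```

Passing the arguments as a list (for example to subprocess.run) avoids shell quoting problems with the issuer URL.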

Update our workloads to use Workload Identity

Using Workload Identity with our workloads is very simple. We just need to add serviceAccountName: <service-account-name> to our pod spec. If the code of the workloads already uses the Azure Identity SDK, for example .NET code that uses either DefaultAzureCredential or ManagedIdentityCredential to get the credential, we need to make almost no changes to the code. It should work with Workload Identity as is. However, if the code gets the credential in some other way, it will have to be updated to work with Workload Identity. There are some examples in the workload identity repo which show how it works for different languages.
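
For code that does not use the Azure Identity SDK, it may help to see what the SDK does on our behalf. The following is a rough sketch, not the SDK's implementation: it assumes the pod has the environment variables injected by the Workload Identity webhook (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE and AZURE_AUTHORITY_HOST) and builds the OAuth 2.0 client credentials request that exchanges the projected service account token for an Azure AD access token. The function name build_token_request is hypothetical, and the request is only built here, not sent.

```python
import os
import urllib.parse
from typing import Tuple

DEFAULT_AUTHORITY = "https://login.microsoftonline.com/"


def build_token_request(scope: str) -> Tuple[str, bytes]:
    """Return (url, form_body) for the federated token exchange."""
    client_id = os.environ["AZURE_CLIENT_ID"]
    tenant_id = os.environ["AZURE_TENANT_ID"]
    token_file = os.environ["AZURE_FEDERATED_TOKEN_FILE"]
    authority = os.environ.get("AZURE_AUTHORITY_HOST", DEFAULT_AUTHORITY)

    # The kubelet refreshes this file, so read it on every request
    # instead of caching its content.
    with open(token_file, encoding="utf-8") as f:
        assertion = f.read().strip()

    url = f"{authority.rstrip('/')}/{tenant_id}/oauth2/v2.0/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "scope": scope,
        "client_assertion_type":
            "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": assertion,
    }).encode("ascii")
    return url, body
```

The returned URL and body can be sent with any HTTP client as a POST with the application/x-www-form-urlencoded content type. In practice, using the Azure Identity SDK is preferable because it also handles token caching and retries.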

The AKS team provides a sidecar image that can be used as a temporary migration workaround. But updating the code is the long-term solution.

Remove the Pod Identity

Once the workloads are migrated to Workload Identity, we can remove the Pod Identity and its CRDs, such as AzureIdentity and AzureIdentityBinding. Depending on how Pod Identity is deployed on the cluster, there are different ways to remove it. For example, I installed it on my cluster with Helm, so I can simply uninstall it with helm uninstall.

That’s all about the migration.



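Because of durian’s distinctive smell, people who like it absolutely love it, while people who don’t will avoid it at all costs. For this reason, many public places restrict carrying opened durians. In Singapore, for example, durians are not allowed in hotels, shopping malls, or on public transport. The last time we went to Desaru, the hotel had a notice posted in a prominent place forbidding guests from eating durian in their rooms. Some guests had no choice but to sit on the steps leading from the hotel to the beach to enjoy their durian. So the most enjoyable way to eat durian is to go to a durian stall, buy it, have it opened on the spot, and sit down to eat. Not only is it fresh, it also saves the trouble of carrying it around.

Singapore has many well-known durian stalls, such as Ah Di Durian near Dempsey Hill, where there is a long queue every time I go. But when it comes to the best place to eat durian, I think it has to be Geylang, especially the few large stalls around Lorong 33 to Lorong 36. Toward evening, trucks deliver durians freshly harvested that day. That is when the durian lovers start to appear, pick a few durians, find a seat by the roadside, and begin to feast. At that hour, the air in the whole area is filled with a strong smell of durian.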



Creating ARM template or Bicep for Azure with C# Code

Writing ARM templates or Bicep code for Azure deployments has always been difficult for me, even though the Visual Studio Code extensions for ARM templates and Bicep are great tools and help a lot. I’ve always wanted a tool with which I can simply say what I want to deploy, and it automatically generates an ARM template or Bicep code for me. I even created such a tool for the virtual network previously. It is very simple and specific, and only works for virtual networks. I wanted a more generic tool that could work for almost everything on Azure. I could not find such a tool, so I decided to roll my own over the last winter holidays, and I named it Azure Design Studio.

As I’m using Blazor WebAssembly as the core stack, one of the challenges Azure Design Studio faced was how to translate user inputs from the UI to the JSON of ARM templates. In vnet planner, I leveraged the dynamic type of C#. It works fine for a limited number of resources, but it does not scale when I want to cover all Azure resources in Azure Design Studio. Also, Azure ARM APIs and schemas are updated frequently. It’s impossible to keep up with the updates if the attributes and properties have to be verified manually. I need a set of POCO classes which provide strong typing in C#, can be serialized to JSON easily, and are up to date with the latest ARM schemas.

The packages of Azure SDK for .NET have model classes for Azure resources that can be used for my purpose. But there are two problems with them: 1) The JSON tooling used by Azure SDK is based on Newtonsoft.Json. I’d prefer to use System.Text.Json as much as possible as it is the recommended option from the Blazor team. 2) The package size of Azure SDK is huge. For example, the size of Microsoft.Azure.Management.Network is about 4.72MB, which makes it unsuitable for Blazor WASM.

So as a side project, I created AzureDesignStudio.AzureResources which includes a set of packages for Azure resources. In these packages, there are only POCO classes, so the size is minimal compared to Azure SDK. For example, the size of AzureDesignStudio.AzureResources.Network is only 94.09KB. The classes are decorated with System.Text.Json attributes, so there is no dependency on Newtonsoft.Json. All classes are generated automatically from the latest ARM schemas with a tool. So I hope it can catch up with the updates.

As a pilot, I’m using it in vnet planner and Azure Design Studio now. The following is an example of how you can use the package to create an ARM template easily.

using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Json.Serialization;

VirtualNetworks vnet = new()
{
    Name = "Hello-vnet",
    Location = "eastus",
    Properties = new()
    {
        AddressSpace = new()
        {
            AddressPrefixes = new List<string> { "" }
        }
    }
};

VirtualNetworksSubnets subnet = new()
{
    Name = $"{vnet.Name}/subnet1",
    Properties = new()
    {
        AddressPrefix = ""
    }
};

DeploymentTemplate deploymentTemplate = new()
{
    Schema = "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    ContentVersion = "",
    Resources = new List<ResourceBase> { vnet, subnet }
};

var template = JsonSerializer.Serialize(deploymentTemplate,
    new JsonSerializerOptions(JsonSerializerDefaults.Web)
    {
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingDefault,
        Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
        WriteIndented = true,
        Converters = { new ResourceBaseJsonConverter() }
    });


For more info and examples, please check out its GitHub repo. Again, this is a very opinionated personal project without any warranty. Use it at your own risk.

Kubernetes Secrets vs. Azure Key Vault

Here is an opinionated comparison that I created. Hope it can help you make the decision when you have to choose one.

How secrets are stored
  • Kubernetes Secrets: Stored in etcd, base64-encoded. (In AKS, etcd is managed by Microsoft and is encrypted at rest within the Azure platform.)
  • Azure Key Vault: Encrypted at rest, and Azure Key Vault is designed so that Microsoft does not see or extract your data.

Who can access the secrets
  • Kubernetes Secrets: By default,
      – People who can access the API server.
      – People who can create a pod and mount the secrets.
      – For AKS, people who can manage the AKS cluster in Azure Portal.
      – People who can connect to etcd directly. (In the AKS context, this would be Microsoft.)
    To limit the access, Kubernetes RBAC rules need to be carefully designed and maintained.
    Use tools such as Sealed Secrets to encrypt the secrets in etcd if you don’t want Microsoft to see your secrets.
  • Azure Key Vault: When using the Secrets Store CSI Driver for Azure Key Vault,
      – People who can create a pod and mount the CSI volume.
    Proper Kubernetes RBAC settings on namespaces could help to limit the access.

Who can create, change or delete secrets
  • Kubernetes Secrets: Similar to the above. Kubernetes RBAC rules are needed to limit who can create, change or delete secrets.
  • Azure Key Vault: Secrets cannot be modified via the Secrets Store CSI Driver. The secret management needs to be done via Azure Portal, CLI or APIs. The access is controlled by Azure RBAC.

Rotate or revoke
  • Kubernetes Secrets: Manually done via the Kubernetes API.
  • Azure Key Vault: The Azure Key Vault provider for the Secrets Store CSI Driver supports auto rotation of secrets.

Auditing, monitoring and alerting
  • Kubernetes Secrets: Kubernetes Auditing.
  • Azure Key Vault: Azure Monitor, Event Grid notifications, and automation with Azure Functions/Logic Apps etc.

The Essentials of Resource Management in Kubernetes

The resource requests and limits that we set for containers in a Pod spec are the key settings that we can use to influence how Kubernetes schedules the pod and manages the computational resources, typically CPU and memory, of nodes. Understanding how resource requests and limits work is essential to understanding resource management in Kubernetes.

Resource requests and limits

The Kubernetes scheduler uses the resource requests as one of the factors to decide on which node the pod can be scheduled. Rather than looking at the actual resource usage on each node, the scheduler compares the node allocatable with the sum of the resource requests of all pods running on the node. If you don’t set the resource requests, your pod may be scheduled on any node that still has unallocated resources. On the other hand, your pod may not get enough resources to run, or may even be terminated if the node is under resource pressure. Setting resource requests ensures containers get the minimum amount of resources that they need. It also helps the kubelet determine the eviction order when necessary.

With the resource limits, you set the hard limits of the resources a container can use. A hard limit means the container cannot use more of a resource than its limit allows, and attempting to do so has consequences. If it attempts to use more CPU, which is compressible, its CPU time will be throttled; if it attempts to use more memory, which is incompressible, it will be terminated with an OOMKilled error. If you don’t set resource limits, the container could use all available resources on the node. On the other hand, it could become a noisy neighbor and could be terminated when the node is under resource pressure. Setting resource limits caps the amount of resources a container can use.

If you specify the resource limits for a container, but don’t specify the resource requests, Kubernetes automatically assigns requests that match the limits. The different combinations of these two settings also define the QoS class of the pod.
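
To make the relationship between requests, limits and QoS class concrete, here is a simplified sketch of the classification rules. It is not the kubelet’s code: the function name qos_class and the dictionary layout are assumptions for illustration, only CPU and memory are considered, and quantities are compared as plain strings (so "1" and "1000m" would be treated as different values).

```python
from typing import Dict, List

Resources = Dict[str, str]  # e.g. {"cpu": "500m", "memory": "128Mi"}


def effective_requests(requests: Resources, limits: Resources) -> Resources:
    """Apply the defaulting rule: a missing request takes the limit's value."""
    merged = dict(limits)
    merged.update(requests)
    return merged


def qos_class(containers: List[Dict[str, Resources]]) -> str:
    """Classify a pod from the requests/limits of its containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        requests = effective_requests(container.get("requests", {}), limits)
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, equal to the request.
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"limits": {"cpu": "500m", "memory": "128Mi"}}]))
    print(qos_class([{"requests": {"cpu": "100m"}}]))
    print(qos_class([{}]))
```

Note that the first pod in the demo sets only limits, yet it is classified as Guaranteed because the requests default to the limits.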

Since the scheduler only uses the resource requests when scheduling pods, a node could be overcommitted, which means the sum of the resource limits of all pods on the node could exceed the node allocatable. Such a node could come under resource pressure. If that happens, especially if the node is under memory pressure, the pods running on it could be evicted.

Eviction of pods

From the resource management perspective, there are 2 situations where pods could be evicted:

  1. a pod attempts to use more memory than its limit.
  2. a node is under resource pressure.

Pods could also be evicted because of other reasons, such as pod priority and preemption etc. I won’t discuss them in this post. When a pod is evicted, if it can be restarted, Kubernetes will restart it.

When a pod with resource limits is scheduled on a node, the kubelet passes the resource limits to the container runtime as the CPU/memory constraints. The container runtime sets these constraints on the cgroup of the container. When the memory usage of the container is over its limit, the OOM killer of the Linux kernel kicks in and kills it. You will see OOMKilled error in the status of the pod. The kernel takes care of the resource usage of cgroups. Whether the node is under resource pressure or not doesn’t matter.

On the other hand, the kubelet monitors the resource usage of the node. When the resource usage of the node reaches a certain level, it marks the node’s condition, tries to reclaim node-level resources, and eventually evicts pods running on the node to reclaim resources. When the kubelet has to evict pods, it uses the following order to select which pod should be evicted first:

  1. Whether the pod’s resource usage exceeds its requests
  2. Pod priority
  3. The pod’s resource usage relative to its requests

The kubelet doesn’t use the pod’s QoS class directly to determine the eviction order. The QoS class is more like a tool to help us, humans, estimate the potential pod eviction order. The key factor here is the resource requests of the pod. From the above list we know that, apart from the pod priority:

  • BestEffort pods would be evicted first, as their resource usage always exceeds their requests and their usage relative to requests is huge, since there are no requests defined at all.
  • Burstable pods could be evicted second if their resource usage exceeds their requests.
  • Guaranteed pods, and Burstable pods whose usage doesn’t exceed their requests, are the last in the eviction order.
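
The three ordering criteria can be expressed as a sort key. The following is only an illustrative sketch for the memory pressure case, not the kubelet’s implementation; the PodStats class and the eviction_order function are hypothetical names, and "usage relative to requests" is modeled as usage minus requests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodStats:
    name: str
    priority: int
    memory_request: int  # bytes; 0 when no request is set (BestEffort)
    memory_usage: int    # bytes


def eviction_order(pods: List[PodStats]) -> List[str]:
    """Return pod names, first to be evicted first."""
    def key(pod: PodStats):
        exceeds = pod.memory_usage > pod.memory_request
        overage = pod.memory_usage - pod.memory_request
        # 1. pods exceeding their requests come first,
        # 2. then lower priority first,
        # 3. then the largest usage above requests first.
        return (not exceeds, pod.priority, -overage)

    return [pod.name for pod in sorted(pods, key=key)]


if __name__ == "__main__":
    pods = [
        PodStats("guaranteed", 0, 400, 350),
        PodStats("burstable-under", 0, 500, 300),
        PodStats("burstable-over", 0, 200, 300),
        PodStats("best-effort", 0, 0, 150),
    ]
    print(eviction_order(pods))
```

With equal priorities, the sketch puts the BestEffort pod and the Burstable pod that exceeds its requests ahead of the other two, which matches the estimate in the list above.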

Although the QoS class doesn’t affect how the kubelet determines the pod eviction order, it affects the oom_score that the OOM killer of the Linux kernel uses to determine the order of the containers it kills in case the node runs out of memory. The oom_score_adj value of each QoS class is in the table below.

QoS Class     oom_score_adj
Guaranteed    -997
BestEffort    1000
Burstable     2 – 999
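
For Burstable pods, the value within that range depends on how much memory the container requests relative to the node’s memory capacity. The sketch below follows the formula given in the Kubernetes node-pressure eviction documentation, min(max(2, 1000 - (1000 × memoryRequestBytes) / machineMemoryCapacityBytes), 999), together with the documented fixed values of -997 for Guaranteed and 1000 for BestEffort. The function name is hypothetical, and the kubelet’s actual implementation may differ in edge cases.

```python
GIB = 1024 ** 3


def oom_score_adj(qos: str, memory_request_bytes: int = 0,
                  node_memory_bytes: int = 1) -> int:
    """Approximate oom_score_adj for a container, per the documented formula."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    if qos == "Burstable":
        raw = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
        # Clamp so Burstable always sits between Guaranteed and BestEffort.
        return min(max(2, raw), 999)
    raise ValueError(f"unknown QoS class: {qos!r}")


if __name__ == "__main__":
    # A container requesting 4 GiB on a node with 16 GiB of memory.
    print(oom_score_adj("Burstable", 4 * GIB, 16 * GIB))
```

The larger the memory request relative to the node, the lower the score, so a Burstable container with a large request is less likely to be killed than one with a small request.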


Now we know how resource requests and limits work in Kubernetes. Here are some best practices you can use when defining pods.

  • All pods should have resource requests and limits specified. You can leverage Kubernetes features such as resource quotas and limit ranges to enforce this on namespaces. If you are on AKS, you can also use Azure Policy to enforce it.
  • For critical pods whose chances of being evicted you want to minimize, make sure their QoS class is Guaranteed.
  • To reduce the side effects of user pods on the system pods, separate system pods and user pods onto different nodes/node pools. If you are on AKS, create system and user node pools in the cluster.
  • If computational resources of your Kubernetes cluster are not a constraint, enable HPA and cluster autoscaler for the workloads.
  • On a node of a Kubernetes cluster, you should not deploy any components/software outside of Kubernetes. If you have to install additional components/software, use a Kubernetes-native way, such as a DaemonSet.


Scaling with Application Gateway Ingress Controller

How Application Gateway Ingress Controller (AGIC) works is depicted in the following diagram on its document site.

AGIC Architecture

Rather than pointing the backend pool of App Gateway to a Kubernetes service, AGIC updates it with the pods’ IP addresses. The gateway load balances the traffic to the pods directly. In this way, it simplifies the network configuration between the app gateway and the AKS cluster.

When the workload needs to scale out to handle the increasing user load, there are two parts that need to be considered, the scaling of the app gateway and the scaling of pods.

Scaling for Application Gateway

Application Gateway supports autoscaling. If you don’t change its default settings, it scales from 0 to 10 instances. However, setting the minimum instances to 0 is not a good idea for a production environment. As mentioned in the high traffic support document, autoscaling takes 6 to 7 minutes to provision and scale out to additional instances. If the number of minimum instances is too small, the app gateway may not be able to handle a traffic spike. You may see HTTP 504 errors in this case.

A reasonable number of minimum instances should be based on the Current Compute Unit metric. An app gateway instance can handle 10 compute units. You should monitor this metric to decide how many minimum instances you need.
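
As a back-of-the-envelope sketch of that calculation: take the peak value observed for the Current Compute Unit metric, add a buffer (this post recommends 20% to 30%), and divide by the 10 compute units an instance can handle. The function name and the floor of one instance are assumptions for illustration.

```python
def minimum_instances(peak_compute_units: int, buffer_percent: int = 20,
                      units_per_instance: int = 10) -> int:
    """Estimate the autoscale minimum from the observed peak compute units."""
    if peak_compute_units < 0 or buffer_percent < 0 or units_per_instance <= 0:
        raise ValueError("inputs must be non-negative, units_per_instance > 0")
    # Integer arithmetic avoids floating point rounding surprises.
    needed = peak_compute_units * (100 + buffer_percent)
    per_instance = units_per_instance * 100
    instances = -(-needed // per_instance)  # ceiling division
    # Assumption: never go below one instance in production.
    return max(1, instances)


if __name__ == "__main__":
    # Observed peak of 55 compute units with a 20% buffer.
    print(minimum_instances(55, buffer_percent=20))
```

The result is only a starting point. The metric should be reviewed regularly, since the traffic pattern of the workload may change over time.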

Scaling for Pods

Kubernetes handles the autoscaling of pods if you use HPA for it. However, when using AGIC, you may see HTTP 502 errors when pods scale down. In fact, HTTP 502 errors could happen in the following 3 situations when AGIC is in place:

  • You scale down the pods either manually or via HPA.
  • You are doing a rolling update of the workload.
  • Kubernetes evicts pods.

The issue occurs because the app gateway backend pool cannot be updated fast enough to match the changes on the AKS side. This document has more details about the issue. It also discusses some workarounds, but the issue cannot be avoided completely. You should be aware of the potential HTTP 502 errors when you are in one of the above situations.


Now we know the issues that we may face when the workload scales. Here are several recommendations which may help to minimize the chances of errors when you expect to handle increasing user loads.

  • Set proper values for the minimum and maximum instances of app gateway. Give 20% to 30% buffer to the minimum instances.
  • For critical workloads, pre-scale the pods and temporarily disable HPA to avoid an unexpected scale-down before the peak load. Re-enable HPA or scale down the pods when the peak load is over.
  • Ensure the AKS cluster has enough resources, and the critical pods have the proper QoS so the pods won’t be evicted unexpectedly.
  • Plan the proper time to do rolling update.

Enable Virtual Node on an Existing AKS Cluster

The virtual node can be enabled when you create a new AKS cluster. There are documents talking about how to do it with either the Azure CLI or Azure Portal. Since the virtual node is an AKS add-on, it can be enabled on existing AKS clusters as well, as long as the clusters are using Azure CNI as the network plug-in.

The following is the procedure of how to enable the virtual node for an existing AKS cluster.

1. In the VNet which the AKS cluster is in, create a new subnet. The virtual node is based on Azure Container Instances (ACI). When container groups are deployed to a VNet, the subnet is delegated to ACI and can therefore only be used for container groups. So don’t use the subnets that are used by other node pools.

2. Run the following command to enable the virtual node add-on.

az aks enable-addons -n <cluster-name> -g <resource-group-name> \
-a virtual-node --subnet <subnet-name>

3. When the command completes successfully, the virtual node is enabled on the cluster. You should see the virtual node when you run kubectl get nodes. If you check the cluster status in Azure Portal, you should see that virtual node pools are enabled on the Overview page. In the AKS node resource group, a managed identity is created for the ACI connector, and a network profile is created as well. You can view it with the command: az network profile list --resource-group <name of aks managed rg>.

In case you cannot see the virtual node after the add-on is enabled, a possible reason is that the managed identity for the ACI connector doesn’t have the proper permission on the VNet. This could happen especially when the VNet is not in the node resource group. You can manually grant the Contributor role on the VNet to the managed identity.

4. Deploy a pod to the virtual node by using nodeSelector and toleration such as this sample. Follow the steps in the same document to test if the pod works.

5. To remove the virtual node, follow the instructions in remove virtual nodes section of the document. The virtual node also needs to be removed with kubectl delete node virtual-node-aci-linux command. See a sample below.

# Disable the virtual node add-on
az aks disable-addons -a virtual-node -g <resource-group-name> -n <cluster-name>
# Delete the virtual node from the cluster nodes
kubectl delete node virtual-node-aci-linux
# Delete the network profile
MRG=$(az aks show --resource-group <resource-group-name> \
  --name <cluster-name> --query nodeResourceGroup --output tsv)
NPID=$(az network profile list --resource-group $MRG --query '[0].id' --output tsv)
az network profile delete --id $NPID -y

The disable-addons command simply removes the virtual node add-on from the cluster. It doesn’t drain and delete the virtual node. If there are pods running on the virtual node, those pods will end up in an error state, and the underlying ACI resources will not be removed either. It’s better to remove all pods before disabling the add-on.

Automate End-To-End UI Testing for Blazor WebAssembly App using Playwright

When I was developing the Azure Virtual Network Capacity Planner, I had to run the UI tests manually every time I made some changes. It was a bit troublesome and not very efficient. I wanted to automate all the end-to-end UI tests so that I wouldn’t have to repeat them manually again. Meanwhile, I also wanted to try Playwright, an open source E2E testing tool freshly baked by Microsoft.

However, the Blazor documentation is very brief regarding E2E testing. It doesn’t mention a concrete approach that we can follow to do E2E testing for Blazor projects. So in this post, I’ll talk in detail about how we can automate E2E UI testing for Blazor WebAssembly with Playwright, and hopefully it can help to narrow the gap.

The Host

Before we can automate the browser to do any tests, we need a web host running in memory for the site that will be tested. For a Blazor Server project, this article from Gérald Barré explains very well how you can host it and test it with Playwright. In fact, thanks to Gérald Barré’s excellent work, the main idea of this post comes from his article as well.

For the Blazor WebAssembly app, we cannot host it directly in memory because it doesn’t include the server-side components that are needed for a host. The NuGet package Microsoft.AspNetCore.Components.WebAssembly.DevServer helps us debug and test the project locally. However, it is an exe rather than a dll, so we cannot reference and use it in a testing project. But thanks to OSS, we can create our own host server based on the source code of the DevServer. For example, the following snippet shows a version that I created. The Startup is a copy from DevServer.

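public class DevServer
{
    public static IHost BuildWebHost(string[] args) =>
        Host.CreateDefaultBuilder(args)
        .ConfigureHostConfiguration(config =>
        {
            var inMemoryConfiguration = new Dictionary<string, string>
            {
                [WebHostDefaults.EnvironmentKey] = "Development",
                ["Logging:LogLevel:Microsoft"] = "Warning",
                ["Logging:LogLevel:Microsoft.Hosting.Lifetime"] = "Information",
            };
            config.AddInMemoryCollection(inMemoryConfiguration);
        })
        // Assumed body: the original lines are elided; Startup is the copy from DevServer.
        .ConfigureWebHostDefaults(webBuilder => webBuilder.UseStartup<Startup>()).Build();
}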

We can then wire it up as an xUnit fixture and use it to host the Blazor WebAssembly app as a static website. See my code for more details. As the Blazor WebAssembly app has to be hosted as a static website, we need to publish it first and then provide its output folder as the content root in the tests.

Using Playwright

When we have the in-memory web host ready, we can use Playwright to automate the E2E UI testing. You can find all details about how to use PlaywrightSharp in Gérald Barré’s post. I won’t repeat it here.

One of the best parts of Playwright is that it supports multiple languages. One of them is Python. With Playwright for Python, we can record the user interactions in the browser and generate code that can be used in the test project. It generates code not only in Python, but in C# and JavaScript as well. Simply use the command python -m playwright codegen --target csharp to generate the code in C#, then copy the code to the test project to create test cases quickly.

The following is a screencast of running a test case with Playwright in headful mode with slow motion enabled. For a complete test project, please find it in my repo.

Running the testing in the build pipeline

To integrate the E2E testing with the build pipeline, we can simply run dotnet test after the dotnet publish. As the output folder of dotnet publish on the build agent could be different from the one on the local machine, I made the content root configurable in the test project by adding a testsettings.json file. With it, I can run the tests from both the local machine and the build agent.

Azure Virtual Network Capacity Planner

When you implement a landing zone or deploy workloads on Azure, the Virtual Network (VNet) is usually the fundamental Azure resource that you need to plan and deploy before other resources. Among all the important aspects that you need to plan for a VNet, there is a basic one: the address space of the VNet and its subnets. You need to ensure the VNet has plenty of address space for your workloads and future growth. At the same time, you should also try not to waste IP addresses. When you integrate Azure resources with the VNet, different resources may have different scaling patterns and therefore require different address space for the subnet. So, it’s important to plan the subnets based on the requirements of the resources that you will deploy.

Customers often ask me what kind of address space they need to plan for the VNet and subnets to run their workloads. To simplify this planning task, I created a tool, Azure Virtual Network Capacity Planner. With this tool, you create subnets for the Azure resources that you need to deploy. It helps you calculate the address space the subnets need and eventually the address space of the VNet. You can then export the result to an ARM template or a CSV file for the actual deployment.
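
To illustrate the kind of calculation involved, here is a small sketch. It is not the planner’s implementation: the function names are hypothetical, it handles IPv4 only, and it ignores the specific sizing rules of individual Azure services. It relies on two facts about Azure subnets: Azure reserves 5 IP addresses in every subnet, and the smallest supported subnet is a /29.

```python
import ipaddress
from typing import Dict

AZURE_RESERVED_IPS = 5   # reserved by Azure in every subnet
MIN_HOST_BITS = 3        # a /29 is the smallest subnet Azure supports


def subnet_prefix_length(required_ips: int) -> int:
    """Smallest prefix length whose subnet fits the required IP addresses."""
    if required_ips < 1:
        raise ValueError("required_ips must be at least 1")
    total = required_ips + AZURE_RESERVED_IPS
    host_bits = max(MIN_HOST_BITS, (total - 1).bit_length())
    return 32 - host_bits


def plan_subnets(vnet_cidr: str, required: Dict[str, int]) -> Dict[str, str]:
    """Carve subnets out of the VNet, largest first so each stays aligned."""
    vnet = ipaddress.IPv4Network(vnet_cidr)
    cursor = int(vnet.network_address)
    plan: Dict[str, str] = {}
    ordered = sorted(required.items(), key=lambda item: item[1], reverse=True)
    for name, count in ordered:
        prefix = subnet_prefix_length(count)
        subnet = ipaddress.IPv4Network((cursor, prefix))
        if not subnet.subnet_of(vnet):
            raise ValueError(f"{vnet_cidr} is too small for subnet {name!r}")
        plan[name] = str(subnet)
        cursor = int(subnet.broadcast_address) + 1
    return plan


if __name__ == "__main__":
    # Hypothetical requirements: subnet name -> number of IP addresses needed.
    print(plan_subnets("10.0.0.0/24", {"aks": 100, "appgw": 10, "bastion": 3}))
```

Allocating the largest subnet first keeps every subnet aligned to its own size without leaving gaps. A real plan should also leave room for growth, as discussed above.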

The following is a screenshot of the planner.

Azure Virtual Network Capacity Planner

The VNet planner is built with Blazor. You can find its source code in this repo. Feel free to raise issues or PRs if you have any feedback or better ideas.

User Groups in Azure API Management

In Azure API Management, there are 3 built-in groups: Administrators, Developers and Guests. These groups are meant for the developer portal to do the authorization for developer accounts. Based on which group a developer account is in, the developer portal controls what APIs the developer can see. The groups have nothing to do with the actual access control of the API endpoints in APIM.

According to this document, the built-in groups are immutable. Their membership is managed by APIM. You can neither add users to them, remove users from them, nor modify the groups themselves. The subscription administrators are the members of the Administrators group. It used to be possible to add a user account to the Administrators group by assigning the API Management Service Contributor role to it, but that is no longer the case. The users you add in APIM are the members of the Developers group. The unauthenticated users of the developer portal fall under the Guests group.

Besides the built-in groups, there is a built-in Administrator account which is immutable as well. You can neither delete it nor change its properties. Its email address is the one that you enter as the Administrator email when you provision the APIM instance. There is no way for you to set or change the other properties of this account, such as its first name, last name, or password. There is no UI for that, and if you try to do it via the management API, you get an HTTP 405 Method Not Allowed error. So choose the Administrator email carefully when provisioning the APIM instance.

If you really have to make changes to the built-in Administrator account, contact Azure Support.


It appears that the Administrator’s email can be changed through Notification templates > Email settings. However, using this option would cause a short downtime to APIM (Service is being updated… for several minutes). Be careful.