Using Azure Private Link Service Integration on AKS

In some cases, you may want to create a private link for the services that you host on an AKS cluster. With private link, clients or services outside of the AKS cluster can communicate with your services on the cluster via private endpoints.

A typical use case is that you want to use Azure Front Door as the load balancer in front of the services running on your AKS cluster. In this case, you would create a private link for your services, and configure Azure Front Door to connect to your services via the private link.

Creating a private link service manually is not a complex task, and there are documents describing how to do it with different tools, although you still need a basic understanding of the private link service concept. The only issue with manually creating a private link for services on an AKS cluster is lifecycle management: the lifecycle of Kubernetes services is managed by Kubernetes itself, while the private link is managed separately, which can add operational overhead.

This is where Azure Private Link Service Integration comes in. It is a feature of the Kubernetes Azure cloud provider that allows Kubernetes to create and manage the lifecycle of the private link. You just need to include the necessary annotations on the Kubernetes services that you want to expose via the private link, and the Azure cloud provider will do the rest.

Here is a sample that I created to show you how to create an nginx ingress controller and expose it via a private link. The sample is a simple Helm chart derived from the official nginx ingress controller, with custom values that include the annotations for the private link.
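
To give a flavor of what those values do, below is a minimal sketch of an annotated LoadBalancer service, assuming the azure-pls annotations of the Azure cloud provider; the service name and selector are illustrative, so check the cloud provider documentation for the exact annotations available in your cluster version.

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller   # illustrative name
  annotations:
    # Put the service behind an internal load balancer...
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    # ...and let the cloud provider create and manage a private link service for it.
    service.beta.kubernetes.io/azure-pls-create: "true"
    service.beta.kubernetes.io/azure-pls-name: my-pls   # illustrative name
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx   # illustrative selector
  ports:
    - port: 80
      targetPort: 80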

Musings on ChatGPT

The hottest topic in the internet and tech world right now is without doubt ChatGPT, the artificial intelligence (AI) chat application that debuted at the end of last year. With ChatGPT, ordinary people like you and me can interact with the most advanced AI directly in natural language, without having to understand the complex technology behind it. Almost overnight, it has become a hot topic among friends over food and drinks, a new frontier for investors, and the darling of everything from self-media and social media to the professional press. The discussions about ChatGPT are laced with excitement, anticipation, hope, and even fear, much like the scenes before each of the great technological revolutions in human history.

In fact, ChatGPT can indeed be regarded as the beginning of a great technological revolution, and the impact of the AI technology it represents on the tech industry is already visible. Major tech companies and research institutions from the US to China have announced investment plans for AI and ChatGPT-like technologies. Microsoft, even as it announced ten thousand layoffs at the start of this year, invested an additional US$10 billion in OpenAI, the company behind ChatGPT. Some joked that the ten thousand people Microsoft laid off were the first in human history to lose their jobs to AI.

That is just a joke, of course. But in reality, many people have already started using ChatGPT for entry-level work. Some people in cross-border trade, for instance, claim that with ChatGPT they can now handle email correspondence with foreign customers quickly and efficiently, and no longer need to hire anyone specifically for such paperwork. A friend of mine used ChatGPT to write a payment reminder email to a customer; the email not only expressed its intent precisely, its English wording and grammar were professional, idiomatic, and polished, at a level that people like us with limited English could never reach no matter how many business writing classes we took.

New technologies may render old professions obsolete, but they also create new ones. A new profession called Prompt Engineer, for example, has emerged alongside ChatGPT. It is so new that it doesn’t even have an official Chinese translation yet; some render it as “提示工程师” or “AI沟通工程师”. Its main responsibility is to use all kinds of prompts and questions to train and guide AI to give correct answers.

Such a profession is needed because AI like ChatGPT is still far from perfect. People have found that ChatGPT’s answers are not always correct, and it may give different answers to the same question; its math and science are poor, and it cannot pass the PSLE exam; it often produces false content, “talking nonsense with a straight face,” as people say; it even applies double standards, refusing to write a poem praising Trump while happily writing one praising Biden, and labeling Trump, Musk, and Putin as “controversial” figures while Biden, Bezos, and Bill Gates are not controversial at all. Getting correct answers out of ChatGPT requires giving it suitable prompts and guidance. Some joke that one of humanity’s main jobs in the future will be keeping AI company in conversation.

ChatGPT is based on an AI technique known as the “large language model”. Training such a model requires vast amounts of text data and enormous computing resources, making it a very expensive investment. According to reports, a single training run of the ChatGPT model costs about US$1.4 million, and the model went through hundreds, if not thousands, of such runs to reach its current level.

Recently, some so-called international affairs pundits have jumped on the ChatGPT bandwagon, lamenting that China failed to produce an innovation like ChatGPT because of democracy and science. Such opportunistic commentary is quite unnecessary, and this stale old topic offers nothing new. One could just as well ask: why didn’t Britain, France, Germany, Italy, Japan, or other so-called Western democracies produce ChatGPT? Was that also because of democracy and science? Moreover, simply attributing the cause to democracy and science is clearly a case of letting one’s position dictate one’s conclusions, with even less depth than ChatGPT’s own answer.

Let me conclude this post by quoting ChatGPT’s own answer to the pundits’ “grand question”:

“The field of artificial intelligence requires researchers, data, computing resources, and supportive policy environments from across the globe. An AI innovation like ChatGPT is therefore not the result of any single country holding an absolute technological advantage, but something brought about by the world as a whole.”

We Are Witnessing the Beginning of the Next-Generation Internet

Early yesterday morning, I found the following email sitting in my inbox.

A few days ago I had applied to try the new Bing, the search engine Microsoft announced last week, powered by OpenAI’s ChatGPT technology. The email notified me that I could start using it.

Although I had watched Microsoft’s launch event and read other people’s accounts, so I roughly knew the new features and experience of new Bing, actually using it was still quite a shock: the web search we have taken for granted for the past 20 years can be this intelligent once combined with ChatGPT’s large language model; the delight of having a question not just fully understood but directly answered; the relief of never again having to dig through every link to find an answer. I suspect that anyone who has experienced new Bing will find it hard to go back to the old way of searching.

After using new Bing, I gradually came to understand what Microsoft means by an “AI-powered copilot for the web”. In the past, information on the web was organized through search engines: you used a search engine to find links that might contain the information you needed, then you had to open those pages to find it yourself, a process that could be tedious and time-consuming. new Bing, by contrast, is like a faithful assistant: you tell it what you need, it combs the internet for you, finds the information, summarizes, compares, categorizes, and even comments on it, then hands you the result. It multiplies your productivity. Programmers have long been using Copilot. I believe this kind of copilot capability will appear in more and more software; we are living through a paradigm shift in software.

The arrival of new Bing is definitely not good news for Google. As Satya put it, starting with new Bing, “the gross margin of search is going to drop forever, irreversibly.” Obviously, the new way of searching directly threatens Google’s core business model: search advertising. But advertising isn’t even the most critical part. For example, when I asked new Bing how a Secondary 3 student should study physics, the answer included links to tuition centres. Those links are not ads today, but perhaps one day they could be.

The bigger threat to Google is that once AI is introduced, the computational cost of every search will inevitably rise sharply. I believe Google Bard’s model is no worse than ChatGPT’s, and quite possibly better; the mistake in the ad demo does not reflect Bard’s real capability. But even with a better model, adding Bard will still increase the computational cost of Google search. Given Google’s share of the search market, this will significantly increase Google’s costs and hurt its profitability. It shakes the very foundation of Google, which also explains why Google has been hesitating, reluctant to add AI to search. How Google responds next will be crucial.

With the arrival of new Bing, the paradigm of the internet is about to be reshaped. We are witnessing the beginning of the next-generation internet. The show is just about to start, and we happen to have front-row seats.

A Scare in Perth

Perth

Perth gave us a nasty welcome on the very first day we arrived.

This trip to Western Australia was our family’s first long-distance trip in the three years since the pandemic began. We had been to Australia several times before the pandemic, but those trips were mostly along the east coast, between Brisbane and Melbourne. This was our first time in Western Australia, so we set off with some modest expectations.

The weather in Perth was poor on the day we arrived: cloudy, with on-and-off drizzle, and only around 10 degrees. Walking down the street, we couldn’t help shivering, and even began to worry that we hadn’t packed enough clothes for the trip.

After picking up the car at the airport, since it was still early, we decided to stop by the University of Western Australia on the way to the hotel. Ever since we had kids, like most parents, we can’t resist taking them to see any famous university our travels happen to pass. At the university, we parked the car in a roadside parking spot. Perhaps because of the holidays, there were few cars parked along the road, hardly any pedestrians, and the campus was unusually quiet. We wandered around with Google Maps for a while, then found a cafe and sat down for a coffee and a rest.

When we walked back to get the car, we noticed from a distance that a rear side window seemed to have a big hole in it. Sensing trouble, we hurried over. Sure enough, the window had been smashed, and the backpack our daughter had left on the back seat was gone. We looked around blankly, at a loss for a moment.

We have always rented cars and driven ourselves when traveling in the US and Australia. We once drove around Los Angeles and Las Vegas and along the famous Route 66. We drove north from New York, across the US-Canada border, to the Canadian side of Niagara Falls. We visited the campuses of Harvard and Yale, parking by the roadside to browse the Harvard Book Store and to eat the burgers the Clintons once ate. In Australia, we drove the famous Great Ocean Road, and from Sydney to Brisbane, enjoying the wineries, farms, and picturesque bays along the way. In all those travels, this was the first time we ever had a window smashed and belongings stolen.

Taking stock of the damage, the luggage in the trunk was thankfully all still there, and we had our own backpacks with us. The only loss was our daughter’s backpack left on the back seat, which held some of her personal things. She lost the most among us, but she also learned a valuable lesson: not everywhere is like Singapore, where you can leave everything in the car.

Then came contacting the rental company for a replacement car and calling the police to file a report. Everyone expressed sympathy for what happened to us, yet no one seemed surprised, and they all wished us a pleasant trip in Western Australia. I thought to myself: let’s hope so. At the rental office, the lady serving me said the repair would cost 5,500 Australian dollars. I was shocked, and hurriedly dug out the rental agreement to show her that I had chosen the extra insurance when booking the car. She glanced at it and cheerfully told me that in that case there was nothing to pay. I was hugely relieved. Friends later asked how I had thought to pay extra for the additional insurance. I couldn’t say; perhaps it was simply providence.

From the second day on, the clouds over Perth cleared and the sun shone. It didn’t rain again before we left, and the blazing sun was strong enough to peel your skin. The rest of our trip in Western Australia, perhaps blessed by everyone’s good wishes, was a very pleasant one.

Migrate from Pod Identity to Workload Identity on AKS

I’ve been using AAD Pod Identity with user-assigned managed identities for some of my workloads. Since AAD Workload Identity now supports user-assigned managed identities, it’s time to migrate my workloads from Pod Identity to Workload Identity. Here is how I did it.

Enable Workload Identity on an existing AKS cluster

To use Workload Identity, the AKS version needs to be 1.24 or higher. Workload Identity is still a preview feature and needs to be enabled on the cluster. To do so, the latest version of the aks-preview CLI extension is needed. We can add or update it with az extension add --name aks-preview or az extension update --name aks-preview.

We need to register the EnableWorkloadIdentityPreview and EnableOIDCIssuerPreview feature flags first.

az feature register --namespace "Microsoft.ContainerService" --name "EnableWorkloadIdentityPreview"
az feature register --namespace "Microsoft.ContainerService" --name "EnableOIDCIssuerPreview"

It will take some time for these feature flags to be registered. The following command can be used to monitor the status.

az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/EnableOIDCIssuerPreview') || contains(name, 'Microsoft.ContainerService/EnableWorkloadIdentityPreview')].{Name:name,State:properties.state}"

When the feature flags are registered, refresh the registration of the resource provider.

az provider register --namespace Microsoft.ContainerService

Now we can enable these two features on the cluster.

az aks update -g <resource-group> -n <cluster-name> --enable-workload-identity --enable-oidc-issuer

When the features are enabled successfully, we need to get the URL of the OIDC issuer from the cluster. We can save it in an environment variable.

export AKS_OIDC_ISSUER="$(az aks show -n <cluster-name> -g <resource-group> --query "oidcIssuerProfile.issuerUrl" -otsv)"

Create Kubernetes service account

We can use the following yaml to create a Kubernetes service account.

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    azure.workload.identity/client-id: <USER_ASSIGNED_CLIENT_ID>
  labels:
    azure.workload.identity/use: "true"
  name: <SERVICE_ACCOUNT_NAME>
  namespace: <SERVICE_ACCOUNT_NAMESPACE>

The USER_ASSIGNED_CLIENT_ID is the client ID of the user-assigned managed identity that we will use. We can reuse the one that was used with Pod Identity. However, since we need to enable a federated identity credential on this managed identity later, and some regions don’t yet support federated identity credentials on user-assigned managed identities, if our managed identity happens to be in one of those regions, we will have to create a new one in a supported region.

Enable federated identity credential on the managed identity

We can use the following command.

az identity federated-credential create --name <fid-name> --identity-name <user-assigned-mi-name> --resource-group <rg-name> --issuer ${AKS_OIDC_ISSUER} --subject system:serviceaccount:<service-account-namespace>:<service-account-name>

Update our workloads to use Workload Identity

Using Workload Identity with our workloads is very simple: we just need to add serviceAccountName: <service-account-name> to the pod spec. If the workload code already leverages the Azure Identity SDK, for example .NET code that uses either DefaultAzureCredential or ManagedIdentityCredential to get the credential, we hardly need to change the code at all; it should work with Workload Identity straight away. However, if the code obtains the credential in other ways, it will have to be updated to work with Workload Identity. There are examples in the workload identity repo which show how it works for different languages.
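
For illustration, here is a minimal sketch of a pod spec that opts in to Workload Identity; the pod name and image are hypothetical, and the service account is the one created above.

apiVersion: v1
kind: Pod
metadata:
  name: sample-workload   # hypothetical name
  namespace: <SERVICE_ACCOUNT_NAMESPACE>
spec:
  # The service account carries the client-id annotation of the managed identity.
  serviceAccountName: <SERVICE_ACCOUNT_NAME>
  containers:
    - name: app
      image: <YOUR_IMAGE>   # hypothetical image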

The AKS team provides a sidecar image that can be used as a temporary migration workaround, but updating the code is the long-term solution.

Remove the Pod Identity

Once the workloads are migrated to Workload Identity, we can remove Pod Identity and its CRDs, such as AzureIdentity and AzureIdentityBinding. Depending on how Pod Identity was deployed on the cluster, there are different ways to remove it. For example, I installed it on my cluster with Helm, so I can simply uninstall it with helm uninstall.

That’s all about the migration.

Durian Season

It’s durian season again, the best time of year for durian lovers. Durians that are normally sold by weight can now, for some varieties, be had whole for just 5 or 10 dollars. Even some of the famous cultivars, though still sold by weight, are much cheaper than before. Durian stalls big and small have popped up in many corners of Singapore, and the fruit shops at the foot of the housing blocks have set up durian stands too.

Because of durian’s distinctive smell, those who love it love it to bits, while those who don’t avoid it like the plague. For this reason, many public places restrict opened durians: hotels, shopping malls, and public transport in Singapore all ban them. When we went to Desaru recently, the hotel had notices posted prominently forbidding durian in the rooms, so some guests simply sat on the steps leading from the hotel to the beach to enjoy theirs. The most enjoyable way to eat durian, then, is to go to a durian stall, buy one, have it opened on the spot, and sit down to eat: the fruit is fresh, and you’re spared the hassle of carrying it around.

Singapore has many well-known durian stalls, such as Ah Di durian near Dempsey Hill, where there’s a queue every time I go. But for me, the best place to eat durian is still Geylang, especially the few big stalls around Lorong 33 to Lorong 36. In the early evening, delivery trucks bring in durians picked that very day. That’s when the connoisseurs start to arrive, pick out a few durians, find a seat by the street, and tuck in. At that hour, the air in the whole neighbourhood is thick with the smell of durian.

At these stalls, the way durian is eaten is rather rough and ready. You pick the durians you want, and the stall hand hacks a few cracks into the husk without fully opening it. He places the opened durians in a plastic basket, you carry the basket to your table, pry the husk fully open with your own hands, and pull out the flesh to eat. Because you eat with your bare hands, the smell of durian is said to linger on them for up to three days. The men don’t mind, but the ladies usually put on the disposable plastic gloves the stall provides. The stall also hands out small bottles of water so you can rinse your mouth afterwards.

Durian is called the “king of fruits” and is very high in sugar, so delicious as it is, it shouldn’t be eaten in excess. Luckily durian ripens only once a year; otherwise who knows how many gourmands’ health it would ruin.

Creating ARM template or Bicep for Azure with C# Code

Writing ARM templates or Bicep code for Azure deployments has always been difficult for me, even though the Visual Studio Code extensions for ARM templates and Bicep are great tools and help a lot. I’ve always wanted a tool with which I could simply say what I want to deploy and have it automatically generate the ARM template or Bicep code for me. I even created such a tool for virtual networks previously, but it is very simple and specific, and only works for virtual networks. I wanted a more generic tool that could work for almost everything on Azure. Since I couldn’t find one, I decided to roll my own over the last winter holidays, and I named it Azure Design Studio.

As I’m using Blazor WebAssembly as the core stack, one of the challenges Azure Design Studio faces is how to translate user inputs from the UI into the JSON of ARM templates. In the vnet planner, I leveraged C#’s dynamic type. That works fine for a limited number of resources, but it doesn’t scale when I want to cover all Azure resources in Azure Design Studio. Azure ARM APIs and schemas are also updated frequently, and it’s impossible to keep up with the updates if the attributes and properties have to be verified manually. I need a set of POCO classes that provide strong typing in C#, can be serialized to JSON easily, and stay up to date with the latest ARM schemas.

The packages of the Azure SDK for .NET have model classes for Azure resources that could serve my purpose, but there are two problems: 1) the JSON tooling used by the Azure SDK is based on Newtonsoft.Json, while I’d prefer to use System.Text.Json as much as possible since it is the option recommended by the Blazor team; 2) the package size of the Azure SDK is huge. For example, Microsoft.Azure.Management.Network is about 4.72MB, which makes it unsuitable for Blazor WASM.

So as a side project, I created AzureDesignStudio.AzureResources, a set of packages for Azure resources. These packages contain only POCO classes, so their size is minimal compared to the Azure SDK; for example, AzureDesignStudio.AzureResources.Network is only 94.09KB. The classes are decorated with System.Text.Json attributes, so there is no dependency on Newtonsoft.Json. All classes are generated automatically from the latest ARM schemas with a tool, so I hope it can keep up with the updates.

As a pilot, I’m using it in the vnet planner and Azure Design Studio now. The following is an example of how you can use the package to create an ARM template easily.

// Assumed usings: System.Text.Json, System.Text.Json.Serialization,
// System.Text.Encodings.Web, and the AzureDesignStudio.AzureResources packages.
VirtualNetworks vnet = new()
{
    Name = "Hello-vnet",
    Location = "eastus",
    Properties = new()
    {
        AddressSpace = new()
        {
            AddressPrefixes = new List<string> { "10.0.0.0/16" }
        }
    }
};

VirtualNetworksSubnets subnet = new()
{
    Name = $"{vnet.Name}/subnet1",
    Properties = new()
    {
        AddressPrefix = "10.0.0.0/24"
    },
};

DeploymentTemplate deploymentTemplate = new()
{
    Schema = "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    ContentVersion = "1.0.0.0",
    Resources = new List<ResourceBase> { vnet, subnet }
};

var template = JsonSerializer.Serialize(deploymentTemplate,
    new JsonSerializerOptions(JsonSerializerDefaults.Web)
    {
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingDefault,
        Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
        WriteIndented = true,
        Converters = { new ResourceBaseJsonConverter() }
    });

Console.Write(template);

For more info and examples, please check out its GitHub repo. Again, this is a very opinionated personal project without any warranty. Use it at your own risk.

Kubernetes Secrets vs. Azure Key Vault

Here is an opinionated comparison that I created. I hope it helps you make a decision when you have to choose between the two.

How secrets are stored

  • Kubernetes Secrets: stored in etcd, base64-encoded. (In AKS, etcd is managed by Microsoft and is encrypted at rest within the Azure platform.)
  • Azure Key Vault: encrypted at rest, and Azure Key Vault is designed so that Microsoft does not see or extract your data.

Who can access the secrets

  • Kubernetes Secrets: by default, people who can access the API server; people who can create a pod and mount the secrets; for AKS, people who can manage the AKS cluster in the Azure Portal; and people who can connect to etcd directly (in the AKS context, this would be Microsoft). To limit the access, Kubernetes RBAC rules need to be carefully designed and maintained. Use tools such as Sealed Secrets to encrypt the secrets in etcd if you don’t want Microsoft to see your secrets.
  • Azure Key Vault (with the Secrets Store CSI driver): people who can create a pod and mount the CSI volume. Proper Kubernetes RBAC settings on namespaces can help limit the access.

Who can create, change or delete secrets

  • Kubernetes Secrets: similar to the above; Kubernetes RBAC rules are needed to limit who can create, change or delete secrets.
  • Azure Key Vault: secrets cannot be modified via the Secrets Store CSI driver. Secret management needs to be done via the Azure Portal, CLI or APIs, and the access is controlled by Azure RBAC.

Rotation and revocation

  • Kubernetes Secrets: done manually via the Kubernetes API.
  • Azure Key Vault: the Azure Key Vault provider for the Secrets Store CSI driver supports auto-rotation of secrets.

Auditing, monitoring and alerting

  • Kubernetes Secrets: Kubernetes auditing.
  • Azure Key Vault: Azure Monitor, Event Grid notifications, and automation with Azure Functions/Logic Apps, etc.
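
To make the Azure Key Vault column more concrete, here is a minimal sketch of a SecretProviderClass for the Secrets Store CSI driver, assuming the Azure provider’s parameter names; the vault and object names are hypothetical, so check the provider documentation for the exact fields in your version.

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-keyvault-secrets   # hypothetical name
spec:
  provider: azure
  parameters:
    keyvaultName: my-vault           # hypothetical vault name
    tenantId: <TENANT_ID>
    objects: |
      array:
        - |
          objectName: db-password    # hypothetical secret name
          objectType: secret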

The Essentials of Resource Management in Kubernetes

The resource requests and limits that we set for containers in a pod spec are the key settings we can use to influence how Kubernetes schedules the pod and manages the computational resources, typically CPU and memory, of nodes. Understanding how resource requests and limits work is essential to understanding resource management in Kubernetes.

Resource requests and limits

The Kubernetes scheduler uses the resource requests as one of the factors to decide on which node a pod can be scheduled. Rather than looking at the actual resource usage on each node, the scheduler uses the node allocatable and the sum of the resource requests of all pods running on the node to make the decision. If you don’t set resource requests, your pod may be scheduled on any node that still has unallocated resources; but on the other hand, your pod may not get enough resources to run, or may even be terminated if the node comes under resource pressure. Setting resource requests ensures containers get the minimum amount of resources they need. It also helps the kubelet determine the eviction order when necessary.

With resource limits, you set hard limits on the resources a container can use. Hard limits mean the container cannot use more resources than its limits; if it attempts to, there are consequences. If it attempts to use more CPU, which is compressible, its CPU time is throttled; if it attempts to use more memory, which is incompressible, it is terminated with an OOMKilled error. If you don’t set resource limits, the container could use all available resources on the node; but on the other hand, it could become a noisy neighbor and could be terminated when the node is under resource pressure. Setting resource limits caps the maximum amount of resources a container can use.

If you specify the resource limits for a container but don’t specify the resource requests, Kubernetes automatically assigns requests that match the limits. The different combinations of these two settings also define the QoS class of the pod.
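
To make this concrete, here is a minimal sketch of a pod whose container requests equal its limits, which gives it the Guaranteed QoS class; the names are illustrative.

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod   # illustrative name
spec:
  containers:
    - name: app
      image: nginx       # illustrative image
      resources:
        requests:
          cpu: 500m
          memory: 256Mi
        limits:          # equal to the requests, so the QoS class is Guaranteed
          cpu: 500m
          memory: 256Mi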

Since the scheduler only uses the resource requests when scheduling pods, a node could be overcommitted, which means the sum of the resource limits of all pods on the node could be more than the node allocatable. Such a node could come under resource pressure, and if that happens, especially under memory pressure, the pods running on it could be evicted.

Eviction of pods

From the resource management perspective, there are 2 situations where pods could be evicted:

  1. A pod attempts to use more memory than its limit.
  2. A node is under resource pressure.

Pods can also be evicted for other reasons, such as pod priority and preemption, which I won’t discuss in this post. When a pod is evicted, Kubernetes will restart it if it can be restarted.

When a pod with resource limits is scheduled on a node, the kubelet passes the resource limits to the container runtime as CPU/memory constraints, and the container runtime sets these constraints on the cgroup of the container. When the memory usage of the container goes over its limit, the OOM killer of the Linux kernel kicks in and kills it, and you will see the OOMKilled error in the status of the pod. The kernel enforces the resource usage of cgroups regardless of whether the node is under resource pressure.

On the other hand, the kubelet monitors the resource usage of the node. When the resource usage of the node reaches a certain level, the kubelet marks the node’s condition, tries to reclaim node-level resources, and eventually evicts pods running on the node to reclaim resources. When the kubelet has to evict pods, it uses the following order to select which pod is evicted first:

  1. Whether the pod’s resource usage exceeds its requests
  2. Pod priority
  3. The pod’s resource usage relative to its requests

The kubelet doesn’t use the pod’s QoS class directly to determine the eviction order. The QoS class is more like a tool to help us, humans, estimate the potential pod eviction order. The key factor here is the resource requests of the pod. From the above list we know that, apart from the pod priority:

  • BestEffort pods are evicted first, as their resource usage always exceeds their (nonexistent) requests, and their usage relative to requests is effectively unbounded since no requests are defined at all.
  • Burstable pods whose resource usage exceeds their requests are evicted next.
  • Guaranteed pods, and Burstable pods whose usage doesn’t exceed their requests, are last in the eviction order.

Although the QoS class doesn’t affect how the kubelet determines the pod eviction order, it does affect the oom_score that the OOM killer of the Linux kernel uses to determine the order in which containers are killed in case the node runs out of memory. The oom_score_adj value of each QoS class is shown in the table below.

QoS Class     oom_score_adj
Guaranteed    -997
BestEffort    1000
Burstable     2–999

Takeaways

Now we know how resource requests and limits work in Kubernetes. Here are some best practices you can apply when defining pods.

  • All pods should have resource requests and limits specified. You can leverage Kubernetes features such as resource quotas and limit ranges to enforce this on namespaces (see the LimitRange sketch after this list). If you are on AKS, you can also use Azure Policy to enforce it.
  • For critical pods whose chances of being evicted you want to minimize, make sure their QoS class is Guaranteed.
  • To reduce the side effects of user pods on system pods, separate system pods and user pods onto different nodes/node pools. If you are on AKS, create system and user node pools in the cluster.
  • If the computational resources of your Kubernetes cluster are not a constraint, enable HPA and the cluster autoscaler for the workloads.
  • You should not deploy any components/software on the nodes of a Kubernetes cluster outside of Kubernetes. If you have to install additional components/software, use a Kubernetes-native way, such as a DaemonSet.
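
As referenced in the list above, here is a minimal sketch of a LimitRange that applies default requests and limits to containers that don’t specify them; the namespace and values are illustrative.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources   # illustrative name
  namespace: my-app         # illustrative namespace
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container omits its requests
        cpu: 100m
        memory: 128Mi
      default:              # applied when a container omits its limits
        cpu: 500m
        memory: 256Mi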


Scaling with Application Gateway Ingress Controller

How the Application Gateway Ingress Controller (AGIC) works is depicted in the following diagram from its documentation site.

AGIC Architecture

Rather than pointing the backend pool of the App Gateway to a Kubernetes service, AGIC updates it with the pods’ IP addresses, and the gateway load balances the traffic to the pods directly. This simplifies the network configuration between the app gateway and the AKS cluster.

When the workload needs to scale out to handle increasing user load, there are two parts to consider: the scaling of the app gateway and the scaling of the pods.

Scaling for Application Gateway

Application Gateway supports autoscaling. If you don’t change the default settings, it scales from 0 to 10 instances. However, setting the minimum instance count to 0 is not a good idea for production environments. As mentioned in the high traffic support document, autoscaling takes 6 to 7 minutes to provision and scale out to additional instances. If the number of minimum instances is too small, the app gateway may not be able to handle a spike in traffic, and you may see HTTP 504 errors in this case.

A reasonable minimum instance count should be based on the Current Compute Units metric. An app gateway instance can handle 10 compute units. You should monitor this metric to decide how many minimum instances you need.

Scaling for Pods

Kubernetes handles the autoscaling of pods if you use HPA for them. However, when using AGIC, you may see HTTP 502 errors when pods scale down. In fact, HTTP 502 errors can happen in the following 3 situations when AGIC is in place:

  • You scale down the pods either manually or via HPA.
  • You are doing a rolling update of the workload.
  • Kubernetes evicts pods.

The issue is that the app gateway backend pool cannot be updated fast enough to match the changes on the AKS side. This document has more details about the issue. It also discusses some workarounds, such as connection draining (sketched below), but the issue cannot be avoided 100%. You should be aware of potential HTTP 502 errors when you are in one of the above situations.
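
As mentioned above, one of the documented workarounds is connection draining, which can be enabled with AGIC annotations on the ingress. Here is a minimal sketch with illustrative names; check the AGIC annotation documentation for your version.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app           # illustrative name
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    # Keep draining connections to terminating pods for up to 30 seconds.
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "30"
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service   # illustrative service
                port:
                  number: 80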

Recommendations

Now we know the issues we may face when the workload scales. Here are several recommendations that may help minimize the chances of errors when you expect to handle increasing user loads.

  • Set proper values for the minimum and maximum instances of the app gateway. Give the minimum instance count a 20% to 30% buffer.
  • For critical workloads, pre-scale the pods and temporarily disable HPA to avoid an unexpected scale-down before the peak load. Re-enable HPA or scale the pods down after the peak.
  • Ensure the AKS cluster has enough resources and that critical pods have the proper QoS class so they won’t be evicted unexpectedly.
  • Plan a proper time for rolling updates.