Node Allocatable in AKS

Kubernetes’ Node Allocatable feature allows the cluster to reserve node resources for the OS system daemons and for Kubernetes itself. For example, when I ran kubectl describe node for a node in my AKS cluster, I got the following capacity-related output. The size of this node was Standard DS2 v2, which has 2 CPU cores and 7 GB of memory.

Capacity:
  attachable-volumes-azure-disk:  8
  cpu:                            2
  ephemeral-storage:              101445900Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         7113660Ki
  pods:                           110
Allocatable:
  attachable-volumes-azure-disk:  8
  cpu:                            1900m
  ephemeral-storage:              93492541286
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         4668348Ki
  pods:                           110

From the output, out of the 2 cores and 7 GB of memory, 1900 millicores and about 66% of the memory (4668348/7113660) were allocatable to pods. The details of how these numbers are calculated are described in this document. In short, the node allocatable configuration in AKS is as follows:

CPU

CPU cores on host             1    2    4    8    16   32   64
Kube-reserved (millicores)    60   100  140  180  260  420  740

Memory

  • eviction-hard: 750Mi, which is the default configuration of the upstream aks-engine.
  • kube-reserved: regressive rate
    • 25% of the first 4 GB of memory
    • 20% of the next 4 GB of memory (up to 8 GB)
    • 10% of the next 8 GB of memory (up to 16 GB)
    • 6% of the next 112 GB of memory (up to 128 GB)
    • 2% of any memory above 128 GB
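To make the memory numbers concrete, here is a small sketch (a helper of my own, not part of any AKS tooling) that applies the regressive rates above to a node's total memory:

```python
def kube_reserved_memory_gb(total_gb):
    """Apply the regressive kube-reserved rates to a node's total memory (GB)."""
    # (bracket size in GB, reservation rate), in order
    brackets = [(4, 0.25), (4, 0.20), (8, 0.10), (112, 0.06), (float("inf"), 0.02)]
    reserved = 0.0
    remaining = total_gb
    for size, rate in brackets:
        chunk = min(remaining, size)
        reserved += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return reserved

# For the 7 GB DS2 v2 node: 4 * 0.25 + 3 * 0.20 = 1.6 GB kube-reserved.
# Allocatable memory is then roughly total - kube-reserved - 750 Mi eviction-hard.
```

For the node above, 7113660 Ki minus roughly 1.6 GB of kube-reserved memory and the 750 Mi eviction threshold lands close to the reported 4668348 Ki allocatable.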

However, the above memory reservation doesn’t seem to apply to Windows nodes. The following was the output when I ran kubectl describe node for a Windows node. More memory was reserved on Windows nodes.

Capacity:
  attachable-volumes-azure-disk:  8
  cpu:                            2
  ephemeral-storage:              133703676Ki
  memory:                         7339572Ki
  pods:                           30
Allocatable:
  attachable-volumes-azure-disk:  8
  cpu:                            1900m
  ephemeral-storage:              133703676Ki
  memory:                         3565108Ki
  pods:                           30

The document doesn’t have any information regarding Windows nodes. GKE reserves approximately 1.5 times more resources on Windows Server nodes, but I’m not sure if it’s the same for AKS. I’ve opened an issue to ask for further information.

This post provides a good comparison of reserved resources across the 3 major cloud offerings: AKS, GKE and EKS.

PowerToys for Windows 10

I’ve been using Microsoft PowerToys on my work machine for some time and find myself relying on it more and more every day. The name PowerToys is a very old one on Windows: the 1st version of PowerToys shipped with Windows 95 more than two decades ago. It’s a fantastic idea to build the new set of productivity tools on top of a legacy brand. It feels like the good old days are finally connected to the new era, and the history continues.

I cannot remember exactly when I last used PowerToys for Windows. It was probably around the time of Windows 98, when I was still in college, a long time ago. Honestly, I was not a fan of it back then. Although I’ve forgotten why I didn’t like it, it was not something I had to install on my machine. But PowerToys for Windows 10 has changed my mind, and it has made its way onto my must-install software list.

The two features I use most in PowerToys are FancyZones and File Explorer Preview. As I’m using a 4K monitor, FancyZones helps me better utilize the screen space; it even makes the laptop screen feel redundant, so I keep it turned off most of the time. File Explorer Preview is an add-on for Windows File Explorer that lets me preview file content directly in the explorer. The best part is that it supports markdown preview, so now I can read README files without opening them in a markdown editor.

Another tool that I am going to use frequently is PowerToys Run, which was just released in version 0.18. I’ve always wanted a tool to launch apps quickly and have already tried some 3rd-party launchers. PowerToys Run looks quite promising.

PowerToys is quite stable; I haven’t hit any problems since I started using it. So if you are on Windows 10 and haven’t tried it yet, maybe it’s time. 🙂

Return HTTP 405 When the HTTP Method Doesn’t Match in Azure API Management

In Azure API Management, when the HTTP method of a request doesn’t match the one defined in the corresponding operation, APIM returns HTTP 404 Resource Not Found to the client. For example, the OOTB Echo API defines the Retrieve resource (cached) operation with HTTP GET. If you call it with HTTP POST, you’ll get HTTP 404 Resource Not Found in the response.

The HTTP 404 returned by APIM in this scenario doesn’t strictly follow the definition of the HTTP status codes. According to the definition, HTTP 405 Method Not Allowed is designated for this situation. Feedback about this issue has been given to the APIM team, and according to their response it will be addressed in the future. Until then, we have to work around the issue. Here is how you can do it with policies in APIM.

Handle the error

When APIM fails to identify an API or operation, it raises a configuration error, which returns HTTP 404. What we need to do is handle this error and change the status code to HTTP 405. This way, you avoid the overhead of creating an operation for each HTTP method just to handle this situation. The next question is the scope at which the error should be handled. Depending on the configuration of your APIM instance, you can handle the error at either the all operations or the all APIs scope.

The policy code

The following policy code is a sample for the Echo API at the all operations scope.

<on-error>
    <base />
    <choose>
        <when condition="@(context.LastError.Source == &quot;configuration&quot; &amp;&amp; context.Request.Url.Path == &quot;/echo/resource-cached&quot;)">
            <return-response>
                <set-status code="405" reason="Method not allowed" />
                <set-body>@{
                    return new JObject(
                        new JProperty("status", "HTTP 405"),
                        new JProperty("message", "Method not allowed")
                    ).ToString();
                }</set-body>
            </return-response>
        </when>
        <otherwise />
    </choose>
</on-error>

The tricky part is the <when> condition. The first part of the condition checks whether this is a configuration error. If it is, the second part tests whether the request went to the Retrieve resource (cached) operation. The second test avoids converting a genuine HTTP 404 into a 405.

You may wonder why I used context.Request rather than context.Operation to test which operation it is. The reason is APIM sets context.Operation to null in this case because it cannot identify which operation it is (and that is why the configuration error happens).

You can use this workaround to return HTTP 405 until APIM fixes its behavior.

Nineteen Years

I just saw a post on LinkedIn celebrating SharePoint’s 19th birthday. Jeff Teper, one of SharePoint’s founders, said when resharing it that next year everyone should come celebrate the 20th together. I suddenly felt a little sentimental.

SharePoint was the first server-side product I focused on after joining Microsoft. For roughly ten years my work revolved around it, and I grew quite attached to it. Many years ago, I wrote a few blog posts introducing SharePoint’s early history (A Brief History of SharePoint I, II, III). In 2011, I gave up my Chinese New Year vacation and went to Redmond for the training and exam of what was then the highest-level SharePoint certification, Microsoft Certified Master for SharePoint. At one point I even believed my career would keep developing alongside SharePoint.

But in 2015, when Microsoft began its transition to the cloud in earnest, I suddenly found that the SharePoint experience I had accumulated no longer seemed to have a place. Once SharePoint became a SaaS offering as part of Office 365, a SharePoint consultant’s value to customers dropped sharply. SaaS is plug-and-play: customers no longer need to deploy and manage on-premises servers, worry about database and storage performance, or know the best practices of SharePoint operations. The customizability of SaaS is also greatly reduced, so customers can no longer use SharePoint as a development platform to build all kinds of applications. That was when I realized my bond with SharePoint had run its course. As the company transformed, it was time for me to transform too.

In recent years I haven’t worked on SharePoint-related projects or followed its progress; my focus has shifted to Azure. The web service and database development experience I accumulated with SharePoint still applies to Azure projects. As one of the core services of Office 365, SharePoint should continue to do well, but I probably won’t be at its 20th birthday party.

Azure Batch – Create a Custom Pool using Shared Image Gallery

If you have a custom image in a Shared Image Gallery and you want to use it to create a pool in Azure Batch, this document, Use the Shared Image Gallery to create a custom pool, provides a pretty good guidance for it. I followed it to test the scenario and hit two minor issues.

  • As mentioned in another doc, AAD authentication is a prerequisite for using a shared image gallery. If you use --shared-key-auth with az batch account login, you will hit an authentication error with the Azure CLI. I raised an issue for the document and hopefully a note will be added to it.
  • There is no sample code to demonstrate how to create a pool with a shared image gallery in Python.

So I wrote a simple sample in Python. It is based on the latest version (9.0.0) of the Azure Batch package for Python, and it uses a service principal for AAD authentication. The custom image I used for the test was built on top of Ubuntu 18.04-LTS, so the node agent SKU is ubuntu 18.04. It needs to be changed accordingly if another OS version is used.

# Import the required modules from the
# Azure Batch Client Library for Python
import azure.batch._batch_service_client as batch
import azure.batch.models as batchmodels
from azure.common.credentials import ServicePrincipalCredentials

# Specify Batch account credentials
account = "<batch-account-name>"
batch_url = "<batch-account-url>"
ad_client_id = "<client id of the SP>"
ad_tenant = "<tenant id>"
ad_secret = "<secret of the SP>"

# Pool settings
pool_id = "LinuxNodesSamplePoolPython"
vm_size = "STANDARD_D2_V3"
node_count = 1

# Initialize the Batch client
creds = ServicePrincipalCredentials(
    client_id=ad_client_id,
    secret=ad_secret,
    tenant=ad_tenant,
    resource="https://batch.core.windows.net/"
)
client = batch.BatchServiceClient(creds, batch_url)

# Create the unbound pool
new_pool = batchmodels.PoolAddParameter(id=pool_id, vm_size=vm_size)
new_pool.target_dedicated_nodes = node_count

# Configure the start task for the pool; run it
# elevated as an admin auto-user
start_task = batchmodels.StartTask(
    command_line="printenv AZ_BATCH_NODE_STARTUP_DIR",
    user_identity=batchmodels.UserIdentity(
        auto_user=batchmodels.AutoUserSpecification(
            scope=batchmodels.AutoUserScope.pool,
            elevation_level=batchmodels.ElevationLevel.admin
        )
    )
)
new_pool.start_task = start_task

# Create an ImageReference which specifies the Marketplace
# virtual machine image to install on the nodes.
ir = batchmodels.ImageReference(
    virtual_machine_image_id="<resource id of the image version in sig>"
)

# Create the VirtualMachineConfiguration, specifying
# the VM image reference and the Batch node agent to
# be installed on the node.
vmc = batchmodels.VirtualMachineConfiguration(
    image_reference=ir,
    node_agent_sku_id="batch.node.ubuntu 18.04"
)

# Assign the virtual machine configuration to the pool
new_pool.virtual_machine_configuration = vmc

# Create pool in the Batch service
client.pool.add(new_pool)

Update: I polished the above sample code and pushed it into the document I mentioned at the beginning of this post via a PR. The Python sample code in that document is based on the one in this post.

Out of Stock

I have basically been working from home since just before Chinese New Year. At the beginning of the year, having changed jobs and no longer needing to travel much, I had planned to spend more time in the office this year, make more use of the big curved monitor on my desk there, and have more meals with colleagues. But plans can’t keep up with change: once COVID-19 arrived, working from home became the only option, and I hardly go to the office anymore.

COVID-19 doesn’t look like it will pass anytime soon and working from home will continue for a while, so I thought about replacing the Dell monitor on my desk that I have used for several years. It was a cheap one I bought years ago: HD resolution, but short on color and brightness. I originally bought it for a Raspberry Pi, and it isn’t great to work on. I planned to replace it with a 4K monitor.

After a bit of research online, I found that this LG monitor offered good value for money: 32 inches, 4K, a height-adjustable stand, and around SGD 650. Not expensive, and a perfect fit for my needs.

LG 32UK550

But after visiting all the major electronics stores in Singapore, I found this model out of stock everywhere. That’s when it dawned on me: everyone is working from home now, and monitors, which few people paid attention to before, have suddenly become hot items. On top of that, the outbreaks in China and South Korea have disrupted supply chains. With demand up and supply short, of course they are out of stock. And it’s not just Singapore: I heard from colleagues in Australia that monitors are sold out there too. It seems that besides masks and toilet paper, people are snapping up monitors as well.

No one could have imagined that 2020 would begin like this. A saying circulated online last year: “2019 was the worst year of the past decade, and the best year of the decade to come.” It looks like it may prove prophetic. COVID-19 is a natural disaster and even more a man-made one, yet even in the face of such a global crisis, the major powers have failed to unite and cooperate, instead shifting blame onto each other. Politicians chase votes with extremes and no longer even bother to hide blatant racism. With the global economy in decline and a new financial crisis at the door, will this world be all right?

The epidemic situation in Singapore is not optimistic either, with another 47 confirmed cases today. Singapore held off the first wave from China, but it doesn’t look like it can hold off the world this time.

My dotfiles

Recently I’ve been spending more and more time on WSL 2. I run Ubuntu 18.04 on WSL 2 and mainly use tools such as zsh, tmux and vim. To make the environment more fun to work in, I’ve customized them little by little, and now it looks like this:

my terminal

As there are several customized dotfiles now, I put them together and created a simple install script, so a single run gets the environment ready. Now I have my own dotfiles repo and it is here. I will update it if I make other changes to these dotfiles in the future.

And I also shared the profile of my Windows Terminal here. Now I can easily get Windows Terminal and WSL 2 environment configured on any Windows 10 machine.

CKAD Certified

Early this week, I took my 2nd try at the CKAD exam and passed it. I scored 89% this time, despite running out of time and not finishing the last question completely. Now I am CKAD certified!

I had my first try at the exam in Aug last year without much preparation. As CNCF gives you a free 2nd chance if the 1st attempt is not successful, I treated the 1st exam as an opportunity to test my Kubernetes knowledge, get familiar with the test environment, and get a sense of how difficult it is. I scored 64% on the 1st exam, which was 2% short of the passing score.

Late last year I was busy with work, and only recently did I get some time to properly prepare for and retake the exam. I spent about 2 weeks polishing my skills with kubectl and other tools, and used the resources in the following two github repos heavily.

Here are several tips regarding the exam which I hope could help those who are preparing for the certificate.

  • The only tools that you can use in the exam environment are kubectl, vim and tmux. So be very familiar with them.
  • In the exam, you are allowed to open another browser tab to connect to https://kubernetes.io/docs, but you may not have enough time to read the docs in detail. I relied on kubectl explain more than checking the docs.
  • As the exam environment runs in Chrome, a big screen definitely helps.
  • Most importantly, a lot of practice. My shell history showed that I typed k/kubectl 1372 times and vim 429 times in the 2 weeks before the exam.

American-Style Italian Food

Even though it was a Monday, there was still a long queue at dinner time at the Din Tai Fung on the second floor of Lincoln Square, and tables were hard to get. This Din Tai Fung has been open for about 10 years, and as far as I can remember I have only ever eaten there at off-peak hours, say three or four in the afternoon. I have never managed to get a table during lunch or dinner time. Today was no exception, so I turned around and went downstairs to Maggiano’s Little Italy.

Maggiano’s Little Italy at Lincoln Square

This is an American-style Italian restaurant, something of an old establishment here in Bellevue. The first time I ate there was in 2011, also because I couldn’t get a table at Din Tai Fung. I was in Redmond for training then; it was winter as well, and the training was intensive, with exams, and quite tiring. The Bellevue Din Tai Fung had only just opened at the time. When the weekend finally came, I wanted a proper Chinese meal at Din Tai Fung, but its queue filled the whole corridor. I really had no time to wait in line, and that is how I discovered Maggiano’s Little Italy.

One defining trait of American-style Italian food is the portion size. The plates Maggiano’s serves pasta on are huge; one plate of pasta here would probably fill two plates in Italy. The consequence is that when dining alone, there is no way to order several dishes to try, as a single main course fills you up.

I had always assumed Maggiano’s was a local Bellevue restaurant. Only after looking at their website just now did I learn that it is actually a chain that started in Chicago, and Bellevue seems to be its only location in Washington state. The story on their website is rather unremarkable, without even the founders’ names, which was a little disappointing.

Configuring VNET integration of Azure Redis and Web App

To configure Azure Redis with VNET support, we can follow the steps described in this document, and to integrate an Azure web app with a VNET, there is a detailed document as well. In this post, I’ve listed some of the common issues that one might hit during the configuration.

  1. The VNET integration for Azure Redis requires an empty subnet in a VNET created with the Resource Manager deployment model. This subnet needs to be created before you create the Azure Redis instance; otherwise, the configuration will fail.
  2. The subnet for Azure Redis can be protected with a network security group (NSG). Usually the default NSG rules are good enough for protecting the connections. If you need further hardening, you will have to create rules based on the ports listed in the Azure Redis document.
  3. To troubleshoot the connection between the Azure web app and Azure Redis, you can use the web app’s Kudu console. Two tools are built into the web app for network troubleshooting: nameresolver.exe can be used to test DNS resolution, and tcpping.exe can be used to test whether a host and port are reachable. However, you cannot test the Redis functionality directly from Kudu.
  4. Once the VNET integration is configured, the Redis console in Azure Portal will not work anymore. To test the Redis functions with tools such as redis-cli, you will have to build a VM in the VNET and connect to Azure Redis from it.
  5. If your web app cannot access Azure Redis even though the network configurations are correct, try syncing the network for the App Service Plan. See this issue for details, and make sure you don’t hit any errors when syncing the network.
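When testing from a VM in the VNET, a plain TCP connect check is a rough stand-in for tcpping.exe: it tells you whether the Redis endpoint is reachable, though not whether Redis itself is healthy. A minimal sketch in Python (the cache host name below is a hypothetical example):

```python
import socket

def tcp_ping(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds
    within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical cache name; 6380 is the default SSL port of Azure Redis.
# tcp_ping("mycache.redis.cache.windows.net", 6380)
```

To verify the Redis protocol end to end, you would still connect with redis-cli from that VM as described above.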