To keep a node stable, the kubelet proactively evicts some pods to reclaim resources when a resource (memory/storage) becomes scarce. The component that implements this is the Eviction Manager.
When evicting a pod, the kubelet kills all of the pod's containers and sets the pod's phase to Failed. If the evicted pod is managed by a controller (a Deployment, for example), a replacement pod may then be scheduled onto another node.
You can define thresholds to tell the kubelet under which conditions it should evict pods. There are two kinds of thresholds (see the flag example below):
Soft eviction thresholds - crossing the threshold does not trigger eviction immediately; the kubelet waits for a user-configured grace period before evicting.
Hard eviction thresholds - pods are killed immediately, with no grace period.
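For example, thresholds can be configured through kubelet flags; the values below are purely illustrative:

--eviction-hard=memory.available<100Mi              # hard: evict as soon as available memory drops below 100Mi
--eviction-soft=memory.available<300Mi              # soft: only counts once crossed continuously...
--eviction-soft-grace-period=memory.available=1m30s # ...for the 1m30s grace period
--eviction-max-pod-grace-period=30                  # grace period (seconds) given to pods killed by a soft eviction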
For more on eviction, see the official documentation: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
Implementation
The Eviction Manager code lives in the /pkg/kubelet/eviction package; its core logic is the managerImpl.synchronize method. The eviction manager calls synchronize periodically from a dedicated goroutine to carry out evictions.
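That surrounding loop is roughly the following shape. This is a simplified sketch (parameter list, imports, and logging abridged); the real Start method takes a few more arguments and also waits for evicted pods to be cleaned up before the next round:

// Simplified sketch of the monitoring loop that drives synchronize.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider, monitoringInterval time.Duration) {
	go func() {
		for {
			if evictedPods := m.synchronize(diskInfoProvider, podFunc, capacityProvider); evictedPods != nil {
				glog.Infof("eviction manager: evicted %d pod(s)", len(evictedPods))
			} else {
				// nothing to do this round; sleep until the next monitoring interval
				time.Sleep(monitoringInterval)
			}
		}
	}()
}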
synchronize performs the following steps:
1. Initialize configuration
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
	// 1. Read all configured thresholds
	thresholds := m.config.Thresholds
	if len(thresholds) == 0 {
		return nil
	}
	....
	// 2. Initialize the rank funcs / reclaim funcs
	if m.dedicatedImageFs == nil {
		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
		if ok != nil {
			return nil
		}
		m.dedicatedImageFs = &hasImageFs
		m.resourceToRankFunc = buildResourceToRankFunc(hasImageFs)
		m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
	}
	// 3. Get the active pods via the function passed in at construction time
	activePods := podFunc()
	// 4. Collect the current resource usage via the summary provider
	observations, statsFunc, err := makeSignalObservations(m.summaryProvider, capacityProvider, activePods)
	....
	// 5. Use memcg notifications to react to memory pressure faster;
	//    only entered while notifiersInitialized is still false
	if m.config.KernelMemcgNotification && !m.notifiersInitialized {
		....
		m.notifiersInitialized = true // initialization done
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
			// callback: a memcg notification triggers synchronize immediately
			glog.Infof("hard memory eviction threshold crossed at %s", desc)
			m.synchronize(diskInfoProvider, podFunc, capacityProvider)
		})
		....
	}
	....
}
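The observations built in step 4 can be thought of as one snapshot per eviction signal (for example memory.available): how much of the resource is currently available, the total capacity, and when the stats sample was taken. Roughly, as a sketch (field layout shown for illustration; see the package source for the exact definition):

package eviction

import (
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// signalObservation (sketch): what the eviction manager records per signal,
// e.g. for memory.available it holds the node's total capacity, the amount
// currently available, and the time of the underlying stats sample.
type signalObservation struct {
	capacity  *resource.Quantity
	available *resource.Quantity
	time      metav1.Time
}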
2. Compute thresholds
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
	....
	// 1. Compare the configured thresholds against the current observations
	//    to find the thresholds that are crossed
	thresholds = thresholdsMet(thresholds, observations, false)
	// 2. Merge in the thresholds computed in the previous round
	if len(m.thresholdsMet) > 0 {
		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
	}
	// 3. Filter out soft thresholds that are not yet truly active
	now := m.clock.Now()
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
	....
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
	// 4. Store the results
	m.Lock()
	m.nodeConditions = nodeConditions
	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
	m.thresholdsMet = thresholds
	// determine the set of thresholds whose stats have been updated since the last sync
	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
	debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)
	m.lastObservations = observations
	m.Unlock()
	...
}
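Step 3 above is what distinguishes soft from hard thresholds: a threshold only remains in the set once it has been observed continuously for longer than its grace period, and hard thresholds effectively have a zero grace period. A minimal, self-contained illustration of that rule (not kubelet source):

package main

import (
	"fmt"
	"time"
)

// thresholdActive reports whether a threshold first observed at firstObservedAt
// should count as "met" at time now, given its grace period.
func thresholdActive(firstObservedAt, now time.Time, gracePeriod time.Duration) bool {
	return now.Sub(firstObservedAt) >= gracePeriod
}

func main() {
	now := time.Now()
	firstObserved := now.Add(-45 * time.Second) // pressure has been observed for 45s

	fmt.Println(thresholdActive(firstObserved, now, 0))              // hard threshold: active immediately
	fmt.Println(thresholdActive(firstObserved, now, 90*time.Second)) // soft threshold with 90s grace: not yet active
}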
3. Select the resource to consider in this eviction round
In each eviction round, the kubelet kills at most one pod. Because the Eviction Manager handles scarcity of several resource types at once (memory/storage), it first picks the resource type this round will act on, then ranks the active pods by their usage of that resource and selects the pod to kill.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
	....
	// 1. Collect all resource types that are currently starved
	starvedResources := getStarvedResources(thresholds)
	if len(starvedResources) == 0 {
		glog.V(3).Infof("eviction manager: no resources are starved")
		return nil
	}
	// 2. Sort the starved resources and pick one of them
	sort.Sort(byEvictionPriority(starvedResources))
	resourceToReclaim := starvedResources[0]
	// determine if this is a soft or hard eviction associated with the resource
	softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)
	.....
	// 3. Rank the pods by their usage of the selected resource
	rank, ok := m.resourceToRankFunc[resourceToReclaim]
	....
	rank(activePods, statsFunc)
	....
}
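The rank function decides the eviction order for the chosen resource. As a rough illustration only (the kubelet's real rank functions also take QoS and other factors into account), ranking for memory pressure can be pictured as sorting pods by how far their usage exceeds their requests:

package main

import (
	"fmt"
	"sort"
)

// podUsage is an illustrative stand-in for the per-pod stats the kubelet
// gathers from the summary API; it is not a kubelet type.
type podUsage struct {
	name         string
	usageBytes   int64 // current memory usage
	requestBytes int64 // memory requested by the pod
}

// rankByMemoryOverRequest orders pods so that the pod exceeding its memory
// request by the largest amount comes first (and would be evicted first).
func rankByMemoryOverRequest(pods []podUsage) {
	sort.Slice(pods, func(i, j int) bool {
		return pods[i].usageBytes-pods[i].requestBytes > pods[j].usageBytes-pods[j].requestBytes
	})
}

func main() {
	pods := []podUsage{
		{"a", 300 << 20, 200 << 20}, // 100Mi over its request
		{"b", 500 << 20, 100 << 20}, // 400Mi over its request -> ranked first
		{"c", 150 << 20, 200 << 20}, // below its request
	}
	rankByMemoryOverRequest(pods)
	fmt.Println(pods[0].name) // prints "b"
}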
4. Kill the pod
The first pod in the ranked list is killed:
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
	....
	for i := range activePods {
		pod := activePods[i]
		...
		status := v1.PodStatus{
			Phase:   v1.PodFailed,
			Message: fmt.Sprintf(message, resourceToReclaim),
			Reason:  reason,
		}
		....
		gracePeriodOverride := int64(0)
		if softEviction {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		// actually kill the pod
		err := m.killPodFunc(pod, status, &gracePeriodOverride)
		if err != nil {
			glog.Warningf("eviction manager: error while evicting pod %s: %v", format.Pod(pod), err)
		}
		return []*v1.Pod{pod}
	}
	glog.Infof("eviction manager: unable to evict any pods from the node")
	return nil
}