To keep a node stable, the kubelet proactively evicts some pods to reclaim resources when a resource (memory/storage) becomes scarce. The component that implements this is the Eviction Manager.
When evicting a pod, the kubelet kills all of the pod's containers and sets the pod's phase to Failed. If the evicted pod is managed by a controller (a Deployment, for example), a replacement pod may then be scheduled onto another node.
You can define thresholds to tell the kubelet under which conditions it should evict pods. There are two kinds of thresholds (see the flag example below):
Soft eviction thresholds - crossing the threshold does not trigger eviction immediately; the kubelet waits for a user-configured grace period before evicting.
Hard eviction thresholds - pods are killed immediately, with no grace period.
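For example, thresholds can be configured through kubelet flags; the values below are purely illustrative:

--eviction-hard=memory.available<100Mi              # hard: evict as soon as available memory drops below 100Mi
--eviction-soft=memory.available<300Mi              # soft: only counts once crossed continuously...
--eviction-soft-grace-period=memory.available=1m30s # ...for the 1m30s grace period
--eviction-max-pod-grace-period=30                  # grace period (seconds) given to pods killed by a soft eviction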
For more on eviction, see the official documentation: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
Implementation
The Eviction Manager code lives in the /pkg/kubelet/eviction package; its core logic is the managerImpl.synchronize method. The eviction manager calls synchronize periodically from a dedicated goroutine to carry out evictions.
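That surrounding loop is roughly the following shape. This is a simplified sketch (parameter list, imports, and logging abridged); the real Start method takes a few more arguments and also waits for evicted pods to be cleaned up before the next round:

// Simplified sketch of the monitoring loop that drives synchronize.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider, monitoringInterval time.Duration) {
	go func() {
		for {
			if evictedPods := m.synchronize(diskInfoProvider, podFunc, capacityProvider); evictedPods != nil {
				glog.Infof("eviction manager: evicted %d pod(s)", len(evictedPods))
			} else {
				// nothing to do this round; sleep until the next monitoring interval
				time.Sleep(monitoringInterval)
			}
		}
	}()
}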
synchronize performs the following steps:
1. Initialize configuration
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
	// 1. Read all configured thresholds
	thresholds := m.config.Thresholds
	if len(thresholds) == 0 {
		return nil
	}
	....
	// 2. Initialize the rank funcs / reclaim funcs
	if m.dedicatedImageFs == nil {
		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
		if ok != nil {
			return nil
		}
		m.dedicatedImageFs = &hasImageFs
		m.resourceToRankFunc = buildResourceToRankFunc(hasImageFs)
		m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
	}
	// 3. Get the active pods via the function passed in at construction time
	activePods := podFunc()
	// 4. Collect the current resource usage via the summary provider
	observations, statsFunc, err := makeSignalObservations(m.summaryProvider, capacityProvider, activePods)
	....
	// 5. Use memcg notifications to react to memory pressure faster;
	//    only entered while notifiersInitialized is still false
	if m.config.KernelMemcgNotification && !m.notifiersInitialized {
		....
		m.notifiersInitialized = true // initialization done
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
			// callback: a memcg notification triggers synchronize immediately
			glog.Infof("hard memory eviction threshold crossed at %s", desc)
			m.synchronize(diskInfoProvider, podFunc, capacityProvider)
		})
		....
	}
	....
}
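The observations built in step 4 can be thought of as one snapshot per eviction signal (for example memory.available): how much of the resource is currently available, the total capacity, and when the stats sample was taken. Roughly, as a sketch (field layout shown for illustration; see the package source for the exact definition):

package eviction

import (
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// signalObservation (sketch): what the eviction manager records per signal,
// e.g. for memory.available it holds the node's total capacity, the amount
// currently available, and the time of the underlying stats sample.
type signalObservation struct {
	capacity  *resource.Quantity
	available *resource.Quantity
	time      metav1.Time
}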
2. Compute thresholds
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
	....
	// 1. Compare the configured thresholds against the current observations
	//    to find the thresholds that are crossed
	thresholds = thresholdsMet(thresholds, observations, false)
	// 2. Merge in the thresholds computed in the previous round
	if len(m.thresholdsMet) > 0 {
		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
	}
	// 3. Filter out soft thresholds that are not yet truly active
	now := m.clock.Now()
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
	....
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
	// 4. Store the results
	m.Lock()
	m.nodeConditions = nodeConditions
	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
	m.thresholdsMet = thresholds
	// determine the set of thresholds whose stats have been updated since the last sync
	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
	debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)
	m.lastObservations = observations
	m.Unlock()
	...
}
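Step 3 above is what distinguishes soft from hard thresholds: a threshold only remains in the set once it has been observed continuously for longer than its grace period, and hard thresholds effectively have a zero grace period. A minimal, self-contained illustration of that rule (not kubelet source):

package main

import (
	"fmt"
	"time"
)

// thresholdActive reports whether a threshold first observed at firstObservedAt
// should count as "met" at time now, given its grace period.
func thresholdActive(firstObservedAt, now time.Time, gracePeriod time.Duration) bool {
	return now.Sub(firstObservedAt) >= gracePeriod
}

func main() {
	now := time.Now()
	firstObserved := now.Add(-45 * time.Second) // pressure has been observed for 45s

	fmt.Println(thresholdActive(firstObserved, now, 0))              // hard threshold: active immediately
	fmt.Println(thresholdActive(firstObserved, now, 90*time.Second)) // soft threshold with 90s grace: not yet active
}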
3. Select the resource to consider in this eviction round
In each eviction round, the kubelet kills at most one pod. Because the Eviction Manager handles scarcity of several resource types at once (memory/storage), it first picks the resource type this round will act on, then ranks the active pods by their usage of that resource and selects the pod to kill.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
	....
	// 1. Collect all resource types that are currently starved
	starvedResources := getStarvedResources(thresholds)
	if len(starvedResources) == 0 {
		glog.V(3).Infof("eviction manager: no resources are starved")
		return nil
	}
	// 2. Sort the starved resources and pick one of them
	sort.Sort(byEvictionPriority(starvedResources))
	resourceToReclaim := starvedResources[0]
	// determine if this is a soft or hard eviction associated with the resource
	softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)
	.....
	// 3. Rank the pods by their usage of the selected resource
	rank, ok := m.resourceToRankFunc[resourceToReclaim]
	....
	rank(activePods, statsFunc)
	....
}
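The rank function decides the eviction order for the chosen resource. As a rough illustration only (the kubelet's real rank functions also take QoS and other factors into account), ranking for memory pressure can be pictured as sorting pods by how far their usage exceeds their requests:

package main

import (
	"fmt"
	"sort"
)

// podUsage is an illustrative stand-in for the per-pod stats the kubelet
// gathers from the summary API; it is not a kubelet type.
type podUsage struct {
	name         string
	usageBytes   int64 // current memory usage
	requestBytes int64 // memory requested by the pod
}

// rankByMemoryOverRequest orders pods so that the pod exceeding its memory
// request by the largest amount comes first (and would be evicted first).
func rankByMemoryOverRequest(pods []podUsage) {
	sort.Slice(pods, func(i, j int) bool {
		return pods[i].usageBytes-pods[i].requestBytes > pods[j].usageBytes-pods[j].requestBytes
	})
}

func main() {
	pods := []podUsage{
		{"a", 300 << 20, 200 << 20}, // 100Mi over its request
		{"b", 500 << 20, 100 << 20}, // 400Mi over its request -> ranked first
		{"c", 150 << 20, 200 << 20}, // below its request
	}
	rankByMemoryOverRequest(pods)
	fmt.Println(pods[0].name) // prints "b"
}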
4. Kill the pod
The first pod in the ranked list is killed:
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
	....
	for i := range activePods {
		pod := activePods[i]
		...
		status := v1.PodStatus{
			Phase:   v1.PodFailed,
			Message: fmt.Sprintf(message, resourceToReclaim),
			Reason:  reason,
		}
		....
		gracePeriodOverride := int64(0)
		if softEviction {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		// actually kill the pod
		err := m.killPodFunc(pod, status, &gracePeriodOverride)
		if err != nil {
			glog.Warningf("eviction manager: error while evicting pod %s: %v", format.Pod(pod), err)
		}
		return []*v1.Pod{pod}
	}
	glog.Infof("eviction manager: unable to evict any pods from the node")
	return nil
}