K8S: iptables and IPVS rules

Study notes only.

daydayup9527 · published 2022-07-15 22:49:02

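K8S: understanding iptables and IPVS

Traffic forwarding and load balancing with iptables

How does iptables forward traffic?
>DNAT maps an IP address and port to a new destination
iptables -t nat -A PREROUTING -d 1.2.3.4/32 -p tcp --dport 80 -j DNAT --to-destination 10.20.30.40:8080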

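How does iptables do load balancing?
>The statistic module assigns each backend a weight (a match probability); Kubernetes does not use weights by default and picks a backend at random
iptables -t nat -A PREROUTING -d 1.2.3.4 -p tcp --dport 80 -m statistic --mode random --probability .25 -j DNAT --to-destination 10.20.30.40:8080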
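
Because rules are tried in order, the probability on each rule is not simply 1/N. A minimal sketch, assuming four equal backends (which would explain the .25 on the first rule); the function names are made up for illustration:

```python
from fractions import Fraction

def rule_probabilities(n_backends):
    """--probability for rule i (0-based) so that every backend gets 1/n overall."""
    return [Fraction(1, n_backends - i) for i in range(n_backends)]

def effective_shares(probabilities):
    """Share of traffic each rule really receives when rules are tried in order."""
    shares, remaining = [], Fraction(1)
    for p in probabilities:
        shares.append(remaining * p)   # only traffic that reached this rule can match
        remaining *= 1 - p             # the rest falls through to the next rule
    return shares

probs = rule_probabilities(4)
print([str(p) for p in probs])
print([str(s) for s in effective_shares(probs)])
```

The last rule always has probability 1, so in practice it is written without a statistic match at all.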

How does iptables keep session affinity?
>The recent module sets how long a session is kept
iptables -t nat -A FOO -m recent --rcheck --seconds 3600 --reap --name BAR -j BAR
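
To see what --rcheck, --seconds and --reap do together, here is a toy model of one named recent list. It is only a sketch of the semantics as I understand them, not how the kernel module is written; the class and the sample address are hypothetical:

```python
class RecentList:
    """Toy model of one named list kept by the iptables `recent` match."""

    def __init__(self):
        self.last_seen = {}  # source IP -> timestamp of the last recorded packet

    def set(self, src_ip, now):
        """Like --set: record (or refresh) the source address."""
        self.last_seen[src_ip] = now

    def rcheck(self, src_ip, now, seconds, reap=False):
        """Like --rcheck --seconds N: true if src_ip was seen within N seconds."""
        if reap:
            # Like --reap: purge entries older than the window.
            self.last_seen = {ip: t for ip, t in self.last_seen.items()
                              if now - t <= seconds}
        seen = self.last_seen.get(src_ip)
        return seen is not None and now - seen <= seconds

bar = RecentList()
bar.set("10.0.0.7", now=0)
print(bar.rcheck("10.0.0.7", now=1800, seconds=3600, reap=True))  # still inside the window
print(bar.rcheck("10.0.0.7", now=4000, seconds=3600, reap=True))  # expired and purged
```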

Examples of iptables in Kubernetes

kube-proxy modifies the filter and nat tables. It extends the iptables chains with five custom ones: KUBE-SERVICES, KUBE-NODEPORTS, KUBE-POSTROUTING, KUBE-MARK-MASQ and KUBE-MARK-DROP. Traffic routing is configured mainly by adding rules to the KUBE-SERVICES chain, which is attached to PREROUTING and OUTPUT.

Cluster IP: Port -> PREROUTING(OUTPUT) -> KUBE-SERVICES ->KUBE-SVC-XXX-> KUBE-SEP-XXX -> Pod IP:Target Port

Chain PREROUTING (policy ACCEPT)
target        prot    opt  source     destination 
KUBE-SERVICES all      --  0.0.0.0/0  0.0.0.0/0

Chain KUBE-SERVICES (2 references)
target         prot    opt  source     destination
KUBE-SVC-6IM33IEVEEV703GP tcp -- 0.0.0.0/0 10.20.30.40 tcp dpt:80

Chain KUBE-SVC-6IM33IEVEEV703GP (1 references)
target   prot opt source   destination
KUBE-SEP-Q3UCPZ54E6Q2R4UT all -- 0.0.0.0/0  0.0.0.0/0

Chain KUBE-SEP-Q3UCPZ54E6Q2R4UT (1 references)
target   prot opt source  destination
DNAT     tcp  --  0.0.0.0/0  0.0.0.0/0  tcp to:172.17.0.2:8080
This DNATs 10.20.30.40:80 to 172.17.0.2:8080.
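
The chain walk above can be replayed in a few lines. This is a simplified sketch: the CHAINS dict is a hand-written copy of the listing, protocol matching is ignored, and the sample source address is made up:

```python
# Hypothetical in-memory copy of the four chains listed above: (match, target).
CHAINS = {
    "PREROUTING": [({}, "KUBE-SERVICES")],
    "KUBE-SERVICES": [({"dst": "10.20.30.40", "dport": 80}, "KUBE-SVC-6IM33IEVEEV703GP")],
    "KUBE-SVC-6IM33IEVEEV703GP": [({}, "KUBE-SEP-Q3UCPZ54E6Q2R4UT")],
    "KUBE-SEP-Q3UCPZ54E6Q2R4UT": [({}, "DNAT:172.17.0.2:8080")],
}

def traverse(packet, chain="PREROUTING"):
    """Walk the chains in order; return the rewritten packet and the path taken."""
    path = [chain]
    while True:
        for match, target in CHAINS[chain]:
            if all(packet.get(k) == v for k, v in match.items()):
                break                      # first matching rule wins
        else:
            return packet, path            # no rule matched: packet is unchanged
        if target.startswith("DNAT:"):
            ip, port = target[len("DNAT:"):].rsplit(":", 1)
            return {**packet, "dst": ip, "dport": int(port)}, path
        chain = target
        path.append(chain)

rewritten, path = traverse({"src": "10.0.0.5", "dst": "10.20.30.40", "dport": 80})
print(" -> ".join(path))
print(rewritten)
```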

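Problems with iptables as a load balancer

·Latency from linear rule matching
-The KUBE-SERVICES chain has a long list of KUBE-SVC-* chains hanging off it; every access to a service walks the rules until one matches, so the time complexity is O(N)
·Latency of rule updates: updates are not incremental
·Scalability
-When the system holds a large number of iptables rules and chains, adding or deleting a rule (the kernel copies the full rule set out, changes it, then copies it back in) runs into the kernel lock
kernel lock: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
·Availability
-Scaling out backend instances, changing the service session-affinity time and similar updates all break existing connections.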
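
The O(N) point is easy to see by counting comparisons. A rough sketch, with made-up 10.96.x.x service addresses, comparing a sequential scan with a hash-table lookup:

```python
def build_services(n):
    """Hypothetical (cluster IP, port) pairs, one per service."""
    return [("10.96.%d.%d" % divmod(i, 256), 80) for i in range(n)]

def linear_match(rules, key):
    """Sequential scan: return (index, comparisons made)."""
    for comparisons, rule in enumerate(rules, start=1):
        if rule == key:
            return comparisons - 1, comparisons
    return None, len(rules)

def hash_match(table, key):
    """Single dict lookup, independent of table size on average."""
    return table.get(key)

rules = build_services(5000)
table = {rule: i for i, rule in enumerate(rules)}
last = rules[-1]
print(linear_match(rules, last))   # the last service costs 5000 comparisons
print(hash_match(table, last))     # one lookup
```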

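Latency of updating iptables rules

Where does the latency come from?
-Updates are non-incremental (full rewrites), even with the --noflush option of iptables-restore
-kube-proxy periodically syncs the iptables state:
*copy out all rules: iptables-save
*update the rules in memory
*write the rules back to the kernel: iptables-restore
*the kernel lock is held while the rules are updated

5K services (40K rules): adding one iptables rule takes 11 min.
20K services (160K rules): adding one iptables rule takes 5 h.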
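
A quick back-of-the-envelope check on the two figures above. With only two data points the exponent is a rough indication at best, but it does suggest the cost grows faster than linearly with the rule count:

```python
import math

# Figures quoted above: (services, rules, seconds to add one rule)
small = (5_000, 40_000, 11 * 60)
large = (20_000, 160_000, 5 * 3600)

rules_per_service = (small[1] / small[0], large[1] / large[0])
growth = large[1] / small[1]        # how many times more rules
slowdown = large[2] / small[2]      # how many times slower
exponent = math.log(slowdown) / math.log(growth)

print(rules_per_service)            # rules per service in both cases
print(round(slowdown, 1), round(exponent, 1))
```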

The periodic iptables refresh causes TPS jitter

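What is IPVS (IP Virtual Server)
	An L4 load balancer implemented in the Linux kernel; the load-balancing implementation behind LVS
	Built on netfilter, uses hash tables
	Supports TCP, UDP and SCTP over IPv4 and IPv6
	Supports many load-balancing policies: rr, wrr, lc, wlc, sh, dh, lblc...
	Supports session affinity through the persistent connection scheduling algorithm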
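
For orientation, here is roughly what three of the schedulers named above decide. These are simplified sketches (the real kernel schedulers also account for inactive connections and other details), and the server names are placeholders:

```python
import itertools

def round_robin(servers):
    """rr: hand out servers in a fixed rotation."""
    return itertools.cycle(servers)

def least_connection(active):
    """lc (simplified): pick the server with the fewest active connections."""
    return min(active, key=active.get)

def weighted_least_connection(active, weights):
    """wlc (simplified): minimise active connections divided by weight."""
    return min(active, key=lambda s: active[s] / weights[s])

rr = round_robin(["rs-a", "rs-b"])
print([next(rr) for _ in range(3)])
print(least_connection({"rs-a": 3, "rs-b": 1}))
print(weighted_least_connection({"rs-a": 3, "rs-b": 2}, {"rs-a": 3, "rs-b": 1}))
```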

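The three IPVS forwarding modes
Three LB modes are supported: Direct Routing (DR), Tunneling, NAT
- DR mode works at L2 and is the fastest, but does not support port mapping
- Tunneling mode encapsulates IP packets in IP packets (also called IPIP mode) and does not support port mapping
- In DR and Tunneling modes, return packets do not pass through the IPVS Director
- NAT mode supports port mapping and return packets pass through the IPVS Director; the stock kernel version only does DNAT, not SNAT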

L4 forwarding with IPVS
1. Bind the VIP
- dummy interface
# ip link add dev dummy0 type dummy
# ip addr add 192.168.2.2/32 dev dummy0
- local routing table
# ip route add to local 192.168.2.2/32 dev eth0 proto kernel
- interface alias
# ifconfig eth0:1 192.168.2.2 netmask 255.255.255.255 up
2. Create the IPVS Virtual Server
# ipvsadm -A -t 192.168.60.200:80 -s rr -p 600   # -p: session persistence for 600s
3. Create the IPVS Real Servers
# ipvsadm -a -t 192.168.60.200:80 -r 172.17.1.2:80 -m
# ipvsadm -a -t 192.168.60.200:80 -r 172.17.2.3:80 -m

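iptables vs. IPVS
iptables
√Flexible and powerful
√Can operate on packets at every stage: prerouting, postrouting, forward, input, output
IPVS
√Better performance (hash vs. chain)
√More load-balancing algorithms
- rr, wrr, lc, wlc, ip hash...
√Connection persistence
- Connections stay up while an IPVS service is being updated
√Kernel modules must be loaded in advance
- nf_conntrack_ipv4, ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh...
# echo 1 >/proc/sys/net/ipv4/vs/conntrack    # enable; required for NAT mode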

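Why iptables is still needed

Because we go through one extra hop, the Service IP
Node IP -> Service IP (Gateway) -> Container

Client sends: (Node IP, Service IP), and expects the reply: (Service IP, Node IP)
But after one hop of IPVS forwarding, the packet addresses become (Node IP, Container)
Server replies: (Container, Node IP) → the source/destination of this packet differ from what the client expects, so it is dropped

So one SNAT (source NAT, i.e. masquerade) is required
(Node IP, Service IP) -> (IPVS director IP, Container)

This is also why IPVS NAT mode requires return packets to pass through the director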
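
The address mismatch can be traced with plain (src, dst) tuples. A minimal sketch; the four address labels are placeholders, and the one-entry conntrack dict is a stand-in for what the director remembers:

```python
NODE, SERVICE, CONTAINER, DIRECTOR = "node-ip", "service-ip", "container-ip", "director-ip"

def reply_of(packet):
    """A reply simply swaps source and destination."""
    src, dst = packet
    return (dst, src)

def dnat(packet, new_dst):
    return (packet[0], new_dst)

def snat(packet, new_src):
    return (new_src, packet[1])

request = (NODE, SERVICE)
expected_reply = reply_of(request)                 # what the client will accept

# DNAT only: the container answers the client directly.
seen_by_container = dnat(request, CONTAINER)
direct_reply = reply_of(seen_by_container)         # does not match expected_reply

# DNAT + SNAT: the container answers the director, which undoes both rewrites.
masqueraded = snat(seen_by_container, DIRECTOR)
conntrack = {reply_of(masqueraded): expected_reply}
restored_reply = conntrack[reply_of(masqueraded)]

print(direct_reply == expected_reply, restored_reply == expected_reply)
```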
Question: why Container A -> Cluster IP -> Container B?

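IPSet
ipset supports incremental add/delete/modify instead of iptables-style full updates, and it reduces O(N) iptables rules to O(1), so there is no need to worry about a long list of iptables rules...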

ipset create KUBE-LOOP-BACK hash:ip,port,ip
ipset add KUBE-LOOP-BACK 192.168.1.1,udp:53,192.168.1.1
ipset add KUBE-LOOP-BACK 192.168.1.2,tcp:80,192.168.1.2
...

iptables needs only one rule, matching the KUBE-LOOP-BACK set above
iptables -t nat -A POSTROUTING -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
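
The dst,dst,src flags say which packet field is compared with each element of a hash:ip,port,ip entry. A small sketch of that lookup, using a Python set in place of the kernel hash; the packet dict layout is my own:

```python
# Entries of the hash:ip,port,ip set created above.
KUBE_LOOP_BACK = {
    ("192.168.1.1", "udp:53", "192.168.1.1"),
    ("192.168.1.2", "tcp:80", "192.168.1.2"),
}

def matches_loop_back(packet):
    """--match-set KUBE-LOOP-BACK dst,dst,src: dst IP, dst port, src IP."""
    key = (packet["dst"],
           "%s:%d" % (packet["proto"], packet["dport"]),
           packet["src"])
    return key in KUBE_LOOP_BACK   # one hash lookup, however many entries exist

hairpin = {"src": "192.168.1.2", "dst": "192.168.1.2", "proto": "tcp", "dport": 80}
normal = {"src": "192.168.1.9", "dst": "192.168.1.2", "proto": "tcp", "dport": 80}
print(matches_loop_back(hairpin), matches_loop_back(normal))
```

Only the packet whose source and destination are the same endpoint matches, so only that traffic gets masqueraded.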

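ipset + IPVS + iptables

iptables rules generated for a Service (for reference; copied from another post whose original can no longer be found)

How cluster IP works

The caller is a Pod on some node. It calls the cluster IP 172.17.185.22, so the traffic goes through the OUTPUT chain.

What iptables has to do is recognize packets whose dst (destination IP) is 172.17.185.22 and rewrite dst to the IP address of one of the Endpoints (this service has 2 endpoints behind it).

We start from the OUTPUT chain: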

[root@10-42-8-102 ~]# iptables -t nat -nvL OUTPUT
Chain OUTPUT (policy ACCEPT 4 packets, 240 bytes)
 pkts bytes target     prot opt in     out     source               destination         
3424K  209M KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

It hands everything straight to the KUBE-SERVICES chain, so we continue there:

[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
pkts bytes target     prot opt in     out     source               destination     
0     0 KUBE-MARK-MASQ  tcp  --  *      *      !172.17.0.0/16        172.17.185.22        /* zhongce-v2-0/testapi-smzdm-com: cluster IP */ tcp dpt:809
0     0 KUBE-SVC-G3OM5DSD2HHDMN6U  tcp  --  *      *       0.0.0.0/0            172.17.185.22        /* zhongce-v2-0/testapi-smzdm-com: cluster IP */ tcp dpt:809
10   520 KUBE-FW-G3OM5DSD2HHDMN6U  tcp  --  *      *       0.0.0.0/0            10.42.162.216        /* zhongce-v2-0/testapi-smzdm-com: loadbalancer IP */ tcp dpt:809

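The output is trimmed to keep only the rules that belong to this Service.

The 3 rules above are evaluated in order:

Rule 1 matches traffic to the Cluster IP 172.17.185.22 whose source is outside 172.17.0.0/16 and jumps to the KUBE-MARK-MASQ chain, which just sets a MARK; more on this later.
Rule 2 matches traffic to the Cluster IP 172.17.185.22 and jumps to the KUBE-SVC-G3OM5DSD2HHDMN6U chain; more on this later.
Rule 3 matches traffic to the external LB IP 10.42.162.216 and jumps to the KUBE-FW-G3OM5DSD2HHDMN6U chain; more on this later.

The MARK step is not important yet, so we start with rule 2. Clearly it has to rewrite the dst IP of traffic sent to the Cluster IP to one specific Endpoint. Let's take a look: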

[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-SVC-G3OM5DSD2HHDMN6U
Chain KUBE-SVC-G3OM5DSD2HHDMN6U (3 references)
 pkts bytes target     prot opt in     out     source               destination         
   18   936 KUBE-SEP-JT2KW6YUTVPLLGV6  all  --  *      *       0.0.0.0/0            0.0.0.0/0            statistic mode random probability 0.50000000000
   21  1092 KUBE-SEP-VETLC6CJY2HOK3EL  all  --  *      *       0.0.0.0/0            0.0.0.0/0

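There are 2 rules. The first matches with probability 0.5 and the second catches whatever is left, so each chain gets half of the traffic; in other words, traffic is load-balanced across these 2 chains, and each chain holds the DNAT rule for its own endpoint: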

[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-SEP-JT2KW6YUTVPLLGV6
Chain KUBE-SEP-JT2KW6YUTVPLLGV6 (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-MARK-MASQ  all  --  *      *       10.42.147.255        0.0.0.0/0       
   26  1352 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp to:10.42.147.255:809
[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-SEP-VETLC6CJY2HOK3EL
Chain KUBE-SEP-VETLC6CJY2HOK3EL (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-MARK-MASQ  all  --  *      *       10.42.38.222         0.0.0.0/0           
    2   104 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp to:10.42.38.222:809

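This performs the load balancing on the caller's side, which is how Cluster IP works.

The local routing table then sends the traffic out through the eth0 LAN interface (UCloud uses a flat underlay network):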

[root@10-42-8-102 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.42.0.1       0.0.0.0         UG    0      0        0 eth0
10.42.0.0       0.0.0.0         255.255.0.0     U     0      0        0 eth0

Before leaving the host, the traffic passes through the POSTROUTING chain:

[root@10-42-8-102 ~]# iptables -t nat -nvL POSTROUTING
Chain POSTROUTING (policy ACCEPT 274 packets, 17340 bytes)
 pkts bytes target     prot opt in     out     source               destination         
 632M   36G KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
 
[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination         
  526 27352 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000

It jumps straight to KUBE-POSTROUTING, which matches traffic carrying the 0x4000 MARK and applies SNAT to it. That MARK is exactly what KUBE-MARK-MASQ, skipped earlier, sets:

[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-MARK-MASQ
Chain KUBE-MARK-MASQ (183 references)
 pkts bytes target     prot opt in     out     source               destination         
  492 25604 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

In other words, when marked traffic leaves the host, its src IP is rewritten to the node's IP instead of the IP of the Pod that sent it.
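
The two mark expressions in the output above are plain bit operations. A minimal sketch (the function names are mine):

```python
MASQ_MARK = 0x4000

def mark_masq(mark):
    """KUBE-MARK-MASQ: `MARK or 0x4000` sets the bit and leaves other bits alone."""
    return mark | MASQ_MARK

def needs_masquerade(mark):
    """KUBE-POSTROUTING: `mark match 0x4000/0x4000` tests only that bit."""
    return (mark & MASQ_MARK) == MASQ_MARK

print(hex(mark_masq(0x1)))
print(needs_masquerade(mark_masq(0x1)), needs_masquerade(0x8000))
```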

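One chain is still left, KUBE-FW-G3OM5DSD2HHDMN6U: what happens to traffic sent from this host to the LB IP?

It is also sent directly to a specific Endpoint instead of really going to the LB, which gives the lowest latency: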

[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-FW-G3OM5DSD2HHDMN6U
Chain KUBE-FW-G3OM5DSD2HHDMN6U (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    2   104 KUBE-MARK-MASQ  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* zhongce-v2-0/testapi-smzdm-com: loadbalancer IP */
    2   104 KUBE-SVC-G3OM5DSD2HHDMN6U  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* zhongce-v2-0/testapi-smzdm-com: loadbalancer IP */
    0     0 KUBE-MARK-DROP  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* zhongce-v2-0/testapi-smzdm-com: loadbalancer IP */

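Rule 2 here reuses the load-balancing chain we saw earlier.

How nodeport works

nodeport opens the same port on every node; anything outside the cluster that reaches that port on a node reaches the corresponding service.

So iptables now targets traffic whose dst IP is the node itself and whose dst port is the nodeport, and rewrites it directly to one of the service's endpoints.

This matching rule also hangs off the PREROUTING chain, specifically in the KUBE-SERVICES chain; I just did not paste it in the earlier output: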

[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination     
    0     0 KUBE-MARK-MASQ  tcp  --  *      *      !172.17.0.0/16        172.17.185.22        /* zhongce-v2-0/testapi-smzdm-com: cluster IP */ tcp dpt:809
    0     0 KUBE-SVC-G3OM5DSD2HHDMN6U  tcp  --  *      *       0.0.0.0/0            172.17.185.22        /* zhongce-v2-0/testapi-smzdm-com: cluster IP */ tcp dpt:809
   10   520 KUBE-FW-G3OM5DSD2HHDMN6U  tcp  --  *      *       0.0.0.0/0            10.42.162.216        /* zhongce-v2-0/testapi-smzdm-com: loadbalancer IP */ tcp dpt:809
    0     0 KUBE-NODEPORTS  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

The nodeport rule sits at the end: whenever the dst IP is a local IP of the node (--dst-type LOCAL), the packet jumps to KUBE-NODEPORTS for further checks:

[root@10-42-8-102 ~]# iptables -t nat -nvL KUBE-NODEPORTS
Chain KUBE-NODEPORTS (1 references)
 pkts bytes target     prot opt in     out     source               destination        
    0     0 KUBE-MARK-MASQ  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* zhongce-v2-0/testapi-smzdm-com: */ tcp dpt:39746
    0     0 KUBE-SVC-G3OM5DSD2HHDMN6U  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* zhongce-v2-0/testapi-smzdm-com: */ tcp dpt:39746

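I kept only the 2 rules related to this service:

Rule 1: if the dst port is 39746, set the mark.
Rule 2: if the dst port is 39746, jump to the load-balancing chain for the DNAT rewrite.
Otherwise nothing is done and the traffic is handed on to the following chains.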
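
Put together, the nodeport decision looks roughly like this. It is a sketch of the logic only; the set of local addresses is an assumption (10.42.8.102 is guessed from the shell prompt hostname) and TCP is taken for granted:

```python
NODE_LOCAL_IPS = {"10.42.8.102", "127.0.0.1"}   # assumption: addresses local to this node
NODEPORTS = {39746: "KUBE-SVC-G3OM5DSD2HHDMN6U"}

def kube_nodeports(dst_ip, dport):
    """Return (masq_marked, next_chain) for a packet reaching the end of KUBE-SERVICES."""
    if dst_ip not in NODE_LOCAL_IPS:      # ADDRTYPE match dst-type LOCAL
        return False, None
    chain = NODEPORTS.get(dport)
    if chain is None:                     # no nodeport rule: later chains decide
        return False, None
    return True, chain                    # rule 1 marks, rule 2 jumps to the SVC chain

print(kube_nodeports("10.42.8.102", 39746))
print(kube_nodeports("10.42.8.102", 22))
```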

Routing table

[root@10-42-8-102 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.42.0.1       0.0.0.0         UG    0      0        0 eth0
10.42.0.0       0.0.0.0         255.255.0.0     U     0      0        0 eth0
10.42.2.81      0.0.0.0         255.255.255.255 UH    0      0        0 vethc739b9e3
10.42.8.62      0.0.0.0         255.255.255.255 UH    0      0        0 veth1e24e60d
10.42.24.222    0.0.0.0         255.255.255.255 UH    0      0        0 veth95162021

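Because the UCloud CNI plugin is used, the network runs in underlay mode and Node and Pod IP ranges are directly routable to each other.

When the called node receives the traffic, the PREROUTING chain does nothing to it; the routing table then determines which virtual interface to forward to, and the traffic is FORWARDed into the specific docker container.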
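
How the routing table picks the veth interface can be sketched with a longest-prefix match. The routes are copied from the `route -n` output above, rewritten in CIDR form; the lookup function is my own simplification and ignores metrics:

```python
import ipaddress

# (destination in CIDR form, gateway, interface) from the routing table above.
ROUTES = [
    ("0.0.0.0/0", "10.42.0.1", "eth0"),
    ("10.42.0.0/16", None, "eth0"),
    ("10.42.2.81/32", None, "vethc739b9e3"),
    ("10.42.8.62/32", None, "veth1e24e60d"),
    ("10.42.24.222/32", None, "veth95162021"),
]

def lookup(dst):
    """Longest-prefix match: the most specific route containing dst wins."""
    addr = ipaddress.ip_address(dst)
    candidates = [(ipaddress.ip_network(net), gw, dev) for net, gw, dev in ROUTES]
    matching = [c for c in candidates if addr in c[0]]
    net, gw, dev = max(matching, key=lambda c: c[0].prefixlen)
    return gw, dev

print(lookup("10.42.8.62"))      # local Pod: delivered through its veth
print(lookup("10.42.147.255"))   # Pod on another node: out through eth0
print(lookup("8.8.8.8"))         # everything else: default gateway
```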

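That concludes the analysis of K8S Services. The point I find most important: calling the LB IP from inside the K8S cluster does not actually send traffic to the LB; it still goes straight through the in-cluster load balancing.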