DNS resolution problem on the node hosting the pod

Symptom

A pod on one particular node reported an error when communicating with external hosts. The error was:

ERROR: [generic] Unable to download webpage: <urlopen error [Errno -3] Temporary failure in name resolution> (caused by URLError(gaierror(-3, 'Temporary failure in name resolution')))

In other words, the failure occurred during DNS resolution.
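
The error text comes from Python's resolver call: on Linux, `gaierror(-3, ...)` corresponds to `socket.EAI_AGAIN`, meaning the resolver got no usable answer from the DNS server. A minimal sketch for reproducing the check from inside a pod, assuming Python is available in the container; the function name `try_resolve` and the default port are illustrative and not part of the original setup.

```python
import socket


def try_resolve(hostname, port=443):
    """Return (ok, detail): resolved addresses on success, error text on failure."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # EAI_AGAIN is the "Temporary failure in name resolution" case
        # reported in the error above; other codes are labeled "other".
        kind = "temporary" if exc.errno == socket.EAI_AGAIN else "other"
        return False, "gaierror %s (%s): %s" % (exc.errno, kind, exc.strerror)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return True, ", ".join(addresses)
```

Calling `try_resolve("baidu.com")` or `try_resolve("postgresql-hl.cloud")` from a pod on the affected node and from a pod on a healthy node gives a side-by-side comparison without depending on the application itself.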

Troubleshooting

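I started a busybox pod on the problem node and tried to ping headless services running on other nodes. The pings failed, so the initial conclusion was that DNS resolution on this node was broken.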

root@tencent-beijing-master:~# kubectl exec -it busybox-vj29p -- sh
/ # ping postgresql-hl.cloud
ping: bad address 'postgresql-hl.cloud'
/ # ping mysql-primary-headless
^C
/ # ping mysql-primary-headless.cloud
ping: bad address 'mysql-primary-headless.cloud'
/ # exit
/ # ping mysql-primary-headless.cloud
^C
/ # ping baidu.com
^C
/ # 

From a busybox pod on another node, the same pings succeeded:

root@tencent-beijing-master:~# kubectl exec -it busybox-4zp7g -- sh
/ # ping postgresql-hl.cloud
PING postgresql-hl.cloud (100.93.145.75): 56 data bytes
64 bytes from 100.93.145.75: seq=0 ttl=62 time=3.333 ms
64 bytes from 100.93.145.75: seq=1 ttl=62 time=3.185 ms
64 bytes from 100.93.145.75: seq=2 ttl=62 time=3.247 ms
64 bytes from 100.93.145.75: seq=3 ttl=62 time=3.247 ms
^C
--- postgresql-hl.cloud ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 3.185/3.253/3.333 ms
/ # 
/ # ping baidu.com
PING baidu.com (110.242.68.66): 56 data bytes
64 bytes from 110.242.68.66: seq=0 ttl=250 time=12.091 ms
64 bytes from 110.242.68.66: seq=1 ttl=250 time=12.163 ms
64 bytes from 110.242.68.66: seq=2 ttl=250 time=12.187 ms
64 bytes from 110.242.68.66: seq=3 ttl=250 time=12.153 ms
^C
--- baidu.com ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 12.091/12.148/12.187 ms
/ # 

The pod on the problem node also could not ping pod IPs inside the cluster:

root@tencent-beijing-master:~# kubectl exec -it busybox-vj29p -- sh
/ # ping 100.96.12.201
PING 100.96.12.201 (100.96.12.201): 56 data bytes
^C
--- 100.96.12.201 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
/ # exit
command terminated with exit code 1
root@tencent-beijing-master:~# 

From another node, the same pod IP was reachable:

root@tencent-beijing-master:~# kubectl exec -it busybox-4zp7g -- sh
/ # ping 100.96.12.201
PING 100.96.12.201 (100.96.12.201): 56 data bytes
64 bytes from 100.96.12.201: seq=0 ttl=63 time=0.141 ms
64 bytes from 100.96.12.201: seq=1 ttl=63 time=0.068 ms
64 bytes from 100.96.12.201: seq=2 ttl=63 time=0.078 ms
64 bytes from 100.96.12.201: seq=3 ttl=63 time=0.070 ms
^C
--- 100.96.12.201 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.068/0.089/0.141 ms
/ # exit

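Since even traffic to a raw pod IP failed, I suspected that the CNI network on that node was at fault, rather than DNS alone.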
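
The reasoning here separates "the name does not resolve" from "the address is not reachable". A rough sketch of that triage follows, under some assumptions: the original test used ICMP `ping`, whereas this sketch swaps in a TCP connect (which needs no raw-socket privileges), and the helper names `tcp_reachable` and `diagnose` are hypothetical. Any port passed to `tcp_reachable` must be one the target pod actually listens on.

```python
import socket


def tcp_reachable(ip, port, timeout=2.0):
    """Return True if a TCP connection to ip:port can be opened within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def diagnose(name_resolves, ip_reachable):
    """Classify the failure based on the two independent checks."""
    if not ip_reachable:
        # DNS queries travel over the same pod network, so a broken
        # pod network also explains the resolution failure.
        return "network"
    if not name_resolves:
        return "dns"
    return "healthy"
```

In the situation described above, both checks fail on the affected node, so `diagnose(False, False)` points at the pod network rather than at the DNS service.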

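Next, I captured packets on the cali interface used by the pod. Because this pod records streamers' live broadcasts, it continuously sends DNS queries for the stream addresses. To rule out that interference, I started busybox on every node via a DaemonSet and used those pods for troubleshooting.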

root@ali-qingdao-worker07:~# tcpdump -i cali870e7b4d1d0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on cali870e7b4d1d0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:57:56.008153 ARP, Request who-has 169.254.1.1 tell 100.78.15.43, length 28
17:57:56.008184 ARP, Reply 169.254.1.1 is-at ee:ee:ee:ee:ee:ee (oui Unknown), length 28
17:57:56.310346 IP 100.78.15.43.47826 > 10.96.0.10.domain: 12109+ A? www.panda.tv.lab.svc.cluster.local. (52)
17:57:56.317791 IP 100.78.15.43.47826 > 10.96.0.10.domain: 4674+ AAAA? www.panda.tv.lab.svc.cluster.local. (52)
17:58:01.322897 IP 100.78.15.43.47826 > 10.96.0.10.domain: 12109+ A? www.panda.tv.lab.svc.cluster.local. (52)
17:58:01.322993 IP 100.78.15.43.47826 > 10.96.0.10.domain: 4674+ AAAA? www.panda.tv.lab.svc.cluster.local. (52)
17:58:06.329634 IP 100.78.15.43.41026 > 10.96.0.10.domain: 26791+ A? www.panda.tv. (30)
17:58:06.329878 IP 100.78.15.43.41026 > 10.96.0.10.domain: 64928+ AAAA? www.panda.tv. (30)
17:58:11.333747 IP 100.78.15.43.41026 > 10.96.0.10.domain: 26791+ A? www.panda.tv. (30)
17:58:11.333838 IP 100.78.15.43.41026 > 10.96.0.10.domain: 64928+ AAAA? www.panda.tv. (30)
17:58:15.935424 IP 100.78.15.43.35886 > 10.96.0.10.domain: 2808+ A? www.panda.tv.lab.svc.cluster.local. (52)
17:58:15.935596 IP 100.78.15.43.35886 > 10.96.0.10.domain: 40435+ AAAA? www.panda.tv.lab.svc.cluster.local. (52)
17:58:15.935803 IP 100.78.15.43.56780 > 10.96.0.10.domain: 58556+ A? gql.twitch.tv.lab.svc.cluster.local. (53)
17:58:15.935883 IP 100.78.15.43.56780 > 10.96.0.10.domain: 26558+ AAAA? gql.twitch.tv.lab.svc.cluster.local. (53)
17:58:16.605852 IP 100.78.15.43.37809 > 10.96.0.10.domain: 7905+ A? www.panda.tv.lab.svc.cluster.local. (52)
17:58:16.606024 IP 100.78.15.43.37809 > 10.96.0.10.domain: 28652+ AAAA? www.panda.tv.lab.svc.cluster.local. (52)
17:58:20.940192 IP 100.78.15.43.35886 > 10.96.0.10.domain: 2808+ A? www.panda.tv.lab.svc.cluster.local. (52)
17:58:20.940323 IP 100.78.15.43.35886 > 10.96.0.10.domain: 40435+ AAAA? www.panda.tv.lab.svc.cluster.local. (52)
17:58:20.940925 IP 100.78.15.43.56780 > 10.96.0.10.domain: 58556+ A? gql.twitch.tv.lab.svc.cluster.local. (53)
17:58:20.940967 IP 100.78.15.43.56780 > 10.96.0.10.domain: 26558+ AAAA? gql.twitch.tv.lab.svc.cluster.local. (53)
17:58:21.611045 IP 100.78.15.43.37809 > 10.96.0.10.domain: 7905+ A? www.panda.tv.lab.svc.cluster.local. (52)
17:58:21.611190 IP 100.78.15.43.37809 > 10.96.0.10.domain: 28652+ AAAA? www.panda.tv.lab.svc.cluster.local. (52)
17:58:25.945428 IP 100.78.15.43.49270 > 10.96.0.10.domain: 64261+ A? gql.twitch.tv. (31)
17:58:25.945428 IP 100.78.15.43.44690 > 10.96.0.10.domain: 45921+ A? www.panda.tv. (30)
17:58:25.945597 IP 100.78.15.43.44690 > 10.96.0.10.domain: 64868+ AAAA? www.panda.tv. (30)
17:58:25.945633 IP 100.78.15.43.49270 > 10.96.0.10.domain: 41595+ AAAA? gql.twitch.tv. (31)
17:58:25.960148 ARP, Request who-has 169.254.1.1 tell 100.78.15.43, length 28
17:58:25.960193 ARP, Reply 169.254.1.1 is-at ee:ee:ee:ee:ee:ee (oui Unknown), length 28
17:58:26.616290 IP 100.78.15.43.59321 > 10.96.0.10.domain: 11285+ A? www.panda.tv. (30)
17:58:26.616504 IP 100.78.15.43.59321 > 10.96.0.10.domain: 52252+ AAAA? www.panda.tv. (30)
17:58:30.946429 IP 100.78.15.43.49270 > 10.96.0.10.domain: 64261+ A? gql.twitch.tv. (31)
17:58:30.946597 IP 100.78.15.43.49270 > 10.96.0.10.domain: 41595+ AAAA? gql.twitch.tv. (31)
17:58:30.950330 IP 100.78.15.43.44690 > 10.96.0.10.domain: 45921+ A? www.panda.tv. (30)
17:58:30.950419 IP 100.78.15.43.44690 > 10.96.0.10.domain: 64868+ AAAA? www.panda.tv. (30)
17:58:31.621315 IP 100.78.15.43.59321 > 10.96.0.10.domain: 11285+ A? www.panda.tv. (30)
17:58:31.621426 IP 100.78.15.43.59321 > 10.96.0.10.domain: 52252+ AAAA? www.panda.tv. (30)
17:58:36.303612 IP 100.78.15.43.57449 > 10.96.0.10.domain: 59787+ A? www.panda.tv.lab.svc.cluster.local. (52)
17:58:36.303777 IP 100.78.15.43.57449 > 10.96.0.10.domain: 20110+ AAAA? www.panda.tv.lab.svc.cluster.local. (52)
17:58:41.311159 IP 100.78.15.43.57449 > 10.96.0.10.domain: 59787+ A? www.panda.tv.lab.svc.cluster.local. (52)
17:58:41.311346 IP 100.78.15.43.57449 > 10.96.0.10.domain: 20110+ AAAA? www.panda.tv.lab.svc.cluster.local. (52)
17:58:46.320251 IP 100.78.15.43.49610 > 10.96.0.10.domain: 29691+ A? www.panda.tv. (30)
17:58:46.320457 IP 100.78.15.43.49610 > 10.96.0.10.domain: 35827+ AAAA? www.panda.tv. (30)
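
In the capture above, every DNS line is a query sent to 10.96.0.10, each one is repeated about 5 seconds later with the same ID, and no reply line appears. The following sketch automates that observation over saved `tcpdump` text. It assumes the line format shown in the capture for queries, and assumes replies look like `IP <server>.domain > <client>.<port>: <id> ...`; the function name `unanswered_queries` is illustrative.

```python
import re

# Query line, e.g. "IP 100.78.15.43.47826 > 10.96.0.10.domain: 12109+ A? name. (52)"
QUERY = re.compile(r"IP (\S+) > (\S+)\.domain: (\d+)\+? (A|AAAA)\? (\S+)")
# Reply line (assumed format), e.g. "IP 10.96.0.10.domain > 100.78.15.43.47826: 12109 ..."
REPLY = re.compile(r"IP (\S+)\.domain > (\S+): (\d+)")


def unanswered_queries(lines):
    """Return sorted (qtype, name) pairs for queries that never got a reply."""
    pending = {}
    for line in lines:
        query = QUERY.search(line)
        if query:
            client, _server, qid, qtype, name = query.groups()
            # Retries reuse the same client port and ID, so they map to one key.
            pending.setdefault((client, qid), (qtype, name))
            continue
        reply = REPLY.search(line)
        if reply:
            _server, client, qid = reply.groups()
            pending.pop((client, qid), None)
    return sorted(set(pending.values()))
```

Feeding the capture above into `unanswered_queries` would list every queried name, since none of the queries is matched by a reply; on a healthy node the result should be empty or close to it.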

Resolution

I restarted the Calico (CNI plugin) DaemonSet instance on the problem node, after which the pod could ping successfully.
