1. Environment

1.1 AWS EKS versions

AWS EKS cluster version: 1.29.6

Operating system: Amazon Linux

Kernel version: 5.10.220-209.869.amzn2.x86_64

Nginx image: nginx:stable-alpine3.19


1.2 Self-hosted data center Kubernetes versions

Data center Kubernetes version: 1.16.10

Operating system: CentOS 7

Kernel version: 4.18.16-1.el7.elrepo.x86_64

Nginx image: nginx:stable-alpine3.19


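2. Problem description

2.1 Symptoms on AWS

While migrating services to AWS EKS, we found that in the AWS environment the nginx image
fails when a reverse-proxy proxy_pass points to an internal domain name. Proxying to other
domains, for example www.baidu.com or www.google.com, works normally.

Test findings:
 1. proxy_pass to external public domains works.
 2. dig and nslookup inside the container resolve the internal domain correctly.
 3. Resolution tests against the cluster's node-local-dns and CoreDNS all succeed.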

2.2 nginx -t error output

 Error reported by the nginx -t configuration check:

# Configuration file
location /test/{
  proxy_pass http://dev-example.K8S.cloud/;
}
# Configuration check fails
nginx -t 
Error log: 2024/10/02 08:25:03 67#67: host not found in upstream "dev-example.K8S.cloud" /etc/nginx/conf.d/default.conf:16
[emerg] 1#1: host not found in upstream "dev-example.K8S.cloud" in /etc/nginx/conf.d/default.conf:16
nginx: configuration file /etc/nginx/nginx.conf test failed


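3. Locating the problem

3.1 Process of elimination

1. Check whether the container can resolve the configured internal domain (dig inside the pod resolves it correctly).

# Resolve the domain
dig dev-example.K8S.cloud 
Server: 169.254.20.10  # node-local-dns, the node-level DNS cache
Address: 169.254.20.10:53 
Non-authoritative answer:
Name: dev-example.K8S.cloud
Address: 10.10.1.1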

2. Check whether nginx also fails when proxying to an external domain (it does not: proxying to an external domain works).

# Configuration file
location /test/{
  proxy_pass http://www.baidu.com/;
}

# Configuration check passes
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

3. With the private domain configured again, the "host not found" error persists (nginx fails to resolve the internal domain).

# Configuration file
location /test/{
  proxy_pass http://dev-example.K8S.cloud/;
}
# Configuration check fails
nginx -t 
Error log: 2024/10/02 08:25:03 67#67: host not found in upstream "dev-example.K8S.cloud" /etc/nginx/conf.d/default.conf:16
[emerg] 1#1: host not found in upstream "dev-example.K8S.cloud" in /etc/nginx/conf.d/default.conf:16
nginx: configuration file /etc/nginx/nginx.conf test failed
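
A plain dig query asks only for the default record type (A), so the check in step 1 cannot reveal a failure that affects AAAA lookups alone. As a rough sketch, the helper below queries both record types separately and prints the response status of each. The function name check_record_types is ours, and the sketch assumes dig is installed in the container (on Alpine, typically via apk add bind-tools).

```shell
# Print the DNS response status for A and AAAA lookups of one hostname.
# Assumes `dig` is available; check_record_types is a hypothetical helper name.
check_record_types() {
  host="$1"
  for rtype in A AAAA; do
    # +noall +comments prints only dig's comment lines, which include "status: ..."
    status=$(dig "$host" "$rtype" +noall +comments |
      sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
    echo "$rtype ${status:-NO_RESPONSE}"
  done
}
```

Running check_record_types dev-example.K8S.cloud inside the pod prints one line per record type, which makes it easy to see whether only one of the two lookups is failing.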

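3.2 Detailed diagnosis (capturing the DNS queries)


1. Open a terminal in the pod, install tcpdump, and start capturing DNS traffic
# Install tcpdump inside the pod
apk add tcpdump  
# Capture DNS traffic
tcpdump -i any -n -s 0 port 53 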


2. Open a second terminal in the pod and run nginx -t
nginx -t 
Error log: 2024/10/02 08:25:03 67#67: host not found in upstream "dev-example.K8S.cloud" /etc/nginx/conf.d/default.conf:16
[emerg] 1#1: host not found in upstream "dev-example.K8S.cloud" in /etc/nginx/conf.d/default.conf:16
nginx: configuration file /etc/nginx/nginx.conf test failed

3. tcpdump capture for the failing internal domain (dev-example.K8S.cloud)
tcpdump -i any -n -s 0 port 53
02:16:00.696096 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28640+ A? dev-example.K8S.cloud. (35)
02:16:00.696124 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.696242 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28640* 1/0/0 A 10.147.51.85 (68)
02:16:00.830250 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail 0/0/0 (35)
02:16:00.830287 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.830359 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:00.830386 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.830432 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:00.830455 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.830499 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:00.830521 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.830556 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197116 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197228 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197263 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197345 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197373 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197412 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197453 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197502 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197525 eth0  Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197562 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 vv 0/0/0 (35)

4. Capture for a working external domain (www.baidu.com)
location /test/{
  proxy_pass http://www.baidu.com/;
}

# Configuration check passes
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
09:14:53.702737 eth0  Out IP 100.66.42.156.45529 > 169.254.20.10.53: 50229+ A? www.baidu.com. (31)
09:14:53.702769 eth0  Out IP 100.66.42.156.45529 > 169.254.20.10.53: 50648+ AAAA? www.baidu.com. (31)
09:14:53.704511 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.45529: 50229 4/0/0 CNAME www.a.shifen.com., CNAME www.wshifen.com., A 45.113.192.102, A 45.113.192.101 (181)
09:14:53.771089 eth0  In  IP 169.254.20.10.53 > 100.66.42.156.45529: 50648 2/1/0 CNAME www.a.shifen.com., CNAME www.wshifen.com. (207)

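5. Conclusion
From the captures above:
 1. For the internal domain, the A query is answered, but the AAAA (IPv6) query never receives a valid answer: every AAAA request returns ServFail and is retried.
 2. For the external domain, both queries are answered right away, returning the CNAME chain for the baidu domain.

 The stable-alpine3.19 image appears to retry the failing AAAA query for the internal domain several times, and that failure causes the lookup as a whole to fail even though the A (IPv4) record was returned. This is consistent with the musl libc resolver used by Alpine, which, as far as we can tell, treats a ServFail on either query as a failure of the entire lookup.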
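
To make the difference between the two captures easier to see, the tcpdump output can be reduced to a count of responses per record type. The following is only a sketch: summarize_dns_capture is a hypothetical helper, and it assumes the one-line tcpdump format shown in the captures above, pairing each response with its query by DNS transaction ID.

```shell
# Summarize tcpdump DNS output (read from stdin) as "<type> <status> <count>".
# summarize_dns_capture is a hypothetical helper name.
summarize_dns_capture() {
  awk '
    {
      # The destination address is the first field ending in ":"; the DNS
      # transaction ID and the query type or response status follow it.
      idx = 0
      for (i = 1; i <= NF; i++) if ($i ~ /:$/) { idx = i; break }
      if (!idx || NF < idx + 2) next
      id = $(idx + 1); gsub(/[^0-9]/, "", id)
      if (id == "") next
      what = $(idx + 2)
      if (what ~ /\?$/) {
        # Query line: remember the record type under its transaction ID.
        sub(/\?$/, "", what); qtype[id] = what
      } else {
        # Response line: classify by status and count per record type.
        t = (id in qtype) ? qtype[id] : "unknown"
        if (what ~ /^ServFail/) s = "ServFail"
        else if (what ~ /^NXDomain/) s = "NXDomain"
        else if (what ~ /^[0-9]+\/[0-9]+\/[0-9]+/) s = "answer"
        else s = "other"
        count[t " " s]++
      }
    }
    END { for (k in count) print k, count[k] }
  ' | LC_ALL=C sort
}
```

One way to use it is to save the tcpdump output to a file (for example capture.txt, a name chosen here for illustration) and run summarize_dns_capture < capture.txt.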
 
Fix: disable IPv6 support by setting kernel parameters for the container in the Kubernetes Deployment manifest.

3.3 Fixing the image's IPv6 problem in Kubernetes

initContainers:
- command:
  - sh
  - -c
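  # Disable IPv6 through kernel parameters in the pod's network namespace
  - sysctl -w net.ipv6.conf.all.disable_ipv6=1 && sysctl -w net.ipv6.conf.default.disable_ipv6=1
  image: busybox
  imagePullPolicy: Always
  name: init-update-containers
  # Assumption: writing these sysctls normally requires a privileged container
  securityContext:
    privileged: true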
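
To confirm that the init container took effect, one option is to read the sysctl value from inside the nginx container, which shares the pod's network namespace. A minimal sketch follows; the function name ipv6_disabled and its optional path argument are ours, added so the check can be tried against any file.

```shell
# Succeed (exit status 0) if the given sysctl file contains "1".
# ipv6_disabled is a hypothetical helper; $1 defaults to the real /proc entry.
ipv6_disabled() {
  sysctl_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
  [ -r "$sysctl_file" ] && [ "$(cat "$sysctl_file")" = "1" ]
}
```

For example: ipv6_disabled && echo "IPv6 disabled" || echo "IPv6 still enabled"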


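3.4 Alternative tested: switching to the standard official nginx image

nginx:latest (in our own tests this avoids the IPv6 problem and resolves the internal domain correctly). Note that nginx:latest is Debian-based and uses glibc rather than Alpine's musl libc.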


3.5 Using a Kubernetes ExternalName Service and pointing nginx at the Service address

apiVersion: v1
kind: Service
metadata:
  labels:
  name: dev-example-external
spec:
  externalName: dev-example.K8S.cloud
  type: ExternalName


4. Further analysis: tracing the system calls

strace -o nginx_trace.log -tt -T -e trace=all nginx -t 2>&1
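
Once the trace has been written, the DNS-related calls can be picked out by filtering on port 53. The sketch below assumes strace's usual rendering of socket addresses (sin_port=htons(53)); count_dns_syscalls is a hypothetical helper name.

```shell
# Count lines in an strace log that reference a socket address on port 53.
# count_dns_syscalls is a hypothetical helper; $1 is the path to the log file.
count_dns_syscalls() {
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
  grep -c -F 'htons(53)' "$1" || true
}
```

For example, count_dns_syscalls nginx_trace.log reports how many traced calls touched the DNS port, and grep -F 'htons(53)' nginx_trace.log lists them with their timestamps.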