https://zhuanlan.zhihu.com/p/156138704
我们都知道在 Kubernetes 集群中通常会使用 Ingress 方案来统一代理集群内部的流量,而常用的 Ingress 方案为 traefik 和 nginx,和传统的 Nginx 作为企业内部的反向代理以及负载设备一样,在不同的场景下可能需要专有的配置才能满足需求,否则会出现奇奇怪怪的异常状态码。本篇文章一起来看下,我们在 traefik
中遇到的 500 和 502 异常。
在开始之前,我们先来看几个在 Nginx 作为反向代理工具中经常遇到的几个异常状态码:
客户端超时
上游未返回数据或未按照协议约定返回数据
),网关感觉自己没用了,返回了网关错误。一般是后端服务器宕机
或业务逻辑超时
Nginx网关作为客户端时的超时
)# 499 的实际情况就是,客户端指定超时时间为N秒,但是该请求在服务端实际需要执行M秒(M>N秒),客户端等的不耐烦了就关闭了
# 对于499状态来讲,解决方式是优化后端代码逻辑或者修改nginx参数
$ cat nginx.conf
proxy_ignore_client_abort on;
$ curl -i -m 3 http://127.0.0.1/hello.php
# 502的实际情况通常是Nginx网关后端的服务器直接宕机了(所以就拿不到上游的响应了)
# 当然也有可能是上游服务器真正的执行逻辑超过了上游服务器的超时时间限制(比如php-fpm.conf设置request_terminate_timeout5s,但是实际的业务逻辑需要7s才能完成),此时上游服务器端出现`业务逻辑超时`,给Nginx网关返回了异常的数据造成的
# 502时后端的几种错误日志
recv() failed (104: Connection reset by peer) while reading response header from upstream
upstream prematurely closed connection while reading response header from upstream
connect() failed (111: Connection refused) while connecting to upstream
# 整体来说502出现的问题通常是因为后端挂了,或者因为后端负载太高,暂时不可响应
# 可以在nginx侧增加proxy_read_timeout来暂时缓解
$ cat nginx.conf
proxy_read_timeout 20s;
# 504的实际情况就是客户端->Nginx->Backend,在过程中Nginx需要作为客户端访问Backend服务,但是在Backend还没用执行完成时,Nginx首先超过了自己的客户端超时时间,此时就会出现504的异常(但是对于客户端来说返回什么呢?)
# 对于504场景而言,通常的做法就是优化Backend的逻辑,适当减少执行时间;另外也可以适当的增加Nginx作为客户端时的超时时间
# 要知道,当Nginx作为客户端时,是以一个Proxy的角色存在的,配置如下参数即可
$ cat nginx.conf
uwsgi_connect_timeout 5;
uwsgi_send_timeout 5;
uwsgi_read_timeout 5;
fastcgi_read_timeout 5;
fastcgi_send_timeout 5;
proxy_connect_timeout 90;
proxy_send_timeout 90;
proxy_read_timeout 90;
我们当前集群的 traefik 的配置如下:
# traefik的configmap配置文件
$ cat traefik-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: traefik-config
namespace: kube-system
data:
traefik.toml: |
defaultEntryPoints = ["http","https"]
debug = false
logLevel = "INFO"
InsecureSkipVerify = true
[entryPoints]
[entryPoints.http]
address = ":80"
compress = true
[entryPoints.https]
address = ":443"
[entryPoints.https.tls]
[web]
address = ":8080"
[kubernetes]
[metrics]
[metrics.prometheus]
buckets=[0.1,0.3,1.2,5.0]
entryPoint = "traefik"
[ping]
entryPoint = "http"
# traefik的DaemonSet配置
$ cat traefik-ds-v1.7.16.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: traefik-ingress-controller
namespace: kube-system
---
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
name: traefik-ingress-controller
namespace: kube-system
labels:
k8s-app: traefik-ingress-lb
spec:
template:
metadata:
labels:
k8s-app: traefik-ingress-lb
name: traefik-ingress-lb
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-role.kubernetes.io/master
operator: DoesNotExist
serviceAccountName: traefik-ingress-controller
terminationGracePeriodSeconds: 30
hostNetwork: true
containers:
- image: traefik:v1.7.16
name: traefik-ingress-lb
ports:
- name: http
containerPort: 80
hostPort: 80
- name: admin
containerPort: 8080
securityContext:
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
args:
- --api
- --kubernetes
- --logLevel=INFO
- --traefikLog.filePath=/logdata/traefik.log
- --configfile=/config/traefik.toml
- --accesslog.filepath=/logdata/access.log
- --accesslog.bufferingsize=100
volumeMounts:
- mountPath: /config
name: config
- mountPath: /logdata
name: access-log
volumes:
- configMap:
name: traefik-config
name: config
- name: access-log
hostPath:
path: /opt/logs/ingress/
---
kind: Service
apiVersion: v1
metadata:
name: traefik-ingress-service
namespace: kube-system
labels:
k8s-app: traefik-ingress-lb
spec:
selector:
k8s-app: traefik-ingress-lb
ports:
- protocol: TCP
port: 80
name: web
- protocol: TCP
port: 8080
name: admin
# 接口对外的ingress
$ kubectl get ingress -n s-data
NAME HOSTS ADDRESS PORTS AGE
data-api.bgbiao.cn data-api.bgbiao.cn 80 236d
ops.bgbiao.cn ops.bgbiao.cn 80 236d
# 测试对外接口
$ curl data-api.bgbiao.cn -i
HTTP/1.1 401 Unauthorized
Access-Control-Allow-Headers: Content-Type, X-TOKEN
Access-Control-Allow-Origin: *
Content-Length: 58
Content-Type: application/json
Vary: Accept-Encoding
Date: Sun, 28 Jun 2020 14:55:00 GMT
# 接口需要登录,那么我们对登录接口进行压测来模拟问题
$ curl -X POST --data '@/root/login.json' -H 'Content-type:application/json' http://data-api.bgbiao.cn/account/users/login/ -i
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Content-Type, X-TOKEN
Access-Control-Allow-Origin: *
Content-Length: 250
Content-Type: application/json
Vary: Accept-Encoding
Date: Sun, 28 Jun 2020 14:56:33 GMT
在服务部署完成后,一切皆正常,但是简单压测后发现服务出现部分请求失败。
# 使用ab工具进行压测
# 由压测结果可以发现,20个并发共压测200个请求,期间出现了7次失败请求
$ ab -c 20 -n 200 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/
...
Benchmarking data-api.bgbiao.cn (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software:
Server Hostname: data-api.bgbiao.cn
Server Port: 80
Document Path: /account/users/login/
Document Length: 250 bytes
Concurrency Level: 20
Time taken for tests: 1.340 seconds
Complete requests: 200
Failed requests: 7
(Connect: 0, Receive: 0, Length: 7, Exceptions: 0)
Write errors: 0
Non-2xx responses: 7
Total transferred: 91371 bytes
Total body sent: 46400
HTML transferred: 48387 bytes
Requests per second: 149.21 [#/sec] (mean)
Time per request: 134.035 [ms] (mean)
Time per request: 6.702 [ms] (mean, across all concurrent requests)
Transfer rate: 66.57 [Kbytes/sec] received
33.81 kb/s sent
100.38 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 1 0.1 1 1
Processing: 2 116 27.8 114 179
Waiting: 2 116 27.8 114 179
Total: 3 117 27.8 116 180
Percentage of the requests served within a certain time (ms)
50% 116
66% 121
75% 125
80% 129
90% 154
95% 167
98% 173
99% 175
100% 180 (longest request)
# 将压测结果保存到文本中进行简单分析
# 简单分析在200个压测请求中,有4个请求失败,分别为2次500错误和2次502错误
$ ab -v 10 -c 20 -n 200 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/ > ab-log.txt
$ cat ab-log.txt | grep HTTP | sort| uniq -c
196 HTTP/1.0 200 OK
2 HTTP/1.0 500 Internal Server Error
2 HTTP/1.0 502 Bad Gateway
1 POST /account/users/login/ HTTP/1.0
前面我们提到了在 Nginx 的场景中的 500 和 502 状态码的原因以及相关的解决方式,那么在 Kubernetes 集群中,traefik 起到的作用和 Nginx 的作用也是相似的。
在开始的时候,我们提到了集群内的 traefik 的配置信息,对于 SRE 来讲,任何生产的服务都必须有相关的可观测性数据,因此,我们也默认将 traefik 的访问日志和进程日志进行了持久化 (分别对应 access.log和traefik.log
),同时也暴露了 traefik 的 prometheus 的 metrics 接口。
对于上面的压测请求,我们在访问日志里抓到了如下的异常日志:
$ tail -f access.log | grep data-api.bgbiao.cn | grep -v 'HTTP/1.0" 200'
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267376 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267385 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267410 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267418 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267484 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122267518 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:39 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267550 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 4ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272696 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 2ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272711 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272836 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 0ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272837 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms
可以看到,和我们压测结果里失败请求的状态码相同,都是 500
和 502
,虽然数量上不止 4 个,但这暂时不太重要。
通常对于大多数人来讲,在代理层看到 500 或者 502 都会下意识的想,肯定是上游服务的问题,不过这种猜测也能很快进行排除,排除的方法基本如下:
经过上述三种方式的测试后,我们基本排除了上游服务的问题,因此正式怀疑 traefik 本身是否有性能或者其他参数上的问题。
修改 traefki 中的日志级别为 DEBUG
:
$ cat traefik-ds-v1.7.6.yaml
....
args:
- --api
- --kubernetes
- --logLevel=DEBUG
....
然后在日志中看到如下相关信息:
# 500相关的日志
time="2020-06-28T15:35:05Z" level=debug msg="'500 Internal Server Error' caused by: http: server closed idle connection"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 500, Length: 21, duration: 1.486276ms"
# 502相关的日志
time="2020-06-28T15:35:05Z" level=debug msg="'502 Bad Gateway' caused by: EOF"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 502, Length: 11, duration: 1.530677ms"
上面的 500 信息,大概可以看出来是 traefik 的服务进程主动关闭了空闲链接导致的,而下面的 502 是因为 EOF,感觉像是没有读取完响应数据,就被断开了,导致 traefik 返回 502。通常这种情况在 Nginx 中很常见,也很容易调整相关的配置参数 (文章开始有提到),但是 traefik 的部署模式以及参数调整,还是需要花心思去看相关文档的。
然后在 github 上去翻 traefik 相关的 issues,发现该问题曾经出现过很多次。
一个是 500 的 issues,一个是 502 的 issues,但通常这两个问题都是成双出现的。
在上面第一个 issue 中提到,traefik 在 http 的反向代理功能中默认开启了 http 的 keep-alive 功能,但是 python 的应用中未开启 http 的 keep-alive,因为我们上面的测试程序其实也是使用的 python 开发的,先对该参数进行调整。
# 在uwsgi.ini中增加keep-alive参数即可
$ cat uwsgi.ini
[uwsgi]
http = 0.0.0.0:8080
http-keepalive = 1
chdir = /opt/app/
wsgi-file = /opt/app/main.py
callable = app
stats = 0.0.0.0:8081
processes = 2
threads = 10
# 重启应用后,再次进行压测
# 之前出现的502和500错误基本都消失了
# 并发200,共1万个请求,失败的请求数为0,总耗时1min(qps才到160 )
$ ab -c 200 -n 10000 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/
....
....
Concurrency Level: 200
Time taken for tests: 59.323 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 4670000 bytes
Total body sent: 2320000
HTML transferred: 2500000 bytes
Requests per second: 168.57 [#/sec] (mean)
Time per request: 1186.454 [ms] (mean)
Time per request: 5.932 [ms] (mean, across all concurrent requests)
Transfer rate: 76.88 [Kbytes/sec] received
38.19 kb/s sent
115.07 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 3 38.6 1 1035
Processing: 101 942 1457.7 857 32684
Waiting: 101 942 1457.7 857 32684
Total: 102 945 1458.0 861 32685
# p99达到7.3s
Percentage of the requests served within a certain time (ms)
50% 861
66% 1033
75% 1136
80% 1191
90% 1886
95% 2281
98% 4209
99% 7399
100% 32685 (longest request)
通过对业务层的 http 的 keep-alive 参数的开启,来暂时解决了 500 和 502 的问题,那能否通过 traefik 层的参数来优化该问题呢,上面第二个 issue 中其实也提到了。
即,通过修改 traefik 的如下几个参数,并重新部署整个 traefik 集群即可:
# 关闭traefik的keep-alive参数,参数默认为200,如果参数为0,则使用go标准库中的DefaultMaxIdleConnsPerHost参数
# keep-alive主要是用来复用链接来减少open files的,但是对于大量的短连接来将这种链接复用就可能出现上述情况
--maxidleconnsperhost=-1
# 即通过设置重试次数,增加空闲链接的超时时间,增加转发响应的超时时间,默认是0次
--retry.attempts=10
# 该参数已经替换为--respondingtimeouts.idletimeout参数了,默认为3m0s
--idletimeout=60s
# 默认是0s
--forwardingtimeouts.responseheadertimeout=60s
# traefik 空闲链接超时
$ ./traefik --help | grep idletimeout
--idletimeout (Deprecated) maximum amount of time an idle (keep-alive) connection will remain (default "0s")
--respondingtimeouts.idletimeout IdleTimeout is the maximum amount duration an idle (keep-alive) connection will (default "3m0s")
# 响应超时相关参数
$ ./traefik --help | grep respondingtimeouts
--respondingtimeouts Timeouts for incoming requests to the Traefik instance (default "true")
--respondingtimeouts.idletimeout IdleTimeout is the maximum amount duration an idle (keep-alive) connection will (default "3m0s")
--respondingtimeouts.readtimeout ReadTimeout is the maximum duration for reading the entire request, including (default "0s")
--respondingtimeouts.writetimeout WriteTimeout is the maximum duration before timing out writes of the response. (default "0s")
# 转发的超时
$ ./traefik --help | grep forwardingtimeouts
--forwardingtimeouts Timeouts for requests forwarded to the backend servers (default "true")
--forwardingtimeouts.dialtimeout The amount of time to wait until a connection to a backend server can be (default "30s")
--forwardingtimeouts.responseheadertimeout The amount of time to wait for a server's response headers after fully writing (default "0s")
最终修改后的 traefik 的参数如下:
# 可以根据实际情况考虑是否要关闭keep-alive 即增加参数: --maxidleconnsperhost=-1
- --api
- --kubernetes
- --logLevel=INFO
- --traefikLog.filePath=/logdata/traefik.log
- --configfile=/config/traefik.toml
- --accesslog.filepath=/logdata/access.log
- --accesslog.bufferingsize=100
- --forwardingtimeouts.responseheadertimeout=60s
- --respondingtimeouts.idletimeout=180s
- --retry.attempts=10
- --idletimeout=180s
在发现域名直接解析到 traefik 节点上不再出现 502 和 500 后,我们将 traefik 的节点挂载到阿里云的内网 slb 中,但是又开始出现了诡异的 503 问题。
接入 slb 后的简单压测情况 (内网使用的是免费低配的 slb )
# client->ali-slb->traefik->pods
$ ab -v 10 -c 200 -n 2000 -T 'application/json' -p postfile.json http://data-api.soulapp.cn/get_doc > slb-log.txt
$ cat slb-log.txt | grep 'HTTP/1.1 200' | wc -l
1322
$ cat slb-log.txt | grep 'HTTP/1.1 503' | wc -l
678
# client->traefik->pods
$ ab -v 10 -c 200 -n 2000 -T 'application/json' -p postfile.json http://data-api.c.bgbiao.cn/get_doc > traefik-log.txt
Completed 200 requests
Completed 400 requests
Completed 600 requests
Completed 800 requests
Completed 1000 requests
Completed 1200 requests
Completed 1400 requests
Completed 1600 requests
Completed 1800 requests
Completed 2000 requests
Finished 2000 requests
$ cat traefik-log.txt | grep 'HTTP/1.0 200' | wc -l
2000
后来在阿里云文档中看到如下文档,基本上就是在阿里云的 SLB 侧对不同规格的 SLB 做了一定的限流策略,此时会向客户端返回 503。