periodSeconds
和timeoutSeconds
kubernetes probes的配置中有两个容易混淆的参数,periodSeconds
和timeoutSeconds
,其配置方式如下:
apiVersion: v1
kind: Pod
metadata:
name: darwin-app
spec:
containers:
- name: darwin-container
image: darwin-image
livenessProbe:
httpGet:
path: /darwin-path
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
官方对这两个参数的解释如下:
periodSeconds
: How often (in seconds) to perform the probe. Default to 10 seconds. The minimum value is 1.timeoutSeconds
: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
意思是说periodSeconds
表示执行探针的周期,而timeoutSeconds
表示执行探针的超时时间。
网上有不少针对这两个参数的讨论(如下),其中涉及到一个问题,如果timeoutSeconds
> periodSeconds
会怎么样?
其中在上面的第3篇中对timeoutSeconds
>periodSeconds
的情况有如下描述,即在这种情况下,如果探针超时,则探针周期等于timeoutSeconds
。那么这种说法是否正确呢?
If you had the opposite (
timeoutSeconds=10
,periodSeconds=5
), then the probes would look as follows:0s: liveness probe initiated 10s: liveness probe times out 10s: liveness probe initiated again
鉴于网上众说纷纭,我们通过源码来一探究竟。
kubernetes的探针机制是由kubelet执行的,目前支持exec
、grpc
、httpGet
、tcpSocket
这4种探针方式。
探针的代码逻辑并不复杂,以v1.30.2的代码为例,其入口函数如下,可以看到它会启动一个周期为w.spec.PeriodSeconds
(即探针中定义的periodSeconds
)定时器,周期性地执行探针。
// run periodically probes the container.
func (w *worker) run() {
ctx := context.Background()
probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second
...
probeTicker := time.NewTicker(probeTickerPeriod)
...
probeLoop:
for w.doProbe(ctx) {
// Wait for next probe tick.
select {
case <-w.stopCh:
break probeLoop
case <-probeTicker.C:
case <-w.manualTriggerCh:
// continue
}
}
}
现在已经找到periodSeconds
的用途,下一步需要找到timeoutSeconds
。
首先进入doProbe
函数,它调用了w.probeManager.prober.probe
:
// doProbe probes the container once and records the result.
// Returns whether the worker should continue.
func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {
...
// Note, exec probe does NOT have access to pod environment variables or downward API
result, err := w.probeManager.prober.probe(ctx, w.probeType, w.pod, status, w.container, w.containerID)
if err != nil {
// Prober error, throw away the result.
return true
}
...
}
下面的probe
函数用于执行一个特定的探针。需要注意的是,它调用了pb.runProbeWithRetries
,其中maxProbeRetries
值为3,说明在一个周期(periodSeconds
)中最多可以执行3次探针命令:
// probe probes the container.
func (pb *prober) probe(ctx context.Context, probeType probeType, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (results.Result, error) {
var probeSpec *v1.Probe
switch probeType {
case readiness:
probeSpec = container.ReadinessProbe
case liveness:
probeSpec = container.LivenessProbe
case startup:
probeSpec = container.StartupProbe
default:
return results.Failure, fmt.Errorf("unknown probe type: %q", probeType)
}
...
result, output, err := pb.runProbeWithRetries(ctx, probeType, probeSpec, pod, status, container, containerID, maxProbeRetries)
...
}
runProbeWithRetries
的注释说明,可能会执行多次探针,直到探针返回成功或全部尝试失败:
// runProbeWithRetries tries to probe the container in a finite loop, it returns the last result
// if it never succeeds.
func (pb *prober) runProbeWithRetries(ctx context.Context, probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID, retries int) (probe.Result, string, error) {
...
for i := 0; i < retries; i++ {
result, output, err = pb.runProbe(ctx, probeType, p, pod, status, container, containerID)
...
}
...
}
在runProbe
函数中,最终找到了timeoutSeconds
对应的参数p.TimeoutSeconds
,其作为各个探针命令的超时参数,如在httpGet
类型的探针中,它作为了httpClient
的请求超时时间:
func (pb *prober) runProbe(ctx context.Context, probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (probe.Result, string, error) {
timeout := time.Duration(p.TimeoutSeconds) * time.Second
if p.Exec != nil {
command := kubecontainer.ExpandContainerCommandOnlyStatic(p.Exec.Command, container.Env)
return pb.exec.Probe(pb.newExecInContainer(ctx, container, containerID, command, timeout))
}
if p.HTTPGet != nil {
req, err := httpprobe.NewRequestForHTTPGetAction(p.HTTPGet, &container, status.PodIP, "probe")
...
return pb.http.Probe(req, timeout)
}
if p.TCPSocket != nil {
port, err := probe.ResolveContainerPort(p.TCPSocket.Port, &container)
...
host := p.TCPSocket.Host
if host == "" {
host = status.PodIP
}
return pb.tcp.Probe(host, port, timeout)
}
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.GRPCContainerProbe) && p.GRPC != nil {
host := status.PodIP
service := ""
if p.GRPC.Service != nil {
service = *p.GRPC.Service
}
return pb.grpc.Probe(host, service, int(p.GRPC.Port), timeout)
}
...
}
至此我们可以拼接出periodSeconds
和timeoutSeconds
的关系,其逻辑关系与如下代码类似。
probeTicker := time.NewTicker(periodSeconds)
for {
select {
case <-probeTicker.C:
for i := 0; i < 3; i++ {
if ok:=probe(timeoutSeconds);ok{
return
}
}
}
periodSeconds
用于启动一个周期性调用探针命令的定时器,而timeoutSeconds
作为探针命令的超时参数timeoutSeconds
和periodSeconds
之间并没有明确的关系。如果timeoutSeconds
=10s,periodSeconds
=5s,则本次探针周期可能为[5s, 30s)之内的任意值,并不是该文中说的periodSeconds=timeoutSeconds
(由于本文写于3年前,经查阅v1.19.10
版本代码,逻辑上与现有版本代码相同。)/hleathz
http endpoint是否可以访问等,因此建议将timeoutSeconds
设置为一个小于periodSeconds
的合理的值。failureThreshold/successThreshold
和maxProbeRetries
的关系maxProbeRetries
用于定义一次探针周期内探针命令执行的最大尝试次数;successThreshold
加1,反之failureThreshold
加1;