磁盘老化。Disk just get slowdown time to time for no reason. The Tail at Store gives more in-depth analysis. Also, disks may degrade significantly when they get old.
超时。Failure tolerance and retry is a common design pattern in distributed systems. But one retry is enough to send current request to latency tail. Google SRE Book chapter 21 to 22 discuss it in detail, such as,
Reduce remaining timeout quota and pass it down each layer of the request processing chain.
Be aware of the chained retry amplification (layer1 3 retries, layer2 3*3 retries, …).
后台任务。Almost every services, from software to even hardware/firmware, have backgroud tasks. Background task may temporarily slowdown the world. The most notorious one is GC (garbage collection,垃圾回收).
超负载运行。The customer may be sending you too many/big requests, and upper layer throttling is not working well. Overprovisioned customer VMs may compete with each other resulting slow experience. Some small piece of data may be extremly hot, e.g. many OS images are forked from a small shared base. A large request may be pegging your CPU/network/disk, and make the others queuing up. Or something went wrong, as a dead loop stuck your cpu.
发送比必要更多的请求,只收集最快的返回,有助于减少尾部。Send 2 instread of 1. Send 11 instead of 10 (e.g. in erasure-coding 10 fragment reconstruct read). Send backup requests at 95% percentile latency.
金丝雀请求,,i.e. send normal requests but fallback to sending hedged requests if the canary did’t finish in reasonable time.