[Repost] Redis CPU profiling


https://redis.io/docs/management/optimization/cpu-profiling/

 

Performance engineering guide for on-CPU profiling and tracing

Filling the performance checklist

Redis is developed with a great emphasis on performance. We do our best with every release to make sure you'll experience a very stable and fast product.

Nevertheless, if you're finding room to improve the efficiency of Redis or are pursuing a performance regression investigation you will need a concise methodical way of monitoring and analyzing Redis performance.

To do so, you can rely on different methodologies (some better suited than others, depending on the class of issues/analysis we intend to make). A curated list of methodologies and their steps is enumerated by Brendan Gregg at the following link.

We recommend the Utilization Saturation and Errors (USE) Method for answering the question of what your bottleneck is. Check the following mapping between system resources, metrics, and tools for a practical deep dive: USE method.

Ensuring the CPU is your bottleneck

This guide assumes you've followed one of the above methodologies to perform a complete check of system health, and identified the bottleneck being the CPU. If you have identified that most of the time is spent blocked on I/O, locks, timers, paging/swapping, etc., this guide is not for you.

Build Prerequisites

For a proper On-CPU analysis, Redis (and any dynamically loaded library like Redis Modules) requires stack traces to be available to tracers, which you may need to fix first.

By default, Redis is compiled with the -O2 switch (which we intend to keep during profiling), meaning compiler optimizations are enabled. Many compilers omit the frame pointer as a runtime optimization (saving a register), thus breaking frame pointer-based stack walking. This makes the Redis executable faster, but at the same time it makes Redis (like any other program) harder to trace, potentially attributing on-CPU time to the last available frame pointer of a call stack that actually runs much deeper (but is impossible to trace).

It's important that you ensure that:

  • debug information is present: compile option -g
  • frame pointer register is present: -fno-omit-frame-pointer
  • we still run with optimizations to get an accurate representation of production run times, meaning we will keep: -O2

You can do this as follows within the Redis main repo:

$ make REDIS_CFLAGS="-g -fno-omit-frame-pointer"

A set of instruments to identify performance regressions and/or potential on-CPU performance improvements

This document focuses specifically on on-CPU resource bottleneck analysis, meaning we're interested in understanding where threads are spending CPU cycles while running on-CPU and, just as importantly, whether those cycles are effectively being used for computation or are stalled waiting (not blocked!) on memory I/O, cache misses, etc.

For that we will rely on toolkits (perf, bcc tools), and hardware specific PMCs (Performance Monitoring Counters), to proceed with:

  • Hotspot analysis (perf or bcc tools): to profile code execution and determine which functions are consuming the most time and thus are targets for optimization. We'll present two options to collect, report, and visualize hotspots, either with perf or with bcc/BPF tracing tools.

  • Call counts analysis: to count events including function calls, enabling us to correlate several calls/components at once, relying on bcc/BPF tracing tools.

  • Hardware event sampling: crucial for understanding CPU behavior, including memory I/O, stall cycles, and cache misses.

Tool prerequisites

The following steps rely on Linux perf_events (aka "perf"), bcc/BPF tracing tools, and Brendan Gregg's FlameGraph repo.

We assume beforehand you have:

  • Installed the perf tool on your system. Most Linux distributions will likely package this as a package related to the kernel. More information about the perf tool can be found at perf wiki.
  • Followed the install bcc/BPF instructions to install bcc toolkit on your machine.
  • Cloned Brendan Gregg's FlameGraph repo and made the stackcollapse-perf.pl, difffolded.pl, and flamegraph.pl files accessible, to generate the collapsed stack traces and flame graphs.

Hotspot analysis with perf or eBPF (stack traces sampling)

Profiling CPU usage by sampling stack traces at a timed interval is a fast and easy way to identify performance-critical code sections (hotspots).
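The sampling principle itself can be illustrated in-process. The Python sketch below is a toy sampler (all names are invented for illustration, and it is not a substitute for perf or bcc): it walks each thread's stack at a fixed frequency and frequency-counts the folded stacks, which is conceptually what perf and bcc's profile do to redis-server from outside the process, at far lower overhead.

```python
# Toy in-process stack sampler illustrating timed-interval profiling.
# Illustrative only: real profilers sample from outside the process.
import collections
import sys
import threading
import time

def sample_stacks(samples, hz, duration):
    """Sample every thread's Python stack `hz` times/sec for `duration` sec."""
    interval = 1.0 / hz
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for frame in sys._current_frames().values():
            # Walk frames outward and fold the stack into a single
            # "root;...;leaf" key, as a flame graph pipeline would.
            stack = []
            while frame is not None:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            samples[";".join(reversed(stack))] += 1
        time.sleep(interval)

def busy_work():
    # A made-up CPU-bound workload standing in for redis-server.
    t = time.monotonic()
    while time.monotonic() - t < 0.5:
        sum(i * i for i in range(1000))

counts = collections.Counter()
worker = threading.Thread(target=busy_work)
worker.start()
sample_stacks(counts, hz=200, duration=0.3)
worker.join()
for stack, n in counts.most_common(3):
    print(n, stack)
```

Hot code paths show up as the folded stacks with the highest sample counts, which is exactly the information a flame graph visualizes.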

Sampling stack traces using perf

To profile both user- and kernel-level stacks of redis-server for a specific length of time, for example 60 seconds, at a sampling frequency of 999 samples per second:

$ perf record -g --pid $(pgrep redis-server) -F 999 -- sleep 60

Displaying the recorded profile information using perf report

By default perf record will generate a perf.data file in the current working directory.

You can then report with a call-graph output (call chain, stack backtrace), with a minimum call graph inclusion threshold of 0.5%, with:

$ perf report -g "graph,0.5,caller"

See the perf report documentation for advanced filtering, sorting and aggregation capabilities.

Visualizing the recorded profile information using Flame Graphs

Flame graphs allow for a quick and accurate visualization of frequent code paths. They can be generated using Brendan Gregg's open-source programs on GitHub, which create interactive SVGs from folded stack files.

Specifically, for perf we need to convert the generated perf.data into its captured stacks and fold each stack into a single line. You can then render the on-CPU flame graph with:

$ perf script > redis.perf.stacks
$ stackcollapse-perf.pl redis.perf.stacks > redis.folded.stacks
$ flamegraph.pl redis.folded.stacks > redis.svg
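To make the folding step concrete, here is a minimal Python sketch of what stackcollapse-perf.pl does (a simplified illustration, not a replacement for the real script): each perf script sample lists its frames leaf-first, and folding reverses them into one root;...;leaf line keyed by how often that exact stack was seen. The sample input below is fabricated.

```python
# Simplified illustration of the stack-folding step behind
# stackcollapse-perf.pl; the input sample is fabricated.
import collections
import re

def collapse(perf_script_output):
    folded = collections.Counter()
    frames = []
    for line in perf_script_output.splitlines():
        m = re.match(r"\s+[0-9a-f]+ (\S+)", line)
        if m:
            frames.append(m.group(1))   # one frame of the current sample
        elif frames:                    # a blank/header line ends a sample
            folded[";".join(reversed(frames))] += 1
            frames = []
    if frames:
        folded[";".join(reversed(frames))] += 1
    return folded

sample = """\
redis-server   764 [000] 12.001: 1001001 cpu-clock:
\t    55a1b2 siphash (/usr/local/bin/redis-server)
\t    55a1c3 dictFind (/usr/local/bin/redis-server)
\t    55a1d4 lookupKey (/usr/local/bin/redis-server)

redis-server   764 [000] 12.002: 1001001 cpu-clock:
\t    55a1c3 dictFind (/usr/local/bin/redis-server)
\t    55a1d4 lookupKey (/usr/local/bin/redis-server)
"""
for stack, count in collapse(sample).items():
    print(stack, count)
```

Each output line ("lookupKey;dictFind;siphash 1") is exactly the folded-stack format that flamegraph.pl consumes.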

By default, perf script reads the perf.data file in the current working directory. See the perf script documentation for advanced usage.

See FlameGraph usage options for more advanced stack trace visualizations (like the differential one).

Archiving and sharing recorded profile information

To make analysis of the perf.data contents possible on a machine other than the one where collection happened, you need to export, along with the perf.data file, all object files with build-ids found in the record data file. This can easily be done with the help of the perf-archive.sh script:

$ perf-archive.sh perf.data

Then, on the machine where you need to run perf report, run:

$ tar xvf perf.data.tar.bz2 -C ~/.debug

Sampling stack traces using bcc/BPF's profile

Similarly to perf, as of Linux kernel 4.9, BPF-optimized profiling is fully available, with the promise of lower CPU overhead (stack traces are frequency-counted in kernel context) and lower disk I/O during profiling.

Furthermore, relying solely on bcc/BPF's profile tool removes the perf.data file and the intermediate steps entirely when stack-trace analysis is our main goal. You can use bcc's profile tool to output folded format directly, for flame graph generation:

$ /usr/share/bcc/tools/profile -F 999 -f --pid $(pgrep redis-server) --duration 60 > redis.folded.stacks

This removes any preprocessing, so we can render the on-CPU flame graph with a single command:

$ flamegraph.pl redis.folded.stacks > redis.svg


Call counts analysis with bcc/BPF

A function may consume significant CPU cycles either because its code is slow or because it's frequently called. To answer at what rate functions are being called, you can rely upon call counts analysis using BCC's funccount tool:

$ /usr/share/bcc/tools/funccount 'redis-server:(call*|*Read*|*Write*)' --pid $(pgrep redis-server) --duration 60
Tracing 64 functions for "redis-server:(call*|*Read*|*Write*)"... Hit Ctrl-C to end.

FUNC                                    COUNT
call                                      334
handleClientsWithPendingWrites            388
clientInstallWriteHandler                 388
postponeClientRead                        514
handleClientsWithPendingReadsUsingThreads      735
handleClientsWithPendingWritesUsingThreads      735
prepareClientToWrite                     1442
Detaching...

The above output shows that, while tracing, Redis's call() function was called 334 times, handleClientsWithPendingWrites() 388 times, and so on.
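As an illustration of what funccount measures, the Python sketch below counts function entries inside its own process via sys.setprofile (the function names here are invented stand-ins; the real funccount instead attaches kernel uprobes to redis-server's symbols, keeping overhead low):

```python
# Toy call-count tracer illustrating what funccount reports.
# The traced functions are made-up stand-ins for Redis internals.
import collections
import sys

def count_calls(fn, *args):
    """Run fn(*args) and count every Python function entry underneath it."""
    counts = collections.Counter()
    def tracer(frame, event, arg):
        if event == "call":
            counts[frame.f_code.co_name] += 1
    sys.setprofile(tracer)
    try:
        fn(*args)
    finally:
        sys.setprofile(None)
    return counts

def prepare_reply(item):
    return str(item)

def handle_clients(items):
    return [prepare_reply(i) for i in items]

calls = count_calls(handle_clients, range(5))
print(calls["handle_clients"], calls["prepare_reply"])
```

A function with a high count but little per-call work (like prepare_reply here, entered once per item) is a candidate for batching rather than micro-optimization.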

Hardware event counting with Performance Monitoring Counters (PMCs)

Many modern processors contain a performance monitoring unit (PMU) exposing Performance Monitoring Counters (PMCs). PMCs are crucial for understanding CPU behavior, including memory I/O, stall cycles, and cache misses, and provide low-level CPU performance statistics that aren't available anywhere else.

The design and functionality of a PMU is CPU-specific; you should assess your CPU's supported counters and features by using perf list.

To calculate the number of instructions per cycle, the number of micro-ops executed, the number of cycles during which no micro-ops were dispatched, and the number of stalled cycles on memory (including per-memory-type stalls), over a duration of 60 seconds, specifically for the Redis process:

$ perf stat -e "cpu-clock,cpu-cycles,instructions,uops_executed.core,uops_executed.stall_cycles,cache-references,cache-misses,cycle_activity.stalls_total,cycle_activity.stalls_mem_any,cycle_activity.stalls_l3_miss,cycle_activity.stalls_l2_miss,cycle_activity.stalls_l1d_miss" --pid $(pgrep redis-server) -- sleep 60

Performance counter stats for process id '3038':

  60046.411437      cpu-clock (msec)          #    1.001 CPUs utilized          
  168991975443      cpu-cycles                #    2.814 GHz                      (36.40%)
  388248178431      instructions              #    2.30  insn per cycle           (45.50%)
  443134227322      uops_executed.core        # 7379.862 M/sec                    (45.51%)
   30317116399      uops_executed.stall_cycles #  504.895 M/sec                    (45.51%)
     670821512      cache-references          #   11.172 M/sec                    (45.52%)
      23727619      cache-misses              #    3.537 % of all cache refs      (45.43%)
   30278479141      cycle_activity.stalls_total #  504.251 M/sec                    (36.33%)
   19981138777      cycle_activity.stalls_mem_any #  332.762 M/sec                    (36.33%)
     725708324      cycle_activity.stalls_l3_miss #   12.086 M/sec                    (36.33%)
    8487905659      cycle_activity.stalls_l2_miss #  141.356 M/sec                    (36.32%)
   10011909368      cycle_activity.stalls_l1d_miss #  166.736 M/sec                    (36.31%)

  60.002765665 seconds time elapsed
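The ratios perf prints can be recomputed directly from the raw counters, which is handy when post-processing counter output yourself. A quick sketch using the numbers from the run above:

```python
# Derive the ratios perf stat prints from its raw counters.
# Counter values copied from the sample run above.
cpu_cycles       = 168_991_975_443
instructions     = 388_248_178_431
cache_references = 670_821_512
cache_misses     = 23_727_619
stalls_total     = 30_278_479_141
stalls_mem_any   = 19_981_138_777

ipc = instructions / cpu_cycles                        # perf printed 2.30 insn per cycle
stalled_pct = 100.0 * stalls_total / cpu_cycles        # share of cycles with no uops dispatched
mem_stall_pct = 100.0 * stalls_mem_any / stalls_total  # how much of the stalling is memory-bound
miss_rate = 100.0 * cache_misses / cache_references    # perf printed 3.537 %

print(f"IPC: {ipc:.2f}")
print(f"stalled cycles: {stalled_pct:.1f}% (of which {mem_stall_pct:.1f}% on memory)")
print(f"cache miss rate: {miss_rate:.3f}%")
```

In this run roughly 18% of cycles dispatched no micro-ops, and about two thirds of those stalls were waiting on memory, so memory access (not raw compute) is the dominant drag on IPC.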

It's important to know that there are two very different ways in which PMCs can be used (counting and sampling); we've focused solely on PMC counting for the sake of this analysis. Brendan Gregg explains the distinction clearly at the following link.
