[Repost] C2C - False Sharing Detection in Linux Perf


https://joemario.github.io/blog/2016/09/01/c2c-blog/

 

Do you run your application in a NUMA environment? Is it multi-threaded? Is it multi-process with shared memory? If so, is your performance impacted by false sharing?

Now there’s a way to easily find out. We’re posting patches for a new feature to the Linux perf tool, called “c2c” for cache-2-cache.
We at Red Hat have been running the development prototype of c2c on lots of big Linux applications and it’s uncovered many hot false sharing cachelines.

I’ve been playing with this tool quite a bit. It is pretty cool. Let me share a little about what it is and how to use it.

At a high level, “perf c2c” will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
* The pid, tid, instruction addr, function name, binary object name for those readers and writers.
* The source file and line number for each reader and writer.
* The average load latency for the loads to those cachelines.
* Which numa nodes the samples for each cacheline came from and which cpus were involved.

Using perf c2c is similar to using the Linux perf tool today.
First collect data with “perf c2c record”, then generate a report output with “perf c2c report”.

Before covering the output data, here is a “how to” for the flags to use when calling “perf c2c”:
c2c usage flags

Then here’s an output file from a recent “perf c2c” run I did:
c2c output file

And, if you want to play with it yourself, here’s a simple source file to generate lots of false sharing.
False sharing .c src file
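
If you just want a quick self-contained reproducer, here is a minimal sketch (not the linked false_sharing_example.c, just an illustration) of the kind of program that produces heavy false sharing: two threads increment adjacent longs that land in the same 64-byte cacheline, so every store invalidates the other thread’s copy of the line.

/*
 * Minimal false-sharing sketch (illustrative only, not the linked
 * false_sharing_example.c). Two threads increment adjacent longs that
 * live in the same 64-byte cacheline.
 * Build: gcc -O2 -pthread false_share_demo.c
 */
#include <pthread.h>
#include <stdio.h>

static struct {
    long a;                /* written by thread 0                      */
    long b;                /* written by thread 1; same cacheline as a */
} line;

static void *writer(void *arg)
{
    volatile long *p = arg;            /* volatile keeps the stores in memory */
    for (long i = 0; i < 100000000L; i++)
        (*p)++;                        /* every increment contends for the line */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;

    pthread_create(&t0, NULL, writer, &line.a);
    pthread_create(&t1, NULL, writer, &line.b);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("a=%ld b=%ld\n", line.a, line.b);
    return 0;
}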

First I’ll go over the output file to highlight the interesting fields.

This first table in the output file gives a high level summary of all the load and store samples collected. It is interesting to see where your program’s load instructions got their data.
Notice the term “HITM”, which stands for a load that hit in a modified cacheline. That’s the key sign that false sharing has occurred. Remote HITMs, meaning across numa nodes, are the most expensive - especially when there are lots of readers and writers.

 1  =================================================
 2              Trace Event Information
 3  =================================================
 4    Total records                     :     329219  << Total loads and stores sampled.
 5    Locked Load/Store Operations      :      14654
 6    Load Operations                   :      69679  << Total loads
 7    Loads - uncacheable               :          0
 8    Loads - IO                        :          0
 9    Loads - Miss                      :       3972
10    Loads - no mapping                :          0
11    Load Fill Buffer Hit              :      11958
12    Load L1D hit                      :      17235  << loads that hit in the L1 cache.
13    Load L2D hit                      :         21
14    Load LLC hit                      :      14219  << loads that hit in the last level cache (LLC).
15    Load Local HITM                   :       3402  << loads that hit in a modified cache on the same numa node (local HITM).
16    Load Remote HITM                  :      12757  << loads that hit in a modified cache on a remote numa node (remote HITM).
17    Load Remote HIT                   :       5295
18    Load Local DRAM                   :        976  << loads that hit in the local node's main memory.
19    Load Remote DRAM                  :       3246  << loads that hit in a remote node's main memory.
20    Load MESI State Exclusive         :       4222 
21    Load MESI State Shared            :          0
22    Load LLC Misses                   :      22274  << loads not found in any local node caches.
23    LLC Misses to Local DRAM          :        4.4% << % hitting in local node's main memory.
24    LLC Misses to Remote DRAM         :       14.6% << % hitting in a remote node's main memory.
25    LLC Misses to Remote cache (HIT)  :       23.8% << % hitting in a clean cache in a remote node.
26    LLC Misses to Remote cache (HITM) :       57.3% << % hitting in remote modified cache. (most expensive - false sharing)
27    Store Operations                  :     259539  << store instruction sample count
28    Store - uncacheable               :          0
29    Store - no mapping                :         11
30    Store L1D Hit                     :     256696  << stores that got L1 cache when requested.
31    Store L1D Miss                    :       2832  << stores that couldn't get the L1 cache when requested (L1 miss).
32    No Page Map Rejects               :       2376
33    Unable to parse data source       :          1

The second table, (below), in the output file gives a brief one-line summary of the hottest cachelines where false sharing was detected. It’s sorted by which line had the most remote HITMs (or local HITMs if you select that sort option). It gives a nice high level sense for the load and store activity for each cacheline.
I look to see if a cacheline has a high number of “Rmt LLC Load Hitm’s”. If so, it’s time to dig further.

54  =================================================
55             Shared Data Cache Line Table          
56  =================================================
57  #
58  #                              Total      Rmt  ----- LLC Load Hitm -----  ---- Store Reference ----  --- Load Dram ----      LLC    Total  ----- Core Load Hit -----  -- LLC Load Hit --
59  # Index           Cacheline  records     Hitm    Total      Lcl      Rmt    Total    L1Hit   L1Miss       Lcl       Rmt  Ld Miss    Loads       FB       L1       L2       Llc       Rmt
60  # .....  ..................  .......  .......  .......  .......  .......  .......  .......  .......  ........  ........  .......  .......  .......  .......  .......  ........  ........
61  #
62        0            0x602180   149904   77.09%    12103     2269     9834   109504   109036      468       727      2657    13747    40400     5355    16154        0      2875       529
63        1            0x602100    12128   22.20%     3951     1119     2832        0        0        0        65       200     3749    12128     5096      108        0      2056       652
64        2  0xffff883ffb6a7e80      260    0.09%       15        3       12      161      161        0         1         1       15       99       25       50        0         6         1
65        3  0xffffffff81aec000      157    0.07%        9        0        9        1        0        1         0         7       20      156       50       59        0        27         4
66        4  0xffffffff81e3f540      179    0.06%        9        1        8      117       97       20         0        10       25       62       11        1        0        24         7

Next is the Pareto table, which shows lots of valuable information about each contended cacheline. This is the most important table in the output. I only show three cachelines here to keep this blog simple. Here’s what’s in it.

    * Lines 71 and 72 are the column headers for what’s happening in each cacheline.
    * Line 76 shows the HITM and store activity for each cacheline - first with counts for load
       and store activity, followed by the cacheline virtual data address.
    * Then there’s the data address column. Line 76 shows the virtual address of the cacheline.
       Each row underneath it represents the offset into the cacheline where those accesses occurred.
    * The next column shows the pid, and/or the thread id (tid) if you selected that for the output.
    * Following is the instruction pointer code address.
    * Next are three columns showing the average load latencies. I always look here for long
       latency averages, which are a sign of how painful the contention was for that cacheline.
    * The “cpu cnt” column shows how many different cpus the samples came from.
    * Then there’s the function name, binary object name, source file and line number.
    * The last column shows, for each node, the specific cpus that samples came from.

67  =================================================
68        Shared Cache Line Distribution Pareto      
69  =================================================
70  #
71  #        ----- HITM -----  -- Store Refs --        Data address                               ---------- cycles ----------       cpu                                     Shared                                   
72  #   Num      Rmt      Lcl   L1 Hit  L1 Miss              Offset      Pid        Code address  rmt hitm  lcl hitm      load       cnt               Symbol                Object                  Source:Line  Node{cpu list}
73  # .....  .......  .......  .......  .......  ..................  .......  ..................  ........  ........  ........  ........  ...................  ....................  ...........................  ....
74  #
75    -------------------------------------------------------------
76        0     9834     2269   109036      468            0x602180
77    -------------------------------------------------------------
78            65.51%   55.88%   75.20%    0.00%                 0x0    14604            0x400b4f     27161     26039     26017         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:144   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
79             0.41%    0.35%    0.00%    0.00%                 0x0    14604            0x400b56     18088     12601     26671         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:145   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
80             0.00%    0.00%   24.80%  100.00%                 0x0    14604            0x400b61         0         0         0         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:145   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
81             7.50%    9.92%    0.00%    0.00%                0x20    14604            0x400ba7      2470      1729      1897         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:154   1{122}  2{144}
82            17.61%   20.89%    0.00%    0.00%                0x28    14604            0x400bc1      2294      1575      1649         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:158   2{53}  3{170}
83             8.97%   12.96%    0.00%    0.00%                0x30    14604            0x400bdb      2325      1897      1828         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:162   0{96}  3{171}

84    -------------------------------------------------------------
85        1     2832     1119        0        0            0x602100
86    -------------------------------------------------------------
87            29.13%   36.19%    0.00%    0.00%                0x20    14604            0x400bb3      1964      1230      1788         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:155   1{122}  2{144}
88            43.68%   34.41%    0.00%    0.00%                0x28    14604            0x400bcd      2274      1566      1793         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:159   2{53}  3{170}
89            27.19%   29.40%    0.00%    0.00%                0x30    14604            0x400be7      2045      1247      2011         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:163   0{96}  3{171}

90    -------------------------------------------------------------
91        2       12        3      161        0  0xffff883ffb6a7e80
92    -------------------------------------------------------------
93            58.33%  100.00%    0.00%    0.00%                 0x0    14604  0xffffffff810cf16d      1380       941      1229         9  [k] task_tick_fair              [kernel.kallsyms]  atomic64_64.h:21   0{0,4,96}  1{25,120,122}  2{53}  3{170-171}
94            16.67%    0.00%   98.76%    0.00%                 0x0    14604  0xffffffff810c9379      1794         0       625        13  [k] update_cfs_rq_blocked_load  [kernel.kallsyms]  atomic64_64.h:45   0{1,4,96}  1{25,120,122}  2{48,53-54,144}  3{169-171}
95            16.67%    0.00%    0.00%    0.00%                 0x0    14604  0xffffffff810ce098      1382         0       867        12  [k] update_cfs_shares           [kernel.kallsyms]  atomic64_64.h:21   0{1,4,96}  1{25,120,122}  2{53-54,144}  3{169-171}
96             8.33%    0.00%    0.00%    0.00%                 0x8    14604  0xffffffff810cf18c      2560         0       679         8  [k] task_tick_fair              [kernel.kallsyms]  atomic.h:26        0{4,96}  1{24-25,120,122}  2{54}  3{170}
97             0.00%    0.00%    1.24%    0.00%                 0x8    14604  0xffffffff810cf14f         0         0         0         2  [k] task_tick_fair              [kernel.kallsyms]  atomic.h:50        2{48,53}

How I often use “perf c2c”

Here are the flags I most commonly use.

   perf c2c record -F 60000 -a --all-user sleep 5
   perf c2c record -F 60000 -a --all-user sleep 3     // or to sample for a shorter time.
   perf c2c record -F 60000 -a --all-kernel sleep 3   // or to only gather kernel samples.
   perf c2c record -F 60000 -a -u --ldlat 50 sleep 3  // or to collect only loads >= 50 cycles of load latency (30 is the ldlat default).

To generate report files, you can use the interactive tui report or send the output to stdout:

 perf c2c report -NN -c pid,iaddr                 // to use the tui interactive report
 perf c2c report -NN -c pid,iaddr --stdio         // or to send the output to stdout
 perf c2c report -NN -d lcl -c pid,iaddr --stdio  // or to sort on local hitms

By default, symbol names are truncated to a fixed width - for readability.
You can use the “--full-symbols” flag to get full symbol names in the output.
For example:

 perf c2c report -NN -c pid,iaddr --full-symbols --stdio 

Finding the callers to these cachelines:

Sometimes it’s valuable to know who the callers are. Here is how to get call graph information.
I never generate call graph info initially because it emits so much data that it becomes very difficult to see if and where a false sharing problem exists. I find the problem first without call graphs, then if needed I’ll rerun with call graphs.

perf c2c record --call-graph dwarf,8192 -F 60000 -a --all-user sleep 5
perf c2c report -NN -g --call-graph -c pid,iaddr --stdio 

Does bumping perf’s sample rate help?

I’ll sometimes bump the perf sample rate with “-F 60000” or “-F 80000”.
There’s no requirement to do so, but it is a good way to get a richer sample collection in a shorter period of time. If you do, it’s helpful to bump the kernel’s perf sample rate up with the following two echo commands. (see dmesg for “perf interrupt took too long …” sample lowering entries).

 echo    500 > /proc/sys/kernel/perf_cpu_time_max_percent
 echo 100000 > /proc/sys/kernel/perf_event_max_sample_rate
 <then do your "perf c2c record" here>
 echo     50 > /proc/sys/kernel/perf_cpu_time_max_percent

What to do when perf drowns in excessive samples:

When running on larger systems (e.g. 4, 8 or 16 socket systems), there can be so many samples that the perf tool can consume lots of cpu time and the perf.data file size grows significantly.
Some tips to help that include:
- Bump the ldlat from the default of 30 to 50. This frees perf to skip the faster, less interesting loads.
- Lower the sample rate.
- Shorten the sleep time during the “perf record” window. For example, from “sleep 5” to “sleep 3”.

What I’ve learned by using C2C on numerous applications:

It’s common to look at any performance tool output and ask ‘what does all this data mean?’.
Here are some things I’ve learned. Hopefully they’re of help.

    * I tend to run “perf c2c” for 3, 5, or 10 seconds. Running it any longer may take you
       from seeing concurrent false sharing to seeing cacheline accesses which are
       disjoint in time.
    * If you’re not interested in kernel samples, you’ll get better samples in your program by
       specifying --all-user. Conversely, specifying --all-kernel is useful when focusing on the
       kernel.
    * On busy systems with high cpu counts, like >148 cpus, setting --ldlat to a higher value
       (like 50 or even 70) may enable perf to generate richer C2C samples.
    * Look at the Trace Event table at the top, specifically the “LLC Misses to Remote cache HITM”
       number. If it’s not close to zero, then there’s likely false sharing worth pursuing and resolving.
    * Most of the time the top one, two, or three cachelines in the Shared Cache Line Distribution
       Pareto table are the ones to focus on.
    * However, sometimes you’ll see the same code from multiple threads causing “less hot”
       contention, but you will see it on multiple cachelines for different data addresses.
       Even though each of those lines is less hot individually, fixing them is often a
       win because the benefit is spread across many cachelines. This can also happen with
       different processes executing the same code accessing shared memory.
    * In the Shared Cache Line Distribution Pareto table, if you see long average load latencies,
       it’s often a giveaway that false sharing contention is heavy and is hurting performance.
    * Then looking to see what nodes and cpus the samples for those accesses are coming from
       can often be a valuable guide to numa-pinning your processes or memory.
   
For processes using shared memory, it is possible for them to use different virtual addresses,
all pointing to (and contending with) the same shared memory location. They will show
up in the Pareto table as different cachelines, but in fact they are the same cacheline.
These can be tricky to spot. I usually uncover these by first looking to see that shared memory is being used, and then looking for similar patterns in the information provided
for each cacheline.
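
To make that concrete, here is a small sketch (hypothetical; the shared-memory name and sizes are made up, error handling omitted) of how two cooperating processes can map the same POSIX shared-memory object at different virtual addresses, so the same physical cacheline can show up under two different data addresses in the Pareto table.

/*
 * Sketch: two processes map the same POSIX shared-memory object.
 * mmap() may return a different virtual address in each process, so
 * the same physical cacheline can appear in the Pareto table under
 * different "Data address" values.
 * (The name "/c2c_demo" is made up; error handling omitted for brevity.)
 * Build: gcc -O2 shm_demo.c -lrt   ...then run two copies at once.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/c2c_demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);

    volatile long *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    printf("pid %d mapped the shared line at %p\n", getpid(), (void *)p);

    for (long i = 0; i < 100000000L; i++)
        p[0]++;                        /* both processes contend on offset 0x0 */

    munmap((void *)p, 4096);
    close(fd);
    return 0;
}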

Last, the Shared Cache Line Distribution Pareto table can also provide great insight into any
ill-aligned hot data.
For example (a layout sketch follows this list):
    * It’s easy to spot heavily modified variables that need to be placed into their own cachelines.
       This will enable them to be less contended (and run faster), and it will help accesses to
       the other variables that shared their cacheline to not be slowed down.
    * It’s easy to spot hot locks or mutexes that are unaligned and spill into multiple cachelines.
    * It’s easy to spot “read mostly” variables which can be grouped together into their own
       cachelines.
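
As a rough illustration of the layout fixes those findings usually lead to, here is a sketch (hypothetical struct and field names, assuming 64-byte cachelines) that gives heavily written fields their own cachelines, aligns a hot mutex so it doesn’t straddle two lines, and groups read-mostly fields together.

/*
 * Sketch of the usual layout fixes, assuming 64-byte cachelines.
 * (Hypothetical struct; the field names are made up for illustration.)
 */
#include <pthread.h>

#define CACHELINE 64

/* Heavily modified per-thread counters: give each its own cacheline so
 * the writers stop invalidating each other's lines.                    */
struct counters {
    _Alignas(CACHELINE) long thread0_count;
    _Alignas(CACHELINE) long thread1_count;
};

/* With the alignment, the two counters land 64 bytes apart. */
_Static_assert(sizeof(struct counters) == 2 * CACHELINE,
               "each counter gets its own cacheline");

/* A hot mutex aligned so it no longer spills into two cachelines, with
 * the read-mostly configuration fields grouped into their own line.    */
struct hot_state {
    _Alignas(CACHELINE) pthread_mutex_t lock;   /* frequently written    */
    _Alignas(CACHELINE) int  config_flags;      /* read-mostly from here */
    int  max_items;
    long refresh_interval;
};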

The raw samples can be helpful.

I’ve often found it valuable to take a peek at the raw instruction samples contained in the perf.data file (the one generated by the “perf c2c record”). You can get those raw samples using “perf script”. See man perf-script. The output may be cryptic, but you can sort on the load weight (5th column) to see which loads suffered the most from false sharing contention and took the longest to execute.

The c2c functionality is available in the upstream perf as of the Linux 4.10 kernel.

Lastly, this was a collective effort.

Although Don Zickus, Dick Fowles and Joe Mario worked together to get this implemented, we got lots of early help from Arnaldo Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.
Additionally Jiri has been heavily involved recently integrating the c2c functionality into perf.
A big thanks to all of you for helping to pull this together!
