Do you run your application in a NUMA environment? Is it multi-threaded? Is it multi-process with shared memory? If so, is your performance impacted by false sharing?
Now there’s a way to easily find out. We’re posting patches for a new feature to the Linux perf tool, called “c2c” for cache-2-cache.
We at Red Hat have been running the development prototype of c2c on lots of big Linux applications and it’s uncovered many hot false sharing cachelines.
I’ve been playing with this tool quite a bit. It is pretty cool. Let me share a little about what it is and how to use it.
At a high level, “perf c2c” will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
* The pid, tid, instruction addr, function name, binary object name for those readers and writers.
* The source file and line number for each reader and writer.
* The average load latency for the loads to those cachelines.
* Which numa nodes the samples a cacheline came from and which cpus were involved.
Using perf c2c is similar to using the Linux perf tool today.
First collect data with “perf c2c record ” Then generate a report output with “perf c2c report ”
Before covering the output data, here is a “how to” for the flags to use when calling “perf c2c”:
c2c usage flags
Then here’s an output file from a recent “perf c2c” run I did:
c2c output file
And, if you want to play with it yourself, here’s a simple source file to generate lots of false sharing.
False sharing .c src file
First I’ll go over the output file to highlight the interesting fields.
This first table in the output file gives a high level summary of all the load and store samples collected. It is interesting to see where your program’s load instructions got their data.
Notice the term “HITM”, which stands for a load that hit in a modified cacheline. That’s the key that false sharing has occured. Remote HITMs, meaning across numa nodes, are the most expensive - especially when there are lots of readers and writers.
1 =================================================
2 Trace Event Information
3 =================================================
4 Total records : 329219 << Total loads and stores sampled.
5 Locked Load/Store Operations : 14654
6 Load Operations : 69679 << Total loads
7 Loads - uncacheable : 0
8 Loads - IO : 0
9 Loads - Miss : 3972
10 Loads - no mapping : 0
11 Load Fill Buffer Hit : 11958
12 Load L1D hit : 17235 << loads that hit in the L1 cache.
13 Load L2D hit : 21
14 Load LLC hit : 14219 << loads that hit in the last level cache (LLC).
15 Load Local HITM : 3402 << loads that hit in a modified cache on the same numa node (local HITM).
16 Load Remote HITM : 12757 << loads that hit in a modified cache on a remote numa node (remote HITM).
17 Load Remote HIT : 5295
18 Load Local DRAM : 976 << loads that hit in the local node's main memory.
19 Load Remote DRAM : 3246 << loads that hit in a remote node's main memory.
20 Load MESI State Exclusive : 4222
21 Load MESI State Shared : 0
22 Load LLC Misses : 22274 << loads not found in any local node caches.
23 LLC Misses to Local DRAM : 4.4% << % hitting in local node's main memory.
24 LLC Misses to Remote DRAM : 14.6% << % hitting in a remote node's main memory.
25 LLC Misses to Remote cache (HIT) : 23.8% << % hitting in a clean cache in a remote node.
26 LLC Misses to Remote cache (HITM) : 57.3% << % hitting in remote modified cache. (most expensive - false sharing)
27 Store Operations : 259539 << store instruction sample count
28 Store - uncacheable : 0
29 Store - no mapping : 11
30 Store L1D Hit : 256696 << stores that got L1 cache when requested.
31 Store L1D Miss : 2832 << stores that couldn't get the L1 cache when requested (L1 miss).
32 No Page Map Rejects : 2376
33 Unable to parse data source : 1
The second table, (below), in the output file gives a brief one-line summary of the hottest cachelines where false sharing was detected. It’s sorted by which line had the most remote HITMs (or local HITMs if you select that sort option). It gives a nice high level sense for the load and store activity for each cacheline.
I look to see if a cacheline has a high number of “Rmt LLC Load Hitm’s”. If so, it’s time to dig further.
54 =================================================
55 Shared Data Cache Line Table
56 =================================================
57 #
58 # Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- --- Load Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit --
59 # Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt
60 # ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ....... ........ ........
61 #
62 0 0x602180 149904 77.09% 12103 2269 9834 109504 109036 468 727 2657 13747 40400 5355 16154 0 2875 529
63 1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65 200 3749 12128 5096 108 0 2056 652
64 2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1 1 15 99 25 50 0 6 1
65 3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7 20 156 50 59 0 27 4
66 4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0 10 25 62 11 1 0 24 7
Next is the Pareto table, which shows lots of valuable information about each contended cacheline. This is the most important table in the output. I only show three cachelines here to keep this blog simple. Here’s what’s in it.
* Lines 71 and 72 are the column headers for what’s happening in each cacheline.
* Line 76 shows the HITM and store activity for each cacheline - first with counts for load
and store activity, followed by the cacheline virtual data address.
* Then there’s the data address column. Line 76 shows the virtual address of the cacheline.
Each row underneath is represents the offset into the cachline where those accesses occured.
* The next column shows the pid, and/or the thread id (tid) if you selected that for the output.
* Following is the instruction pointer code address.
* Next are three columns showing the average load latencies. I always look here for long
latency averages, which is a sign for how painful the contention was to that cacheline.
* The “cpu cnt” column shows how many different cpus samples came from.
* Then there’s the function name, binary object name, source file and line number.
* The last column shows for each node, the specific cpus that samples came from.
67 =================================================
68 Shared Cache Line Distribution Pareto
69 =================================================
70 #
71 # ----- HITM ----- -- Store Refs -- Data address ---------- cycles ---------- cpu Shared
72 # Num Rmt Lcl L1 Hit L1 Miss Offset Pid Code address rmt hitm lcl hitm load cnt Symbol Object Source:Line Node{cpu list}
73 # ..... ....... ....... ....... ....... .................. ....... .................. ........ ........ ........ ........ ................... .................... ........................... ....
74 #
75 -------------------------------------------------------------
76 0 9834 2269 109036 468 0x602180
77 -------------------------------------------------------------
78 65.51% 55.88% 75.20% 0.00% 0x0 14604 0x400b4f 27161 26039 26017 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:144 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
79 0.41% 0.35% 0.00% 0.00% 0x0 14604 0x400b56 18088 12601 26671 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:145 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
80 0.00% 0.00% 24.80% 100.00% 0x0 14604 0x400b61 0 0 0 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:145 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
81 7.50% 9.92% 0.00% 0.00% 0x20 14604 0x400ba7 2470 1729 1897 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:154 1{122} 2{144}
82 17.61% 20.89% 0.00% 0.00% 0x28 14604 0x400bc1 2294 1575 1649 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:158 2{53} 3{170}
83 8.97% 12.96% 0.00% 0.00% 0x30 14604 0x400bdb 2325 1897 1828 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:162 0{96} 3{171}
84 -------------------------------------------------------------
85 1 2832 1119 0 0 0x602100
86 -------------------------------------------------------------
87 29.13% 36.19% 0.00% 0.00% 0x20 14604 0x400bb3 1964 1230 1788 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:155 1{122} 2{144}
88 43.68% 34.41% 0.00% 0.00% 0x28 14604 0x400bcd 2274 1566 1793 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:159 2{53} 3{170}
89 27.19% 29.40% 0.00% 0.00% 0x30 14604 0x400be7 2045 1247 2011 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:163 0{96} 3{171}
90 -------------------------------------------------------------
91 2 12 3 161 0 0xffff883ffb6a7e80
92 -------------------------------------------------------------
93 58.33% 100.00% 0.00% 0.00% 0x0 14604 0xffffffff810cf16d 1380 941 1229 9 [k] task_tick_fair [kernel.kallsyms] atomic64_64.h:21 0{0,4,96} 1{25,120,122} 2{53} 3{170-171}
94 16.67% 0.00% 98.76% 0.00% 0x0 14604 0xffffffff810c9379 1794 0 625 13 [k] update_cfs_rq_blocked_load [kernel.kallsyms] atomic64_64.h:45 0{1,4,96} 1{25,120,122} 2{48,53-54,144} 3{169-171}
95 16.67% 0.00% 0.00% 0.00% 0x0 14604 0xffffffff810ce098 1382 0 867 12 [k] update_cfs_shares [kernel.kallsyms] atomic64_64.h:21 0{1,4,96} 1{25,120,122} 2{53-54,144} 3{169-171}
96 8.33% 0.00% 0.00% 0.00% 0x8 14604 0xffffffff810cf18c 2560 0 679 8 [k] task_tick_fair [kernel.kallsyms] atomic.h:26 0{4,96} 1{24-25,120,122} 2{54} 3{170}
97 0.00% 0.00% 1.24% 0.00% 0x8 14604 0xffffffff810cf14f 0 0 0 2 [k] task_tick_fair [kernel.kallsyms] atomic.h:50 2{48,53}
How I often use “perf c2c”
Here are the flags I most commonly use.
perf c2c record -F 60000 -a --all-user sleep 5
perf c2c record -F 60000 -a --all-user sleep 3 // or to sample for a shorter time.
perf c2c record -F 60000 -a --all-kernel sleep 3 // or to only gather kernel samples.
perf c2c record -F 60000 -a -u --ldlat 50 sleep 3 // or to collect only loads >= 50 cycles of load latency (30 is the ldlat default).
To generate report files, you can use the graphical tui report or send the output to stdout:
perf report -NN -c pid,iaddr // to use the tui interactive report
perf report -NN -c pid,iaddr --stdio // or to send the output to stdout
perf report -NN -d lcl -c pid,iaddr --stdio // or to sort on local hitms
By default, symbol names are truncated to a fixed width - for readability.
You can use the “–full-symbols” flag to get full symbol names in the output.
For example:
perf c2c report -NN -c pid,iaddr --full-symbols --stdio
Finding the callers to these cachelines:
Sometimes it’s valuable to know who the callers are. Here is how to get call graph information.
I never generate call graph info initially because it emits so much data, it makes it very difficult to see if and where a false sharing problem exists. I find the problem first without call graphs, then if needed I’ll rerun with call graphs.
perf c2c record --call-graph dwarf,8192 -F 60000 -a --all-user sleep 5
perf c2c report -NN -g --call-graph -c pid,iaddr --stdio
Does bumping perf’s sample rate help?
I’ll sometimes bump the perf sample rate with “-F 60000” or “-F 80000”.
There’s no requirement to do so, but it is a good way to get a richer sample collection in a shorter period of time. If you do, it’s helpful to bump the kernel’s perf sample rate up with the following two echo commands. (see dmesg for “perf interrupt took too long …” sample lowering entries).
echo 500 > /proc/sys/kernel/perf_cpu_time_max_percent
echo 100000 > /proc/sys/kernel/perf_event_max_sample_rate
<then do your "perf c2c record" here>
echo 50 > /proc/sys/kernel/perf_cpu_time_max_percent
What to do when perf drowns in excessive samples:
When running on larger systems (e.g. 4, 8 or 16 socket systems), there can be so many samples that the perf tool can consume lots of cpu time and the perf.data file size grows significantly.
Some tips to help that include:
- Bump the ldlat from the default of 30 to 50. This free’s perf to skip the faster non-interesting loads.
- Lower the sample rate.
- Shorten the sleep time during the “perf record” window. For ex, from “sleep 5” to “sleep 3”.
What I’ve learned by using C2C on numerous applications:
It’s common to look at any performance tool output and ask ‘what does all this data mean?’.
Here are some things I’ve learned. Hopefully they’re of help.
* I tend to run “perf c2c” for 3, 5, or 10 seconds. Running it any longer may take you
from seeing concurrent false sharing to seeing cacheline accesses which are
disjoint in time.
* If you’re not interested in kernel samples, you’ll get better samples in your program by
specifying –all—user. Conversely, specifying –all-kernel is useful when focusing on the
kernel.
* On busy systems with high cpu counts , like >148 cpus, setting –ldlat to a higher value
(like 50 or even 70) may enable perf to generate richer C2C samples.
* Look at the Trace Event table at the top, specifically the “LLC Misses to Remote cache HITM”
number. If it’s not close to zero, then there’s likely worthwhile false sharing to pursue resolving.
* Most of the time the top one, two, or three cachelines in the Shared Cache Line Distribution
Pareto table are the ones to focus on.
* However, sometimes you’ll see the same code from multiple threads causing “less hot”
contention, but you will see it on multiple cachelines for different data addresses.
Even though any one of those lines are less hot individually, fixing them is often a
win because the benefit is spread across many cachelines. This can also happen with
different processes executing the same code accessing shared memory.
* In the Shared Cache Line Distribution Pareto table, if you see long load average load latencies,
it’s often a giveaway that false sharing contention is heavy and is hurting performance.
* Then looking to see what nodes and cpus the samples for those accesses are coming from
can often be a valuable guide to numa-pinning your processes or memory.
For processes using shared memory, it is possible for them to use different virtual addresses,
all pointing to (and contending with) the same shared memory location. They will show
up in the Pareto table as different cachelines, but in fact they are the same cacheline.
These can be tricky to spot. I usually uncover these by first looking to see that shared memory is being used, and then looking for similar patterns in the information provided
for each cacheline.
Last, the Shared Cache Line Distribution Pareto table can also provide great insight into any
ill-aligned hot data.
For example:
* It’s easy to spot heavily modified variables that need to be placed into their own cachelines.
This will enable them to be less contended (and run faster), and it will help accesses to
the other variables that shared their cacheline to not be slowed down.
* It’s easy to spot hot locks or mutexes that are unaligned and spill into multiple cachelines.
* It’s easy to spot “read mostly” variables which can be grouped together into their own
cachelines.
The raw samples can be helpful.
I’ve often found it valuable to take a peek at the raw instruction samples contained in the perf.data file (the one generated by the “perf c2c record”). You can get those raw samples using “perf script”. See man perf-script. The output may be cryptic, but you can sort on the load weight (5th column) to see which loads suffered the most from false sharing contention and took the longest to execute.
The c2c functionality is available in the upstream perf as of the Linux 4.2 kernel./h4>
Lastly, this was a collective effort.
Although Don Zickus, Dick Fowles and Joe Mario worked together to get this implemented, we got lots of early help from Arnaldo Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.
Additionally Jiri has been heavily involved recently integrating the c2c functionality into perf.
A big thanks to all of you for helping to pull this together!