[Repost] A sar usage case study

Editor's note

## Problem Analysis

The text describes a machine that appeared to stop responding for a while. The analysis attributes this to high CPU usage.

**Symptoms:**

* The OS does not respond for some time.
* The logs contain repeated `perf: interrupt took too long` messages.
* The logs also contain many `promiscuous mode` entries.

**Possible Causes:**

* High CPU usage could be making the OS unresponsive.
* The `perf: interrupt took too long` messages suggest that perf's sampling interrupts were running longer than allowed.
* The `promiscuous mode` entries normally accompany packet capture and are expected log output, although they occur often enough to deserve attention.

**Further Investigation:**

* Find the cause of the high CPU usage, for example by analyzing resource usage and checking the logs for related errors.
* Compare the times of the perf messages with the recorded CPU usage.
* Find out which software is putting the interfaces into promiscuous mode.
* Check the system logs for other errors or warnings from the same period.

**Recommendations:**

* Monitor the machine's CPU usage and the perf messages closely.
* Identify the root cause of the high CPU usage and address it.

Main text

https://www.jianshu.com/p/b93342d43e13
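Problem description

At a certain point in time the OS on one machine appeared to stop responding, which led users to believe the machine had rebooted at that moment. This article analyzes that problem.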

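Log review

-- The BMC log confirms that the machine did not reboot at that time.
-- The OS log does not record a reboot at that time either, but it does record some anomalies, shown below: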

    Oct 23 12:46:59 localhost kernel: device ens3f1 left promiscuous mode
    Oct 23 13:11:32 localhost rsyslogd: imjournal: journal reloaded... [v8.24.0 try http://www.rsyslog.com/e/0 ]
    Oct 23 13:11:32 localhost rsyslogd: imjournal: journal reloaded... [v8.24.0 try http://www.rsyslog.com/e/0 ]
    Oct 23 13:46:25 localhost kernel: device ens6f1 entered promiscuous mode
    Oct 23 13:46:28 localhost kernel: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 907875.014 msecs
    Oct 23 13:46:28 localhost kernel: perf: interrupt took too long (120883 > 7575), lowering kernel.perf_event_max_sample_rate to 1000
    Oct 23 13:46:31 localhost kernel: INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 982076.016 msecs
    Oct 23 13:46:31 localhost kernel: INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 982837.016 msecs
    Oct 23 13:46:37 localhost kernel: device ens6f1 left promiscuous mode
    Oct 23 13:46:38 localhost kernel: device ens3f1 entered promiscuous mode
    Oct 23 13:46:59 localhost kernel: device ens3f1 left promiscuous mode
    Oct 23 13:50:19 localhost kernel: perf: interrupt took too long (151272 > 151103), lowering kernel.perf_event_max_sample_rate to 1000
    Oct 23 13:52:39 localhost start_filebeat.sh: /opt/pamon/filebeat-redis/start_filebeat.sh: line 14: 72199 Killed                  $filebeat_exe -c $filebeat_config
    Oct 23 13:55:26 localhost kernel: perf: interrupt took too long (189323 > 189090), lowering kernel.perf_event_max_sample_rate to 1000
    Oct 23 14:17:02 localhost auditd[1623]: Audit daemon rotating log files
    Oct 23 14:30:17 localhost kernel: perf: interrupt took too long (237320 > 236653), lowering kernel.perf_event_max_sample_rate to 1000
    Oct 23 14:37:39 localhost su: (to redis) padba on pts/0
    Oct 23 14:39:34 localhost su: (to redis) padba on pts/1
    Oct 23 14:51:17 localhost kernel: device ens6f1 entered promiscuous mode
    Oct 23 14:51:37 localhost kernel: device ens6f1 left promiscuous mode
    Oct 23 14:51:42 localhost kernel: device ens3f1 entered promiscuous mode
    Oct 23 14:51:59 localhost kernel: device ens3f1 left promiscuous mode
    Oct 23 15:09:42 localhost kernel: hrtimer: interrupt took 542805 ns
    Oct 23 15:53:37 localhost kernel: device ens6f1 entered promiscuous mode
    Oct 23 15:54:07 localhost kernel: device ens6f1 left promiscuous mode
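-- Clearly the OS did not reboot at that time, but some abnormal messages were logged, such as perf: interrupt took too long, NMI handler, and promiscuous mode. They are analyzed as follows:
1. promiscuous mode: this is usually triggered when monitoring software on the host captures traffic. The log shows it recurring, but because it normally accompanies packet capture, these entries are normal. They do occur rather frequently, though, so the customer should keep an eye on them.
2. perf: interrupt took too long: this usually means that the interrupt perf uses to collect samples ran for a long time without returning.
3. NMI handler: similar to the perf message above.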
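To judge how frequent the promiscuous mode entries really are, one option is to pair each "entered" line with the following "left" line for the same device. The sketch below does this for a few lines copied from the excerpt above. The function name `promiscuous_sessions` and the placeholder year are assumptions made for illustration (syslog timestamps carry no year), and the month parsing assumes an English locale.

```python
import re
from datetime import datetime

# A few lines copied from the log excerpt above.
LOG = """\
Oct 23 12:46:59 localhost kernel: device ens3f1 left promiscuous mode
Oct 23 13:46:25 localhost kernel: device ens6f1 entered promiscuous mode
Oct 23 13:46:37 localhost kernel: device ens6f1 left promiscuous mode
Oct 23 13:46:38 localhost kernel: device ens3f1 entered promiscuous mode
Oct 23 13:46:59 localhost kernel: device ens3f1 left promiscuous mode
"""

PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"device (\S+) (entered|left) promiscuous mode"
)


def promiscuous_sessions(text, year=2000):
    """Return (device, start time, duration in seconds) for each entered/left pair.

    `year` is only a placeholder so the timestamps can be parsed.
    """
    open_since = {}   # device -> time it entered promiscuous mode
    sessions = []
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if not match:
            continue
        stamp, device, action = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if action == "entered":
            open_since[device] = when
        elif device in open_since:
            # A "left" line without a preceding "entered" line is skipped.
            start = open_since.pop(device)
            seconds = int((when - start).total_seconds())
            sessions.append((device, start.strftime("%H:%M:%S"), seconds))
    return sessions


if __name__ == "__main__":
    for device, start, seconds in promiscuous_sessions(LOG):
        print(f"{device}: entered at {start}, stayed {seconds}s")
```

Short sessions that repeat at regular times would be consistent with a scheduled capture job, which is something to confirm with whoever operates the monitoring software.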

The perf messages therefore need a closer look. Searching the OS log for perf entries gives the following:

image.png
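The search for perf entries can also be scripted. The following is a minimal sketch, not the method used in the original analysis: it filters the `perf: interrupt took too long` lines and extracts the two numbers in parentheses. The log lines are copied from the excerpt above; the function name `perf_events` is hypothetical.

```python
import re

# Lines copied from the log excerpt above (one unrelated line kept on purpose).
LOG = """\
Oct 23 13:46:28 localhost kernel: perf: interrupt took too long (120883 > 7575), lowering kernel.perf_event_max_sample_rate to 1000
Oct 23 13:46:37 localhost kernel: device ens6f1 left promiscuous mode
Oct 23 13:50:19 localhost kernel: perf: interrupt took too long (151272 > 151103), lowering kernel.perf_event_max_sample_rate to 1000
Oct 23 13:55:26 localhost kernel: perf: interrupt took too long (189323 > 189090), lowering kernel.perf_event_max_sample_rate to 1000
Oct 23 14:30:17 localhost kernel: perf: interrupt took too long (237320 > 236653), lowering kernel.perf_event_max_sample_rate to 1000
"""

PERF = re.compile(
    r"(\d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"perf: interrupt took too long \((\d+) > (\d+)\)"
)


def perf_events(text):
    """Return (time, measured value, allowed value) for each perf warning."""
    events = []
    for line in text.splitlines():
        match = PERF.search(line)
        if match:
            stamp, took, allowed = match.groups()
            events.append((stamp, int(took), int(allowed)))
    return events


if __name__ == "__main__":
    for stamp, took, allowed in perf_events(LOG):
        print(f"{stamp}  measured={took}  allowed={allowed}")
```

The extracted timestamps are what the later sar comparison needs: they mark the moments whose CPU usage is worth checking.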

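The OS log contains many abnormal perf messages. perf is a performance analysis tool on Linux, so the preliminary judgment is that the problem may be related to the machine's performance.

The performance data recorded by sar was checked next. Looking at the data around the times of the anomalies above shows a pattern: each of them coincides with high CPU usage.

To verify this guess, the data was converted further.

The sar files in the sos report were used to examine CPU usage: the CPU usage figures from sar07/sar23/sar13 were imported into an Excel sheet and plotted as line charts, shown below.
sar07 is shown below. The line chart shows that at around 9:00, approaching 10:00, CPU idle is very low, which means CPU usage is very high.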

image.png

sar23 is shown below. The line chart shows that between 13:00 and shortly before 14:00, CPU usage rises sharply.

image.png
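The conversion into a spreadsheet can be scripted instead of done by hand. The sketch below assumes the input is only the CPU section of a sar text report, in the usual `sar -u` column layout. The sample values are invented for illustration and are not taken from the sar07/sar23/sar13 files; `parse_sar_cpu` and `to_csv` are hypothetical helper names.

```python
import csv
import io

# Invented sample in the `sar -u` layout; NOT data from the analyzed machine.
SAMPLE = """\
13:20:01        CPU     %user     %nice   %system   %iowait    %steal     %idle
13:30:01        all      3.10      0.00      1.20      0.05      0.00     95.65
13:40:01        all      4.00      0.00      1.50      0.10      0.00     94.40
13:50:01        all     71.30      0.00     20.40      0.30      0.00      8.00
Average:        all     26.13      0.00      7.70      0.15      0.00     66.02
"""


def parse_sar_cpu(text):
    """Parse the 'all' CPU rows of a sar CPU section into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("Average"):
            continue
        if "%idle" in tokens and "CPU" in tokens:
            header = tokens[tokens.index("CPU"):]  # ['CPU', '%user', ...]
            continue
        if header is None or len(tokens) < len(header) + 1:
            continue  # banner, restart marker, or other short line
        values = tokens[-len(header):]
        if values[0] != "all":
            continue  # skip per-CPU rows
        # Everything before the columns is the timestamp (may include AM/PM).
        row = {"time": " ".join(tokens[:-len(header)])}
        for name, value in zip(header[1:], values[1:]):
            row[name] = float(value)
        rows.append(row)
    return rows


def to_csv(rows):
    """Render the parsed rows as CSV text that a spreadsheet can import."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(parse_sar_cpu(SAMPLE)))
```

Plotting the `%idle` column of the resulting CSV against `time` reproduces the kind of line chart described above.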

Conclusion

In summary, CPU usage was too high at each of the times the problem occurred, so excessive CPU usage can be judged to be the cause of the problem.
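The same check can be made without a chart by testing whether each anomaly timestamp falls inside a sar interval with low idle. The sketch below rests on several assumptions: the `%idle` values are invented, each sar sample is taken to cover the 10 minutes ending at its timestamp, and the 20% idle threshold is arbitrary. Only the event times come from the log excerpt. The function names are hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

# (sample time, %idle) pairs; the idle values are invented for illustration.
SAMPLES = [
    ("13:30:01", 95.65),
    ("13:40:01", 94.40),
    ("13:50:01", 8.00),
    ("14:00:01", 12.50),
    ("14:10:01", 90.10),
]

# Times of perf messages taken from the log excerpt.
EVENTS = ["13:46:28", "13:50:19", "13:55:26"]


def busy_intervals(samples, idle_below=20.0, interval_minutes=10):
    """Return (start, end) for every sample whose %idle is under the threshold.

    Assumes each sample describes the interval that ends at its timestamp.
    """
    intervals = []
    for stamp, idle in samples:
        if idle < idle_below:
            end = datetime.strptime(stamp, FMT)
            intervals.append((end - timedelta(minutes=interval_minutes), end))
    return intervals


def events_in_busy_intervals(events, intervals):
    """Map each event time to True if it lies inside any busy interval."""
    hits = {}
    for stamp in events:
        when = datetime.strptime(stamp, FMT)
        hits[stamp] = any(start < when <= end for start, end in intervals)
    return hits


if __name__ == "__main__":
    intervals = busy_intervals(SAMPLES)
    for stamp, hit in events_in_busy_intervals(EVENTS, intervals).items():
        print(stamp, "inside a low-idle interval" if hit else "outside")
```

A match only shows that the two things happened at the same time. It supports the conclusion above but does not by itself identify which process consumed the CPU.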

Viewing real-time NIC throughput with sar

Generate traffic with iperf, then use sar to monitor the real-time rate of each NIC, as shown below:

image.png

The command used is:

    [root@localhost ~]# sar -n DEV 1
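The rxkB/s and txkB/s columns that this command prints are easier to compare with a link speed once converted to Mbit/s. The sketch below parses a captured sample of the output. The sample numbers are invented, the function name `iface_rates_mbit` is hypothetical, and the conversion assumes that one kB in sar output is 1024 bytes, which is exposed as a parameter in case that does not hold for the installed sysstat version.

```python
# Invented sample in the `sar -n DEV` layout; NOT output from the test above.
SAMPLE = """\
14:00:01        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
14:00:02       ens3f1  81000.00  40500.00 120000.00   2600.00      0.00      0.00      0.00
14:00:02           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
"""


def iface_rates_mbit(text, bytes_per_kb=1024):
    """Return {interface: (rx Mbit/s, tx Mbit/s)} from the latest sample seen."""
    header = None
    rates = {}
    for line in text.splitlines():
        tokens = line.split()
        if "IFACE" in tokens:
            header = tokens[tokens.index("IFACE"):]
            continue
        if header is None or len(tokens) < len(header) + 1:
            continue
        if tokens[0].startswith("Average"):
            continue
        # Align values with the header from the right, so an AM/PM
        # suffix in the timestamp does not shift the columns.
        fields = dict(zip(header, tokens[-len(header):]))
        rx = float(fields["rxkB/s"]) * bytes_per_kb * 8 / 1e6
        tx = float(fields["txkB/s"]) * bytes_per_kb * 8 / 1e6
        rates[fields["IFACE"]] = (round(rx, 2), round(tx, 2))
    return rates


if __name__ == "__main__":
    for iface, (rx, tx) in iface_rates_mbit(SAMPLE).items():
        print(f"{iface}: rx {rx} Mbit/s, tx {tx} Mbit/s")
```

Looking columns up by header name rather than by position keeps the sketch usable if the installed sar prints extra columns.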
