[转帖]Core dump实战分析之Java版

core,dump,实战,分析,java · 浏览次数 : 0

小编点评

The context describes how to analyze a Java core dump using jstack and jmap tools, and how to interpret the obtained information. **Step 1: Using jstack to analyze the core dump** * Run the `jstack --helpjstack` command to get help on using jstack with a core dump. * Connect to a core file using `jstack /landon/lib/jdk/bin/java core.30298 >> core.30298.jstack.log`. * This command will display the thread state, including call stack information, for each thread in the core dump. **Step 2: Using jmap to analyze the core dump** * Run the `jmap --helpjmap` command to get help on using jmap with a core dump. * Connect to a core file using `jmap --helpjmap [option] <executable <core>`. * Use the following options with `jmap`: * `-a`: Specify the assembly file to analyze. * `-l`: Specify the number of threads to analyze. * This allows you to analyze specific threads within the core dump. **Step 3: Interpreting the generated information** * Look for lines in the output of `jstack` that indicate the size of the thread stack and other information. * Use `jmap` to analyze specific threads and look for relevant information such as call stack frames, debugging symbols, and thread state. **Additional Notes:** * The context mentions that some information may be unavailable due to stripping. * The `file` command can be used to check if a core file is stripped and, if so, how to access the debugging information. * The context also recommends using the `debuginfo-install` tool to install additional debugging information for the core.

正文

https://www.jianshu.com/p/2cdf71f99209
复制

Core dump实战分析（Java版）

背景

项目中的battleserver进程在某一段时间总是crash，无法找到具体Crash原因
- Java通过JNI调用Luajit
那么进程Crash如何找到JNI的堆栈（C层），进而确认底层问题呢？

Crash分析之hs_error

Java进程Crash后通常会生成一个hs_error%pid.log
- %p指进程的pid
- hs_error指HotSpot JVM error
该文件的位置可通过JVM参数'-XX:ErrorFile'指定如
- -XX:ErrorFile=/landon/business/battle/hs_error%p.log

一个典型的hs_error文件内容如下

第一部分文件头

Crash原因：SIGSEGV (0xb)
Problematic frame:C，即堆栈出现在了C层（native code）
Core dump已经写入，如果没有写入，会提示
- Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
- 如果要开启core dump，一定要打开ulimit -c参数

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f838c76f1ae, pid=8257, tid=0x00007f838b5b6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_221-b11) (build 1.8.0_221-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.221-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libluajit-5.1.so.2+0xc1ae]
#
# Core dump written. Default location: /landon/battle_server/online/core or core.8257
复制

第二部分是Crash的线程信息

这个看到出问题的线程是SOFA-SEV-BOLT-BIZ-12200-3-T11
_thread_in_native表示在执行native代码
中间一部分是寄存器上下文
- 从这里可以看到各寄存器的内存映射，从libjnlua5.1.so到ibjvm.so
- RSP堆栈指针指向了当前线程
最后一部分是堆栈信息
- 从Java frames到Native frames
- 从这里发现：无法看到Native frames

---------------  T H R E A D  ---------------

Current thread (0x00007f83a001e000):  JavaThread "SOFA-SEV-BOLT-BIZ-12200-3-T11" [_thread_in_native, id=27748, stack(0x00007f838b4b6000,0x00007f838b5b7000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000004

Registers:
Top of Stack: (sp=0x00007f838b5b4f70)
Instructions

Register to memory mapping:

RAX=0x0000000000000000 is an unknown value
RBX=0x0000000040ffb9c0 is an unknown value
RCX=0x0000000000000000 is an unknown value
RDX=0x00000000489f80d0 is an unknown value
RSP=0x00007f838b5b4f70 is pointing into the stack for thread: 0x00007f83a001e000
RBP=0x00000000407c23b8 is an unknown value
RSI=0x0000000041cb0e88 is an unknown value
RDI=0x00000000407c23b8 is an unknown value
R8 =0x00000007bfc84540 is an oop
com.naef.jnlua.LuaState 
 - klass: 'com/naef/jnlua/LuaState'
R9 =0x00000007bfc84540 is an oop
com.naef.jnlua.LuaState 
 - klass: 'com/naef/jnlua/LuaState'
R10=0x00000000000006ba is an unknown value
R11=0x00007f843da2cf3c: <offset 0xfccf3c> in /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so at 0x00007f843ca60000
R12=0x0000000000000000 is an unknown value
R13=0x00007f838b5b50d0 is pointing into the stack for thread: 0x00007f83a001e000
R14=0x00000000407c2fa8 is an unknown value
R15=0x00007f838c9e43e0: <offset 0xa3e0> in /landon/lib/libjnlua5.1.so at 0x00007f838c9da000


Stack: [0x00007f838b4b6000,0x00007f838b5b7000],  sp=0x00007f838b5b4f70,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libluajit-5.1.so.2+0xc1ae]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 4447  com.naef.jnlua.LuaState.lua_gc(II)I (0 bytes) @ 0x00007f84291dd278 [0x00007f84291dd240+0x38]
J 5983 C2 com.landon30.jlua.pool.JLua.call(Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object; (206 bytes) @ 0x00007f842946e01c [0x00007f842946dbe0+0x43c]
J 6040 C1 com.landon30.jlua.LuaCallManager.callLua(Ljava/util/Set;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object; (474 bytes) @ 0x00007f842965fd1c [0x00007f842965f4e0+0x83c]
复制

第三，四部分主要是Crash时进程和系统信息

包括所有的Java线程以及Current thread
还包括Java的堆信息等
最后是系统信息

---------------  P R O C E S S  ---------------

Java Threads: ( => current thread )
=>0x00007f83a001e000 JavaThread "SOFA-SEV-BOLT-BIZ-12200-3-T11" [_thread_in_native, id=27748, stack(0x00007f838b4b6000,0x00007f838b5b7000)]

VM state:not at safepoint (normal execution)
VM Mutex/Monitor currently owned by a thread: None

Heap:
Card table byte_map:
Marking Bits:
Polling page:
CodeCache:

Compilation events:
GC Heap History:
Deoptimization events:
Classes redefined:
Internal exceptions:
Events:

Dynamic libraries:

VM Arguments:
Environment Variables:
Signal Handlers:

---------------  S Y S T E M  ---------------
复制

hs_error总结
- 通常在Java层面的crash，直接通过hs_error就可以直接定位，因为有堆栈信息
- 不过如果是C层面的Crash，则无堆栈
- 此时则需要分析Core dump了，主要前提是要打开ulimit -c

Crash分析之Core dump

通过上面的分析，打开ulimit -c后就会生成Core dump
- 6.9G Nov 2 21:58 core.26972
- 注：因Coredump文件较大，所以通常在进程Crash的写Coredump时cpu load和iowait都较高

分析Core dump的一个工具是gdb，所以必须安装gdb

$ gdb --version
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
复制

分析Java进程crash生成的coredump

用gdb打开

$ gdb /landon/lib/jdk/bin/java core.26972

输入bt或bt full命令查看backtrace

从这里就可以看到一些实际的C堆栈
如#15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71

(gdb) bt
#0  0x00007f2169c704f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f2169c71cd5 in abort () at abort.c:92
#2  0x00007f2169560799 in os::abort(bool) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
#3  0x00007f2169725733 in VMError::report_and_die() ()
   from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
#4  0x00007f216956aa45 in JVM_handle_linux_signal ()
   from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
#5  0x00007f216955d8e8 in signalHandler(int, siginfo*, void*) ()
   from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007f21380cbb8d in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x000000005d5fdfa8 in ?? ()
#10 0x000000005d696b20 in ?? ()
#11 0x00007f21149edd03 in lj_cont_ra () from /landon/lib/libluajit-5.1.so.2
#12 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
#13 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
#14 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
#15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
#16 0x00007f2155595eaa in ?? ()
#17 0x0000000669220970 in ?? ()
#18 0x000000075f554730 in ?? ()
#19 0x0000000000000000 in ?? ()
复制

btfull命令可以看到详细的信息包括调用参数等

此时结合源代码如 lj_state.c:71 + 参数分析原因

(gdb) bt full
#0  0x00007f2169c704f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
        resultvar = 0
        pid = <value optimized out>
        selftid = <value optimized out>
#1  0x00007f2169c71cd5 in abort () at abort.c:92
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7f20b70ede80, sa_sigaction = 0x7f20b70ede80}, sa_mask = {__val = {
              139779442242560, 0, 139781486046377, 0, 139781485559377, 139781480345909, 5872491825372126914, 1925, 335544324, 
              139781480145399, 10, 139781480027928, 55, 1, 0, 0}}, sa_flags = 0, sa_restorer = 0x260}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f2169560799 in os::abort(bool) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
No symbol table info available.
#3  0x00007f2169725733 in VMError::report_and_die() ()
   from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
No symbol table info available.
#4  0x00007f216956aa45 in JVM_handle_linux_signal ()
   from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
No symbol table info available.
#5  0x00007f216955d8e8 in signalHandler(int, siginfo*, void*) ()
   from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
No symbol table info available.
#6  <signal handler called>
No symbol table info available.
#7  0x00007f21380cbb8d in ?? ()
No symbol table info available.
#8  0x0000000000000001 in ?? ()
No symbol table info available.
#9  0x000000005d5fdfa8 in ?? ()
No symbol table info available.
#10 0x000000005d696b20 in ?? ()
No symbol table info available.
#11 0x00007f21149edd03 in lj_cont_ra () from /landon/lib/libluajit-5.1.so.2
No symbol table info available.
#12 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
No symbol table info available.
---Type <return> to continue, or q <return> to quit---
#13 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
No symbol table info available.
#14 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
No symbol table info available.
#15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
        st = 0x7f2114c63f66
        oldst = 0x7f20b70ef370
        delta = 0
        oldsize = 3071210352
        realsize = 32544
        up = 0x0
#16 0x00007f2155595eaa in ?? ()
No symbol table info available.
#17 0x0000000669220970 in ?? ()
No symbol table info available.
#18 0x000000075f554730 in ?? ()
No symbol table info available.
#19 0x0000000000000000 in ?? ()
No symbol table info available.
复制

按q退出
gdb还有许多其他命令，这里就不一一细说

使用jvm原生命令分析coredump

使用jstack分析

$ jstack /landon/lib/jdk/bin/java core.30298 >> core.30298.jstack.log

从这里可以看到所有的线程堆栈，而这个是hs_error没有的，hs_error只能看到当前线程堆栈和所有的线程
Deadlock Detection:
No deadlocks found.

Thread 30402: (state = IN_NATIVE)
 - com.naef.jnlua.LuaState.lua_pcall(int, int) @bci=0 (Compiled frame; information may be imprecise)
 - com.naef.jnlua.LuaState.call(int, int) @bci=7, line=708 (Compiled frame)
 - com.landon30.JLua.call(java.lang.String, java.lang.Object[]) @bci=53, line=46 (Compiled frame)
 - com.landon30.jlua.LuaCallManager.callLua(java.util.Set, java.lang.String, java.lang.Object[]) @bci=31, line=104 (Compiled frame)
 - com.landon30.jlua.LuaCallManager.callLua(java.lang.String, java.lang.String, java.lang.String, java.lang.Object[]) @bci=12, line=192 (Compiled frame)
 
.......
复制
使用jmap分析
$ jmap /landon/lib/jdk/bin/java core.30298
...
0x00007f274a5eb000      55K     /landon/lib/libjnlua5.1.so
0x00007f274a374000      474K    /lanodn/lib/libluajit-5.1.so.2
...
复制
$ jmap -dump:live,format=b,file=30298.hprof /landon/lib/jdk/bin/java core.30298

即将coredump文件转为hprof，这样就可以用如mat工具等分析内存相关了

此执行时间较长，线上慎用
注：其实jstack和jmap的help文档都有说明分析coredump
$ jstack --help
jstack [-m] [-l] <executable> <core>
        (to connect to a core file)
        
$ jmap --help
jmap [option] <executable <core>
        (to connect to a core file)
复制

关于调试信息

从gdb分析堆栈时，经常看到一些这样的信息 'No symbol table info available.'
- 没有可用符号表
- 其实简单来说就是调试信息，没有调试信息就无法看到详细的堆栈
  - 这里延伸到Java，javac编译的时候
```
-g:none                    Generate no debugging info
-g:{lines,vars,source}     Generate only some debugging info
复制
```

如何看一个so库是否有调试信息呢

使用nm命令：List symbols in [file(s)]

// 无符号表的一个so
$ nm -a libjnlua5.1.so_no_debug
nm: libjnlua5.1.so_no_debug: no symbols

$ nm -a libjnlua5.1.so
000000000020f818 b .bss
0000000000000000 n .comment
......

// 查看行号
$ nm -l libjnlua5.1.so | head
0000000000009a50 T JNI_OnLoad   /laodon/lib/JNLuaJIT-master/src/main/c/Linux/../jnlua.c:1911
复制

直接使用gdb命令即可

$ gdb libjnlua5.1.so

// 如果有符号表，会直接读取成功，否则
Reading symbols from /landon/lib/libjnlua5.1.so...done.

// 没有则直接提示 no debugging symbols found
Reading symbols from /landon/lib/libjnlua5.1.so_no_debug...(no debugging symbols found)...done.
复制

还有一个很重要的是要查看文件是否被stripped，如果被stripped，则一定没有调试信息

使用file命令查看即可

// 这个显示stripped，则一定没有调试信息
$ file libjnlua5.1.so_no_debug 
libjnlua5.1.so_no_debug: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped

// 显示not stripped
$ file libjnlua5.1.so
libjnlua5.1.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped
复制

如何生成调试信息：以三个例子说明
- LuaJIT-2.1.0-beta2
  - $ make CCDEBUG=-g CFLAGS=-O0
  - 注：要指定-O0，否则gdb的bt full会显示'value optimized out''
- JNLuaJIT
  - 进入src/main/c/Linux目录，编辑Makefile
  - 之前的LDFLAGS是用-s选项，而-s就是之前的说的stripped，将其替换为-g
  - 另外CFLAGS也增加-g，可以选择的增加-O0
  - 重新make编译
- lua cjson
  - 编辑Makefile
  - 打开调试信息的CFLAGS -g -Wall -pedantic -fno-inline
- 重新编译后，都可以用nm命令和gdb验证是否生成符号信息
gdb提示Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libgcc-4.4.7-23.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64
- 按照提示直接安装debuginfo即可
- 安装后，gcc等底层库的调试信息也都有了
```
Reading symbols from /lib64/libgcc_s.so.1...Reading symbols from /usr/lib/debug/lib64/libgcc_s-4.4.7-20120601.so.1.debug...done.
done.
复制
```

battleserver crash的原因

同一个luastate多线程使用时crash
某些Lua代码在luajit模式执行下会变慢，也会crash
- 所以在服务器执行battle复盘的时候可以尝试关闭luajit模式，只有luajit的解释模式