[Repost] Core Dump Analysis in Practice: Java Edition


https://www.jianshu.com/p/2cdf71f99209

 

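Core Dump Analysis in Practice (Java Edition)

Background

  1. In our project, the battleserver process kept crashing over a period of time, and we could not pin down the cause
    • The Java code calls LuaJIT through JNI
  2. When such a process crashes, how do we get the JNI-side (C-level) stack so that the underlying problem can be identified?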

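Crash analysis: hs_error

  1. When a Java process crashes, the JVM usually writes a fatal error log. Its default name is hs_err_pid%p.log; this article calls it the hs_error file

    • %p expands to the pid of the process
    • hs_err stands for HotSpot JVM error
  2. The location of the file can be set with the JVM option '-XX:ErrorFile', for example

    • -XX:ErrorFile=/landon/business/battle/hs_error%p.log
  3. A typical hs_error file looks like this

    • Part 1: the file header

      • Crash reason: SIGSEGV (0xb)
      • Problematic frame: C, meaning the crash happened in a C-level (native code) frame
      • The core dump was written. If it had not been, the log would say
        • Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
        • To get core dumps, ulimit -c must be raised (for example to unlimited) before the JVM starts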
      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGSEGV (0xb) at pc=0x00007f838c76f1ae, pid=8257, tid=0x00007f838b5b6700
      #
      # JRE version: Java(TM) SE Runtime Environment (8.0_221-b11) (build 1.8.0_221-b11)
      # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.221-b11 mixed mode linux-amd64 compressed oops)
      # Problematic frame:
      # C  [libluajit-5.1.so.2+0xc1ae]
      #
      # Core dump written. Default location: /landon/battle_server/online/core or core.8257
      
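    • Part 2: information about the crashing thread

      • The faulting thread here is SOFA-SEV-BOLT-BIZ-12200-3-T11
      • _thread_in_native means the thread was executing native code
      • The middle part is the register context
        • The register-to-memory mapping shows what each register points at, including addresses inside libjvm.so and libjnlua5.1.so
        • RSP, the stack pointer, points into the current thread's stack
      • The last part is the stack
        • It lists the Native frames and the Java frames
        • Note that the native call chain is not visible here: only the single libluajit frame is listed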
      ---------------  T H R E A D  ---------------
      
      Current thread (0x00007f83a001e000):  JavaThread "SOFA-SEV-BOLT-BIZ-12200-3-T11" [_thread_in_native, id=27748, stack(0x00007f838b4b6000,0x00007f838b5b7000)]
      
      siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000004
      
      Registers:
      Top of Stack: (sp=0x00007f838b5b4f70)
      Instructions
      
      Register to memory mapping:
      
      RAX=0x0000000000000000 is an unknown value
      RBX=0x0000000040ffb9c0 is an unknown value
      RCX=0x0000000000000000 is an unknown value
      RDX=0x00000000489f80d0 is an unknown value
      RSP=0x00007f838b5b4f70 is pointing into the stack for thread: 0x00007f83a001e000
      RBP=0x00000000407c23b8 is an unknown value
      RSI=0x0000000041cb0e88 is an unknown value
      RDI=0x00000000407c23b8 is an unknown value
      R8 =0x00000007bfc84540 is an oop
      com.naef.jnlua.LuaState 
       - klass: 'com/naef/jnlua/LuaState'
      R9 =0x00000007bfc84540 is an oop
      com.naef.jnlua.LuaState 
       - klass: 'com/naef/jnlua/LuaState'
      R10=0x00000000000006ba is an unknown value
      R11=0x00007f843da2cf3c: <offset 0xfccf3c> in /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so at 0x00007f843ca60000
      R12=0x0000000000000000 is an unknown value
      R13=0x00007f838b5b50d0 is pointing into the stack for thread: 0x00007f83a001e000
      R14=0x00000000407c2fa8 is an unknown value
      R15=0x00007f838c9e43e0: <offset 0xa3e0> in /landon/lib/libjnlua5.1.so at 0x00007f838c9da000
      
      
      Stack: [0x00007f838b4b6000,0x00007f838b5b7000],  sp=0x00007f838b5b4f70,  free space=1019k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      C  [libluajit-5.1.so.2+0xc1ae]
      
      Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
      J 4447  com.naef.jnlua.LuaState.lua_gc(II)I (0 bytes) @ 0x00007f84291dd278 [0x00007f84291dd240+0x38]
      J 5983 C2 com.landon30.jlua.pool.JLua.call(Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object; (206 bytes) @ 0x00007f842946e01c [0x00007f842946dbe0+0x43c]
      J 6040 C1 com.landon30.jlua.LuaCallManager.callLua(Ljava/util/Set;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object; (474 bytes) @ 0x00007f842965fd1c [0x00007f842965f4e0+0x83c]
      
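    • Parts 3 and 4: process and system information at the time of the crash

      • All Java threads, with the current thread marked
      • Java heap information and so on
      • System information at the end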
      ---------------  P R O C E S S  ---------------
      
      Java Threads: ( => current thread )
      =>0x00007f83a001e000 JavaThread "SOFA-SEV-BOLT-BIZ-12200-3-T11" [_thread_in_native, id=27748, stack(0x00007f838b4b6000,0x00007f838b5b7000)]
      
      VM state:not at safepoint (normal execution)
      VM Mutex/Monitor currently owned by a thread: None
      
      Heap:
      Card table byte_map:
      Marking Bits:
      Polling page:
      CodeCache:
      
      Compilation events:
      GC Heap History:
      Deoptimization events:
      Classes redefined:
      Internal exceptions:
      Events:
      
      Dynamic libraries:
      
      VM Arguments:
      Environment Variables:
      Signal Handlers:
      
      ---------------  S Y S T E M  ---------------
      
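  4. hs_error summary

    • A crash at the Java level can usually be located from hs_error alone, because it contains the stack
    • A crash at the C level, however, leaves no usable native stack in it
    • In that case the core dump has to be analyzed, which requires that ulimit -c was enabled beforehand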
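
The header checks described above can be scripted as a first triage step. The following is a minimal sketch, assuming the log keeps the header layout shown in the sample; `hs_err_summary` is a hypothetical helper name, not part of the JDK.

```shell
#!/usr/bin/env bash
# Sketch: summarize the header of a HotSpot fatal error log.
# hs_err_summary is a hypothetical helper name, not a JDK tool.
hs_err_summary() {
  local log="$1" signal pid frame core native

  # Signal line, e.g. "#  SIGSEGV (0xb) at pc=..., pid=8257, tid=..."
  signal=$(grep -m1 -Eo 'SIG[A-Z]+ \(0x[0-9a-f]+\)' "$log")
  pid=$(grep -m1 -Eo 'pid=[0-9]+' "$log" | cut -d= -f2)

  # The frame itself is on the line after "# Problematic frame:".
  frame=$(awk '/^# Problematic frame:/ { getline; sub(/^# */, ""); print; exit }' "$log")

  # Was a core file produced?
  if grep -q '^# Core dump written' "$log"; then
    core=written
  elif grep -q 'Failed to write core dump' "$log"; then
    core=missing
  else
    core=unknown
  fi

  # A leading "C" marks a native frame, so the Java stack alone will not explain it.
  case "$frame" in
    "C "*) native=yes ;;
    *)     native=no ;;
  esac

  printf 'signal=%s\npid=%s\nframe=%s\ncore=%s\nnative=%s\n' \
    "$signal" "$pid" "$frame" "$core" "$native"
}
```

Calling `hs_err_summary /path/to/hs_error8257.log` (path is illustrative) prints one key=value pair per line. When it reports `native=yes` and `core=written`, the next step is the core dump analysis below.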

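Crash analysis: core dump

  1. As described above, once ulimit -c is enabled a core dump is generated on crash

    • 6.9G Nov 2 21:58 core.26972
    • Note: core dump files are large, so CPU load and iowait are usually high while the crashing process writes one
  2. gdb is one tool for analyzing a core dump, so it must be installed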

    $ gdb --version
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
    
  3. Analyzing the core dump produced by the Java process crash

    1. Open it with gdb, then enter bt or bt full to view the backtrace
    • $ gdb /landon/lib/jdk/bin/java core.26972

      • This shows some of the actual C stack
      • For example #15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
      (gdb) bt
      #0  0x00007f2169c704f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
      #1  0x00007f2169c71cd5 in abort () at abort.c:92
      #2  0x00007f2169560799 in os::abort(bool) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      #3  0x00007f2169725733 in VMError::report_and_die() ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      #4  0x00007f216956aa45 in JVM_handle_linux_signal ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      #5  0x00007f216955d8e8 in signalHandler(int, siginfo*, void*) ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      #6  <signal handler called>
      #7  0x00007f21380cbb8d in ?? ()
      #8  0x0000000000000001 in ?? ()
      #9  0x000000005d5fdfa8 in ?? ()
      #10 0x000000005d696b20 in ?? ()
      #11 0x00007f21149edd03 in lj_cont_ra () from /landon/lib/libluajit-5.1.so.2
      #12 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      #13 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      #14 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      #15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
      #16 0x00007f2155595eaa in ?? ()
      #17 0x0000000669220970 in ?? ()
      #18 0x000000075f554730 in ?? ()
      #19 0x0000000000000000 in ?? ()
      
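    2. bt full shows more detail, including call arguments and local variables

      • Combine this with the source code, e.g. lj_state.c:71, and the argument values to work out the cause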
      (gdb) bt full
      #0  0x00007f2169c704f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
              resultvar = 0
              pid = <value optimized out>
              selftid = <value optimized out>
      #1  0x00007f2169c71cd5 in abort () at abort.c:92
              save_stage = 2
              act = {__sigaction_handler = {sa_handler = 0x7f20b70ede80, sa_sigaction = 0x7f20b70ede80}, sa_mask = {__val = {
                    139779442242560, 0, 139781486046377, 0, 139781485559377, 139781480345909, 5872491825372126914, 1925, 335544324, 
                    139781480145399, 10, 139781480027928, 55, 1, 0, 0}}, sa_flags = 0, sa_restorer = 0x260}
              sigs = {__val = {32, 0 <repeats 15 times>}}
      #2  0x00007f2169560799 in os::abort(bool) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      No symbol table info available.
      #3  0x00007f2169725733 in VMError::report_and_die() ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      No symbol table info available.
      #4  0x00007f216956aa45 in JVM_handle_linux_signal ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      No symbol table info available.
      #5  0x00007f216955d8e8 in signalHandler(int, siginfo*, void*) ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      No symbol table info available.
      #6  <signal handler called>
      No symbol table info available.
      #7  0x00007f21380cbb8d in ?? ()
      No symbol table info available.
      #8  0x0000000000000001 in ?? ()
      No symbol table info available.
      #9  0x000000005d5fdfa8 in ?? ()
      No symbol table info available.
      #10 0x000000005d696b20 in ?? ()
      No symbol table info available.
      #11 0x00007f21149edd03 in lj_cont_ra () from /landon/lib/libluajit-5.1.so.2
      No symbol table info available.
      #12 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      No symbol table info available.
      ---Type <return> to continue, or q <return> to quit---
      #13 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      No symbol table info available.
      #14 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      No symbol table info available.
      #15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
              st = 0x7f2114c63f66
              oldst = 0x7f20b70ef370
              delta = 0
              oldsize = 3071210352
              realsize = 32544
              up = 0x0
      #16 0x00007f2155595eaa in ?? ()
      No symbol table info available.
      #17 0x0000000669220970 in ?? ()
      No symbol table info available.
      #18 0x000000075f554730 in ?? ()
      No symbol table info available.
      #19 0x0000000000000000 in ?? ()
      No symbol table info available.
      
    3. Press q to quit

    4. gdb has many other commands, which are not covered here
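
The interactive session above can also be run non-interactively, which is handy when the core file sits on a production host. This is a sketch under the assumption that the installed gdb supports the `-batch` and `-ex` options; `dump_backtraces` and `frames_with_source` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: capture gdb backtraces from a core file without an interactive session.
# dump_backtraces and frames_with_source are hypothetical helper names.

# Run gdb in batch mode and write "bt full" plus every thread's backtrace to a log.
dump_backtraces() {
  local exe="$1" core="$2" out="${3:-$2.bt.log}"
  gdb -batch -ex 'bt full' -ex 'thread apply all bt' "$exe" "$core" > "$out" 2>&1
}

# Keep only the frames gdb could map to a source file and line ("... at file.c:123").
# These are the frames built with debug info, e.g. resizestack at lj_state.c:71.
frames_with_source() {
  grep -E '^#[0-9]+ .* at [^ ]+:[0-9]+$' "$1"
}
```

For example, `dump_backtraces /landon/lib/jdk/bin/java core.26972` followed by `frames_with_source core.26972.bt.log` narrows a long log down to the frames worth reading next to the source code.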

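  4. Analyzing the core dump with the JDK's own tools

    1. Using jstack

      • $ jstack /landon/lib/jdk/bin/java core.30298 >> core.30298.jstack.log
      • This shows the stack of every thread, which hs_error does not provide: hs_error has only the current thread's stack plus a list of all threads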
      Deadlock Detection:
      No deadlocks found.
      
      Thread 30402: (state = IN_NATIVE)
       - com.naef.jnlua.LuaState.lua_pcall(int, int) 
    2. Using jmap

      • $ jmap /landon/lib/jdk/bin/java core.30298

        ...
        0x00007f274a5eb000      55K     /landon/lib/libjnlua5.1.so
        0x00007f274a374000      474K    /landon/lib/libluajit-5.1.so.2
        ...
        
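      • $ jmap -dump:live,format=b,file=30298.hprof /landon/lib/jdk/bin/java core.30298

        • This converts the core dump into an hprof file, which can then be examined for memory issues with tools such as MAT
        • It takes a long time to run, so use it with caution in production
    3. Note: the help output of both jstack and jmap documents the core file mode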

      $ jstack --help
      jstack [-m] [-l] <executable> <core>
              (to connect to a core file)
              
      $ jmap --help
      jmap [option] <executable <core>
              (to connect to a core file)
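
A jstack log taken from a core file can run to thousands of lines, so a quick per-state count helps to spot the threads that were inside JNI. The sketch below assumes the log uses the `Thread <id>: (state = ...)` header lines shown in the jstack sample above; `thread_states` and `native_threads` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: summarize a jstack log that was produced from a core file.
# thread_states and native_threads are hypothetical helper names.

# Print "STATE=count" for every thread state found in the log.
thread_states() {
  grep -Eo 'state = [A-Z_]+' "$1" | sort | uniq -c | awk '{ print $4 "=" $1 }'
}

# Print the IDs of threads that were executing native code (IN_NATIVE).
native_threads() {
  sed -n 's/^Thread \([0-9][0-9]*\): (state = IN_NATIVE)[[:space:]]*$/\1/p' "$1"
}
```

Running `native_threads core.30298.jstack.log` lists the thread IDs to look up in the gdb output, since those are the threads that may have been inside LuaJIT at the time of the crash.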
      

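About debugging information

  1. When analyzing stacks in gdb you often see messages like 'No symbol table info available.'

    • That is, no symbol table is available

    • Put simply, this is about debugging information: without it, the stack cannot be shown in detail

      • The same idea exists in Java, as javac options at compile time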

        -g:none                    Generate no debugging info
        -g:{lines,vars,source}     Generate only some debugging info
        
  2. How to tell whether a .so library has debugging information

    • Use the nm command: List symbols in [file(s)]

      // a .so without a symbol table
      $ nm -a libjnlua5.1.so_no_debug
      nm: libjnlua5.1.so_no_debug: no symbols
      
      $ nm -a libjnlua5.1.so
      000000000020f818 b .bss
      0000000000000000 n .comment
      ......
      
      // show line numbers
      $ nm -l libjnlua5.1.so | head
      0000000000009a50 T JNI_OnLoad   /landon/lib/JNLuaJIT-master/src/main/c/Linux/../jnlua.c:1911
      
    • Or simply open the library with gdb

      $ gdb libjnlua5.1.so
      
      // with symbols, they are read successfully
      Reading symbols from /landon/lib/libjnlua5.1.so...done.
      
      // without them, gdb reports no debugging symbols found
      Reading symbols from /landon/lib/libjnlua5.1.so_no_debug...(no debugging symbols found)...done.
      
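    • It is also important to check whether the file has been stripped; a stripped file has no debugging information

      • Use the file command to check
      // this one shows stripped, so it has no debugging information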
      $ file libjnlua5.1.so_no_debug 
      libjnlua5.1.so_no_debug: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped
      
      // this one shows not stripped
      $ file libjnlua5.1.so
      libjnlua5.1.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped
      
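  3. How to generate debugging information: three examples

    • LuaJIT-2.1.0-beta2
      • $ make CCDEBUG=-g CFLAGS=-O0
      • Note: specify -O0, otherwise bt full in gdb shows 'value optimized out'
    • JNLuaJIT
      • Go to src/main/c/Linux and edit the Makefile
      • LDFLAGS previously used -s, which is exactly the stripping described above; replace it with -g
      • Also add -g to CFLAGS, and optionally -O0
      • Rebuild with make
    • lua cjson
      • Edit the Makefile
      • CFLAGS with debugging information enabled: -g -Wall -pedantic -fno-inline
    • After rebuilding, use nm and gdb to verify that symbol information was generated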
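  4. gdb prints Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libgcc-4.4.7-23.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64

    • Install the debuginfo packages as the message suggests

    • After that, debugging information for low-level libraries such as libgcc is available too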

      Reading symbols from /lib64/libgcc_s.so.1...Reading symbols from /usr/lib/debug/lib64/libgcc_s-4.4.7-20120601.so.1.debug...done.
      done.
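
Before reopening the core in gdb, it can save time to check every native library in one pass. The sketch below classifies the text that the `file` command prints; `strip_state` and `check_libs` are hypothetical helper names, and the `with debug_info` case is an assumption about newer versions of `file`, which the samples in this article (from an older system) do not show. Keep in mind that "not stripped" only means the symbol table is present; source-level detail still requires building with -g.

```shell
#!/usr/bin/env bash
# Sketch: report whether shared objects still carry symbols.
# strip_state and check_libs are hypothetical helper names.

# Classify one line of `file` output. Order matters: "not stripped"
# contains the word "stripped", so the more specific patterns come first.
strip_state() {
  case "$1" in
    *"with debug_info"*) echo "not stripped, with debug_info" ;;  # assumed newer `file` wording
    *"not stripped"*)    echo "not stripped" ;;
    *stripped*)          echo "stripped" ;;
    *)                   echo "unknown" ;;
  esac
}

# Print "<library>: <state>" for each path given on the command line.
check_libs() {
  local lib
  for lib in "$@"; do
    printf '%s: %s\n' "$lib" "$(strip_state "$(file -b "$lib")")"
  done
}
```

For example, `check_libs /landon/lib/libjnlua5.1.so /landon/lib/libluajit-5.1.so.2` shows at a glance which libraries need to be rebuilt as described in step 3.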
      

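Causes of the battleserver crash

  1. The same LuaState was used from multiple threads, which caused crashes
  2. Some Lua code runs slower under LuaJIT's JIT mode and can also crash
    • So when the server replays battles, try turning the JIT off and using only LuaJIT's interpreter mode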
