1. The problem: the application runs on the hi3536 for an unpredictable length of time and then restarts. The console prints the following:
app-run invoked oom-killer: gfp_mask=0x1042d0, order=3, oom_score_adj=0
CPU: 0 PID: 1299 Comm: ckdecoder Tainted: P O 3.10.0_hi3536 #2
[<c0019d30>] (unwind_backtrace+0x0/0xf4) from [<c0016de4>] (show_stack+0x10/0x14)
[<c0016de4>] (show_stack+0x10/0x14) from [<c051ea44>] (dump_header.isra.10+0x7c/0x194)
[<c051ea44>] (dump_header.isra.10+0x7c/0x194) from [<c0091dec>] (oom_kill_process+0x278/0x3e8)
[<c0091dec>] (oom_kill_process+0x278/0x3e8) from [<c00923e0>] (out_of_memory+0x28c/0x2b0)
[<c00923e0>] (out_of_memory+0x28c/0x2b0) from [<c0095590>] (__alloc_pages_nodemask+0x690/0x6a8)
[<c0095590>] (__alloc_pages_nodemask+0x690/0x6a8) from [<c00955b8>] (__get_free_pages+0x10/0x24)
[<c00955b8>] (__get_free_pages+0x10/0x24) from [<c00dd784>] (seq_buf_alloc+0x10/0x34)
[<c00dd784>] (seq_buf_alloc+0x10/0x34) from [<c00dd914>] (traverse+0x16c/0x1e8)
[<c00dd914>] (traverse+0x16c/0x1e8) from [<c00dda14>] (seq_lseek+0x84/0x110)
[<c00dda14>] (seq_lseek+0x84/0x110) from [<c010ddcc>] (proc_reg_llseek+0x68/0xa0)
[<c010ddcc>] (proc_reg_llseek+0x68/0xa0) from [<c00bea74>] (SyS_lseek+0x60/0x84)
[<c00bea74>] (SyS_lseek+0x60/0x84) from [<c0012f80>] (ret_fast_syscall+0x0/0x30)
Mem-info:
Normal per-cpu:
CPU 0: hi: 42, btch: 7 usd: 0
CPU 1: hi: 42, btch: 7 usd: 40
CPU 2: hi: 42, btch: 7 usd: 6
CPU 3: hi: 42, btch: 7 usd: 0
active_anon:9622 inactive_anon:0 isolated_anon:0 (remaining output omitted ...)
2. Initial investigation
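2.1. When debugging with gdb, no stack trace was available at the moment the error occurred.
2.2. While the application was running, free -m showed free memory falling steadily. I reviewed the malloc-related code, and every allocation had a matching free.
2.3. Judging from the symptoms, the OOM error occurred more readily with many decoding channels and less readily with few. I initially suspected the decoding code, but reviewing it turned up no obvious problem.
2.4. I was about to use memleak to look for memory leaks, but in the end never did.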
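2.5. I decided to simplify the code by removing some threads (management thread, network thread, serial-communication thread) to help narrow things down. This pinned the leak on the serial thread. Looking at the code related to that thread, I found that the node used to query decoder status was being opened in a loop and never closed.
I changed the code to close the node after every open and verified again: free -m no longer showed free memory falling steadily.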
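The fix described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the helper name read_status_once is hypothetical, and the path of the status node is passed in by the caller because the post does not give it. The point it shows is that every successful open is followed by a close, whether or not the read succeeds.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: read the decoder status node once.
 * The real node path is not given in this post, so the caller supplies it.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_status_once(const char *path, char *buf, size_t len)
{
    if (len == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* Leave room for the terminating NUL. */
    ssize_t n = read(fd, buf, len - 1);
    if (n >= 0)
        buf[n] = '\0';
    else
        perror("read");

    /* The fix: close the descriptor on every path, even if read() failed. */
    close(fd);
    return n;
}
```

The polling loop in the serial thread would then call this helper once per query instead of calling open directly, so no descriptor outlives a single query.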
3. Lessons learned
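1. An oom-killer error is not necessarily caused by malloc'd memory that was never freed. Here the leak came from a node that was opened repeatedly and never closed.
2. Surprisingly, gdb can also show no stack trace for this kind of error, presumably because memory had been exhausted.
3. Build good habits: always pair malloc with free, open with close, and fopen with fclose.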
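One way to catch an unpaired open earlier is to watch the process's descriptor count. This was not part of the original investigation; it is a sketch that assumes a Linux system with /proc mounted, and both function names (count_open_fds, report_fd_count) are made up for illustration. A count that keeps climbing while the application is in a steady state points to a missing close or fclose.

```c
#include <dirent.h>
#include <stdio.h>

/* Count this process's open file descriptors by listing /proc/self/fd.
 * Linux only. Returns -1 if the directory cannot be opened. */
static int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] != '.')   /* skip "." and ".." */
            count++;
    }
    closedir(dir);

    /* The directory stream itself held one descriptor while we counted. */
    return count - 1;
}

/* Print the current count to stderr with a caller-chosen tag. */
static void report_fd_count(const char *tag)
{
    fprintf(stderr, "[%s] open fds: %d\n", tag, count_open_fds());
}
```

Calling report_fd_count periodically from a suspect thread, for example once per loop iteration during a debug run, makes a descriptor leak visible long before free memory runs out.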