Diagnosing the Cause of a Segmentation Fault Crash with backtrace

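A daemon process on Linux crashed with a segmentation fault only very rarely, roughly once every six months. For a problem this hard to reproduce, it was fortunate that the process already contained the following code, which dumps the stack when it crashes: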

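#include <execinfo.h>   // backtrace, backtrace_symbols_fd
#include <signal.h>     // signal, SIGSEGV
#include <stdio.h>      // fprintf
#include <unistd.h>     // STDERR_FILENO, _exit

void handler(int sig) {
  void *array[10];
  int size;

  // get void*'s for all entries on the stack
  size = backtrace(array, 10);

  // fprintf is not async-signal-safe; tolerated here only because the process is about to die
  fprintf(stderr, "Error: signal %d:\n", sig);
  backtrace_symbols_fd(array, size, STDERR_FILENO);
  _exit(1);  // unlike exit(), _exit() runs no atexit handlers and is async-signal-safe
}

// constructor: runs automatically when the shared library is loaded
static __attribute__((constructor)) void init() {
    signal(SIGSEGV, handler);
}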

When the crash finally occurred, the daemon's original handler wrote this to stderr:

Error: signal 11:
/app/node_modules/libmq.so(handler+0x23)[0x7f2451189d83]
/lib64/libc.so.6(+0x326a0)[0x7f24516526a0]
/lib64/libc.so.6(+0x13357f)[0x7f245175357f]
/app/node_modules/libmq.so(+0x2d7f2)[0x7f245118d7f2] <<-SIGSEGV
/app/node_modules/libmq.so(s_thread_shim+0x1f)[0x7f24511e22df]
/lib64/libpthread.so.0(+0x79d1)[0x7f24519bb9d1]
/lib64/libc.so.6(clone+0x6d)[0x7f24517088fd]

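The SIGSEGV traces back to offset +0x2d7f2 in libmq.so. The two libc frames above it appear to be the signal trampoline (+0x326a0) and the faulting instruction itself (+0x13357f), which makes 0x2d7f2 the return address of the libmq.so call that crashed.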
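
The `(+0x2d7f2)` part of a frame is the address relative to where the module was loaded: the absolute address in brackets minus the load base. As a rough illustration, the sketch below parses a line in the format printed by `backtrace_symbols_fd` and recovers that base. `struct frame`, `parse_frame`, and `frame_load_base` are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one parsed backtrace line. */
struct frame {
    char module[256];          /* path of the shared object */
    unsigned long long offset; /* offset inside the module  */
    unsigned long long addr;   /* absolute runtime address  */
};

/* Parse "module(+0xOFFSET)[0xADDR]". Returns 0 on success, -1 otherwise.
 * Lines of the form "module(symbol+0xN)[0xADDR]" are rejected on purpose,
 * because there the offset is relative to the symbol, not to the module. */
int parse_frame(const char *line, struct frame *out) {
    assert(line != NULL && out != NULL);
    int fields = sscanf(line, "%255[^(](+%llx)[%llx]",
                        out->module, &out->offset, &out->addr);
    return fields == 3 ? 0 : -1;
}

/* Load base of the module = absolute address - offset within the module. */
unsigned long long frame_load_base(const struct frame *f) {
    assert(f != NULL);
    return f->addr - f->offset;
}
```

For the crashing frame above, `parse_frame` yields offset 0x2d7f2 and `frame_load_base` yields 0x7f2451160000. The offset is the value to look up in a disassembler, or to pass to a tool such as `addr2line -e libmq.so 0x2d7f2` when the library carries debug information.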

Next, open libmq.so in IDA Pro and go to offset 2D7F2:

.text:000000000002D7E3                 mov     rdi, r14        ; s
.text:000000000002D7E6                 mov     [rcx+30h], eax
.text:000000000002D7E9                 mov     [rsp+90h+var_90], rcx
.text:000000000002D7ED                 call    _strlen
.text:000000000002D7F2                 lea     rsi, [rax+1]    ; size  // <<---- SIGSEGV
.text:000000000002D7F6                 mov     edi, 1          ; nmemb
.text:000000000002D7FB                 mov     r12, rax
.text:000000000002D7FE                 call    _calloc
.text:000000000002D803                 test    rax, rax
.text:000000000002D806                 mov     rcx, [rsp+90h+var_90]
.text:000000000002D80A                 jz      loc_2D92A
.text:000000000002D810                 mov     [rcx+8], rax

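The strlen call followed by an allocation finally pinned the crash down in the C source (zmalloc presumably ends up as the calloc call seen in the disassembly). It was an embarrassingly basic mistake: a missing null-pointer check.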

size_t len = strlen(srv->channel); // <-- SIGSEGV
item->item.data.channel = zmalloc(len + 1);
if( item->item.data.channel ){
    strncpy( item->item.data.channel, srv->channel, len);
    item->item.data.channel[len] = 0;
}
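
A minimal sketch of one way to guard the copy, assuming the channel pointer can legitimately be NULL at this point. The names `struct queue_item` and `set_channel` are hypothetical stand-ins for the project's real types, and plain `calloc` is used in place of the project's `zmalloc`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the real item structure. */
struct queue_item {
    char *channel;
};

/* Copy `channel` into `item`. Returns 0 on success, -1 if `channel` is
 * NULL or the allocation fails; item->channel is NULL in both cases. */
int set_channel(struct queue_item *item, const char *channel) {
    assert(item != NULL);   /* programmer error, not a runtime condition */
    item->channel = NULL;

    if (channel == NULL)    /* the check that was missing */
        return -1;

    size_t len = strlen(channel);
    char *copy = calloc(1, len + 1);
    if (copy == NULL)
        return -1;

    memcpy(copy, channel, len); /* calloc already zeroed the terminator */
    item->channel = copy;
    return 0;
}
```

With a guard like this, `set_channel(&item, NULL)` returns -1 instead of handing a null pointer to `strlen`, and the caller decides how to handle the missing channel.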