[c/c++] tcmalloc holds a large amount of free memory without releasing it
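I. Symptom
1. An online service (using tcmalloc as its memory allocator) shows sudden jumps in memory usage (on the order of gigabytes), and the memory is not released for a long time.
II. Locating the problem
1. Determine whether the memory is leaked or is free memory held by tcmalloc (in my case it was free memory, not a leak)
1) You can add code that prints tcmalloc's memory allocation statistics.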
char* stats_buffer = new char[4096];
MallocExtension::instance()->GetStats(stats_buffer, 4096);
std::cout << stats_buffer << std::endl;
delete[] stats_buffer;
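2) If you have the tcmalloc source code, first take the affected service out of rotation (do not kill the process; just stop routing normal traffic to it, so that debugging with gdb does not affect real users). Then use gdb to print the process's current memory allocation state.
# attach gdb to the running process
gdb attach pid
# load the source code
dir /tmp/tcmalloc/src
# list the functions matching GetStats; TCMallocImplementation::GetStats should be among them
info functions GetStats
# set a breakpoint on TCMallocImplementation::GetStats
b TCMallocImplementation::GetStats
# call TCMallocImplementation::GetStats manually
# set an arbitrary string variable to pass as the argument
set $buffer="123"
print (void)MallocExtension::instance()->GetStats($buffer, 3)
# execution then stops at the breakpoint in TCMallocImplementation::GetStats;
# when you reach the DumpStats call, type s to step into it;
# when you reach the ExtractStats call, type n to step over it, then run print stats
# to see tcmalloc's statistics, for example: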
$1 = {thread_bytes = 12866856, central_bytes = 22301112, transfer_bytes = 7452416, metadata_bytes = 9957528, pageheap = {system_bytes = 8751415296, free_bytes = 8691818496, unmapped_bytes = 491520}}
# the free memory is pageheap.free_bytes; here it is 8691818496 bytes, i.e. about 8 GB of free memory
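To make the "leak or free memory" decision concrete, here is a minimal sketch that computes what fraction of the memory obtained from the OS is sitting free in the page heap. The struct name, the function names and the 0.5 threshold are assumptions for illustration only; just the three field names come from the gdb output above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical struct mirroring the pageheap fields printed by gdb above.
struct PageHeapStats {
    uint64_t system_bytes;    // bytes obtained from the OS
    uint64_t free_bytes;      // bytes sitting free in the page heap
    uint64_t unmapped_bytes;  // bytes already returned to the OS
};

// Fraction of the memory obtained from the OS that is free in the page heap.
double FreeFraction(const PageHeapStats& s) {
    assert(s.system_bytes > 0);
    return static_cast<double>(s.free_bytes) /
           static_cast<double>(s.system_bytes);
}

// Heuristic with an assumed threshold: if most of the heap is free,
// the growth is retained free memory rather than a leak.
bool LooksLikeRetainedFreeMemory(const PageHeapStats& s,
                                 double threshold = 0.5) {
    return FreeFraction(s) >= threshold;
}
```

With the values printed above (system_bytes = 8751415296, free_bytes = 8691818496), FreeFraction comes out above 0.99, which matches the conclusion that this is free memory and not a leak.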
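2. Find what caused the large amount of free memory (large free memory generally means there was a large allocation request first, and the memory was later freed back to tcmalloc)
1) Run pmap pid. tcmalloc serves small objects from per-thread caches, but here the 8 GB sits in a single mapping at a low address (i.e. heap obtained through brk). Memory like this is not allocated from the thread caches; it usually points to a request for a large block (over 256K).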
0000000000b4b000 324K rw--- [ anon ]
0000000002133000 2148K rw--- [ anon ]
000000000234c000 8594880K rw--- [ anon ]
00007fdb0d58c000 4K ----- [ anon ]
00007fdb0d58d000 8192K rw--- [ anon ]
00007fdb0dd8d000 4K ----- [ anon ]
00007fdb0dd8e000 8192K rw--- [ anon ]
00007fdb0e58e000 4K ----- [ anon ]
00007fdb0e58f000 8192K rw--- [ anon ]
00007fdb0ed8f000 4K ----- [ anon ]
00007fdb0ed90000 8192K rw--- [ anon ]
00007fdb0f590000 4K ----- [ anon ]
00007fdb0f591000 8192K rw--- [ anon ]
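When the pmap output is long, the largest mapping can be picked out programmatically. The sketch below is only an illustration: Mapping, ParsePmapLine and LargestMapping are hypothetical names, and it assumes each line has the "address sizeK perms name" layout shown above.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

struct Mapping {
    uint64_t start = 0;    // start address of the mapping
    uint64_t size_kb = 0;  // size in KiB, as printed by pmap
};

// Parses one line such as "000000000234c000 8594880K rw--- [ anon ]".
// Returns false for lines that do not have this layout.
bool ParsePmapLine(const std::string& line, Mapping* out) {
    assert(out != nullptr);
    std::istringstream in(line);
    std::string addr, size;
    if (!(in >> addr >> size) || size.back() != 'K') return false;
    try {
        out->start = std::stoull(addr, nullptr, 16);
        out->size_kb = std::stoull(size.substr(0, size.size() - 1));
    } catch (const std::exception&) {
        return false;  // not a number: header or summary line
    }
    return true;
}

// Returns the largest mapping in the text (size_kb == 0 if none parsed).
Mapping LargestMapping(const std::string& pmap_text) {
    Mapping largest;
    std::istringstream in(pmap_text);
    std::string line;
    while (std::getline(in, line)) {
        Mapping m;
        if (ParsePmapLine(line, &m) && m.size_kb > largest.size_kb) {
            largest = m;
        }
    }
    return largest;
}
```

For the output above, LargestMapping returns the 8594880K mapping starting at 0x234c000, far below the 0x7fdb... range where the other mappings live.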
2) Locate where the large block is allocated. Reading the code, the call path for a large allocation is:
tc_malloc->do_malloc_or_cpp_alloc->do_malloc->do_malloc_pages->ReportLargeAlloc
static void ReportLargeAlloc(Length num_pages, void* result) {
  StackTrace stack;
  stack.depth = GetStackTrace(stack.stack, tcmalloc::kMaxStackDepth, 1);
  static const int N = 1000;
  char buffer[N];
  TCMalloc_Printer printer(buffer, N);
  printer.printf("tcmalloc: large alloc %"PRIu64" bytes == %p @ ",
                 static_cast<uint64>(num_pages) << kPageShift,
                 result);
  for (int i = 0; i < stack.depth; i++) {
    printer.printf(" %p", stack.stack[i]);
  }
  printer.printf("\n");
  write(STDERR_FILENO, buffer, strlen(buffer));
}
It turns out that when a block larger than 1 GB is allocated (the default threshold, which can be changed through the environment variable TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD), a warning starting with "tcmalloc: large alloc" is written to the process's standard error, e.g. cat std_err.log | grep tcmalloc:
tcmalloc: large alloc 4294967296 bytes == 0xe4bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
tcmalloc: large alloc 4294967296 bytes == 0x10e8bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
tcmalloc: large alloc 4294967296 bytes == 0xe4bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
tcmalloc: large alloc 4294967296 bytes == 0xe4bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
tcmalloc: large alloc 4294967296 bytes == 0xe4bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
tcmalloc: large alloc 4294967296 bytes == 0xe4bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
tcmalloc: large alloc 4294967296 bytes == 0x10e8bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
tcmalloc: large alloc 4294967296 bytes == 0x10e8bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
tcmalloc: large alloc 4294967296 bytes == 0x10e8bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
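3. Convert the "tcmalloc: large alloc" warning into a code call stack
1) Take one of the lines found above, e.g. tcmalloc: large alloc 4294967296 bytes == 0xe4bc000 @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150, copy the list of addresses after the "@" (highlighted in red in the original post), and paste it over the address list on the second line of the following template: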
heap profile: 36418: 16055130 [42282501: 24696338245] @ heapprofile
8: 6291456 [ 8: 6291456] @ 0x7cf964 0x536bd3 0x5dc2f1 0x5dc3fa 0x5dc515 0x5dcab0 0x5dbd32 0x5dcb92 0x5decf9 0x5dd635 0x696150 0x7fdb2c1addd5
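Then save the result as the file 001.heap.
2) Convert it into a code call stack to see where the offending code is (pprof can be installed first with yum install gperftools)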
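The copy-and-paste step can also be scripted. The following is a minimal sketch, not part of gperftools: ExtractStack and BuildHeapProfile are hypothetical helpers, and the header and count fields are simply the ones from the template above, with only the address list replaced.

```cpp
#include <cassert>
#include <string>

// Returns the address list that follows '@' in a
// "tcmalloc: large alloc" log line, without surrounding whitespace.
std::string ExtractStack(const std::string& log_line) {
    const std::string::size_type at = log_line.find('@');
    if (at == std::string::npos) return "";
    const std::string::size_type first =
        log_line.find_first_not_of(' ', at + 1);
    if (first == std::string::npos) return "";
    const std::string::size_type last =
        log_line.find_last_not_of(" \r\n");
    return log_line.substr(first, last - first + 1);
}

// Builds the text of the .heap file from the template shown above,
// substituting the stack addresses taken from the log line.
std::string BuildHeapProfile(const std::string& log_line) {
    const std::string stack = ExtractStack(log_line);
    assert(!stack.empty());
    return "heap profile: 36418: 16055130 [42282501: 24696338245]"
           " @ heapprofile\n"
           "8: 6291456 [ 8: 6291456] @ " + stack + "\n";
}
```

Writing the string returned by BuildHeapProfile to 001.heap (for example with std::ofstream) gives the same file as the manual edit.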
pprof --text --lines ./main ./001.heap | less
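3) This is the location I found. The code receives a packet header and allocates a char array directly from the length field in that header. If a malformed header arrives, a single allocation can be as large as 4 GB. The buffer is freed afterwards, but tcmalloc keeps the freed pages, so the process's memory usage stays high: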
char *pBuffer = new char[packetHeader.u32PacketLength];
III. Solution:
1) In my scenario, the fix is to discard malformed packets or close the connection.
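A minimal sketch of such a check follows. Only the field name u32PacketLength comes from the code above; the PacketHeader layout, the kMaxPacketLength limit of 1 MiB and the AllocatePacketBuffer helper are assumptions, and the real limit depends on the protocol.

```cpp
#include <cassert>  // used by callers that assert on the result
#include <cstdint>
#include <memory>

// Hypothetical header layout; only u32PacketLength is from the original code.
struct PacketHeader {
    uint32_t u32PacketLength;
};

// Assumed upper bound for a legitimate packet (1 MiB).
constexpr uint32_t kMaxPacketLength = 1u << 20;

// Allocates the packet buffer only if the declared length is plausible.
// Returns nullptr for an invalid length; the caller should then drop
// the packet or close the connection.
std::unique_ptr<char[]> AllocatePacketBuffer(const PacketHeader& header) {
    if (header.u32PacketLength == 0 ||
        header.u32PacketLength > kMaxPacketLength) {
        return nullptr;
    }
    return std::unique_ptr<char[]>(new char[header.u32PacketLength]);
}
```

Validating the length before calling new means a corrupted header can no longer trigger a 4 GB allocation, so the page heap never grows to that size in the first place.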