基于Linux内核的应用探测: uprobe

基于Linux内核的应用性能分析技术Linux内核有良好的分层设计，但了解并实时跟踪内核的行为并不容易。为此，Linux内核开发者实现了跟踪点(tracepoint)、性能监测(perf_event)、函数跟踪(ftrace)等功能，用于Linux的调试和性能分析。最近几年已经成熟的eBPF内核子系统带来了内核性能分析技术的飞跃，常用的主要有bcc、bpftrace；二者同属于github的iov

塵觴葉

4169人浏览 · 2022-03-13 20:31:28

塵觴葉 · 2022-03-13 20:31:28 发布

基于Linux内核的应用性能分析技术

Linux内核有良好的分层设计，但了解并实时跟踪内核的行为并不容易。为此，Linux内核开发者实现了跟踪点(tracepoint)、性能监测(perf_event)、函数跟踪(ftrace)等功能，用于Linux的调试和性能分析。最近几年已经成熟的eBPF内核子系统带来了内核性能分析技术的飞跃，常用的主要有bcc、bpftrace；二者同属于github的iovisor用户，直译出来即为“输入输出监测”，即性能监测。这些工具同样可用于应用层的性能分析，这是笔者关注的功能。不过二者在嵌入式设备支持不够完善，原因在于都依赖llvm编译器功能，用于将C代码或bpftrace代码编译成eBPF字节码，并加载到Linux内核中执行。早在2012年Linux内核就加入了uprobe应用探测功能，并提供了sysfs接口；该接口不需要向Linux内核加载eBPF，比较方便在嵌入式设备上使用。不过，bpftrace等工具提供的uprobe探测功能并不依赖sysfs接口，而是直接向Linux内核注入eBPF字节码。

uprobe跟踪简单应用函数的入参

Linux内核的官方文档对uprobe的接口作了详细的说明，可供参考。笔者编写了以下简单代码：

/* file: registers.c */
/* force usage of `int strerror_r(...), instead of the GNU specific one */
#ifdef _GNU_SOURCE
#undef _GNU_SOURCE
#endif

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <sys/types.h>
#include <unistd.h>

int wrapper_strerror(int error,
	unsigned long stkptr, char * description, size_t desclen) __attribute__((noinline));

static void nano_sleep(unsigned int msec)
{
	struct timespec spec;
	spec.tv_sec = (time_t) (msec / 1000);
	spec.tv_nsec = (long) ((msec % 1000) * 1000000);
	nanosleep(&spec, NULL);
}

#define REG_BUFSIZ      512
#define REG_ERRNO_MAX   200
int main(int argc, char *argv[])
{
	int idx;
	char * errmsg;
	unsigned long stkval;

	fprintf(stdout, "Address of wrapper_strerror: %p, pid: %ld\n",
		(void *) wrapper_strerror, (unsigned long) getpid());
	fflush(stdout);
	errmsg = (char *) malloc(REG_BUFSIZ);
	if (errmsg == NULL)
		exit(1);
	for (idx = 1; idx < REG_ERRNO_MAX; ++idx) {
		stkval = 0;
		memset(errmsg, 0, REG_BUFSIZ);
		nano_sleep(5000);
#if 0
		__asm__ volatile ("movq %%rsp, %0" : "=r"(stkval));
#else
		__asm__ volatile ("mov %0, sp" : "=r"(stkval));
#endif
		wrapper_strerror(idx,
			stkval, errmsg, (size_t) (REG_BUFSIZ - idx));
	}

	free(errmsg);
	return 0;
}

int wrapper_strerror(int error,
	unsigned long stkptr, char * description, size_t desclen)
{
	int ret;

	ret = strerror_r(error, description, desclen);
	if (ret) {
		fprintf(stderr, "Error, strerror_r(%d) has failed with %d: %s\n",
			error, ret, strerror(ret));
		fflush(stderr);
		return -1;
	}

	fprintf(stdout, "ra: %p, stack pointer: %p, errno[%03d]: (%p) %s\n",
		(void *) __builtin_return_address(0),
		(void *) stkptr, error, description, description);
	fflush(stdout);
	return error;
}

为嵌入式ARM设备编译命令为（上面的代码用到了内联汇编以获得应用的栈指针，其他架构的平台会编译出错）：

aarch64-linux-gnu-gcc -O0 -Wall -o registers registers.c
cp registers /tmp/

编译完成后，将registers可执行文件复制到/tmp目录下。笔者希望通过uprobe功能跟踪得到函数wrapper_strerror的入参。上面的代码已将参数在函数中写入柡准输出，可以用于uprobe跟踪结果的对比。笔者编写了以下配置uprobe接口的脚本，用以跟踪/tmp/registers：

#!/bin/bash

pdir='/sys/kernel/debug/tracing'
cd "${pdir}" || exit $?

# clear all events
echo 0 >  tracing_on
if [ -e events/uprobes/enable ] ; then
	echo 0 > events/uprobes/enable
fi
echo > uprobe_events

echo 'p:wrapper_entry /tmp/registers:0xbf8 %x0 %x1 %x2 %x3 %x30' > uprobe_events
sleep 1 # delay one seconds
echo 1 > events/uprobes/enable
echo 1 > tracing_on

注意到以上脚本中的0xbf8，该值是一个偏移量，是函数wrapper_strerror在可执行文件/tmp/registers的代码段中的偏移量。该偏移量可以通过aarch64-linux-gnu-objdump、aarch64-linux-gnu-readelf等工具获取，这里不便展开。之后的%x0 %x1 %x2 %x3 %x30五个64位ARM寄存器分别对应函数wrapper_strerror的四个参数和函数调用时返回地址。上面的C代码中用内联汇编获得了函数调用前的栈指针并输出到终端，该值应与uprobe跟踪的结果一致。

以root权限执行以上脚本，并打开Linux内核的跟踪输出文件：

./uprobe-trace.sh
cat /sys/kernel/tracing/trace_pipe

注意，第二条cat命令会阻塞，这是正常的。之后，在另一个终端命令行下执行/tmp/registers，可得到应用的运行输出：

$ /tmp/registers 
Address of wrapper_strerror: 0x5562be0bf8, pid: 7694
ra: 0x5562be0bc8, stack pointer: 0x7ff3f60e70, errno[001]: (0x559a6746b0) Operation not permitted
ra: 0x5562be0bc8, stack pointer: 0x7ff3f60e70, errno[002]: (0x559a6746b0) No such file or directory
ra: 0x5562be0bc8, stack pointer: 0x7ff3f60e70, errno[003]: (0x559a6746b0) No such process
ra: 0x5562be0bc8, stack pointer: 0x7ff3f60e70, errno[004]: (0x559a6746b0) Interrupted system call

同时，在有root权限的终端（即执行cat /sys/kernel/tracing/trace_pipe的终端）可得到uprobe的跟踪输出结果：

       registers-7694    [000] .... 110526.657845: wrapper_entry: (0x5562be0bf8) arg1=0x1 arg2=0x7ff3f60e70 arg3=0x559a6746b0 arg4=0x1ff arg5=0x5562be0bc8
       registers-7694    [000] .... 110531.658272: wrapper_entry: (0x5562be0bf8) arg1=0x2 arg2=0x7ff3f60e70 arg3=0x559a6746b0 arg4=0x1fe arg5=0x5562be0bc8
       registers-7694    [000] .... 110536.658639: wrapper_entry: (0x5562be0bf8) arg1=0x3 arg2=0x7ff3f60e70 arg3=0x559a6746b0 arg4=0x1fd arg5=0x5562be0bc8
       registers-7694    [000] .... 110541.659019: wrapper_entry: (0x5562be0bf8) arg1=0x4 arg2=0x7ff3f60e70 arg3=0x559a6746b0 arg4=0x1fc arg5=0x5562be0bc8

可见，uprobe的跟踪结果与应用的输出结果一致。此时若用gdb加载运行中的进程，可以看到函数wrapper_strerror:的第一条机器指令被替换成了brk指令，用于应用陷入Linux内核(这样内核会抓取应用的状态并进行性能分析)：

(gdb) disassemble wrapper_strerror,+0x10
Dump of assembler code from 0x5555550bf8 to 0x5555550c08:
   0x0000005555550bf8 <wrapper_strerror+0>:	brk	#0x5
   0x0000005555550bfc <wrapper_strerror+4>:	mov	x29, sp
   0x0000005555550c00 <wrapper_strerror+8>:	str	x19, [sp, #16]
   0x0000005555550c04 <wrapper_strerror+12>:	mov	x19, x30
End of assembler dump.

上面由Linux内核替换为brk的指令，会在Linux内核中执行（在停止应用探测前，内核不会将其还原），这是基于eBPF的探测功能高效的原因。停止应用探测功能的操作如下：

echo 0 > /sys/kernel/debug/tracing/tracing_on
echo 0 > /sys/kernel/debug/tracing/events/uprobes/enable

使用bpftrace监测应用的malloc调用

使用GDB调试工具可以方便地获得一个应用对malloc函数调用的结果。这个调试过程并不是交互的，使用commands命令，GDB可以在断点触发后自动继续运行应用：

(gdb) break __libc_malloc
Breakpoint 2 at 0x7ff7ebeaa0: file malloc.c, line 3031.
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>i r x0
>bt
>cont
>end
(gdb) c
Continuing.

Breakpoint 2, __GI___libc_malloc (bytes=1024) at malloc.c:3031
3031	malloc.c: No such file or directory.
x0             0x400               1024
#0  __GI___libc_malloc (bytes=1024) at malloc.c:3031
#1  0x0000007ff7ea8224 in __GI__IO_file_doallocate (fp=0x7ff7fb3580 <_IO_2_1_stdout_>) at filedoalloc.c:101
#2  0x0000007ff7eb7630 in __GI__IO_doallocbuf (fp=0x7ff7fb3580 <_IO_2_1_stdout_>) at libioP.h:948
#3  __GI__IO_doallocbuf (fp=fp@entry=0x7ff7fb3580 <_IO_2_1_stdout_>) at genops.c:342
#4  0x0000007ff7eb6908 in _IO_new_file_overflow (f=0x7ff7fb3580 <_IO_2_1_stdout_>, ch=-1) at fileops.c:745
#5  0x0000007ff7eb5a38 in _IO_new_file_xsputn (n=9, data=<optimized out>, f=0x7ff7fb3580 <_IO_2_1_stdout_>) at libioP.h:948
#6  _IO_new_file_xsputn (f=0x7ff7fb3580 <_IO_2_1_stdout_>, data=<optimized out>, n=9) at fileops.c:1197
#7  0x0000007ff7e9f2e4 in __vfprintf_internal (s=0x7ff7fb3580 <_IO_2_1_stdout_>, format=0x5555550b80 "\t/pid == %ld/\n", ap=..., mode_flags=mode_flags@entry=0) at ../libio/libioP.h:948
#8  0x0000007ff7e8f454 in __fprintf (stream=<optimized out>, format=<optimized out>) at fprintf.c:32
#9  0x0000005555550a10 in main (argc=1, argv=0x7ffffff2e8) at memory.c:35

Breakpoint 2, __GI___libc_malloc (bytes=1024) at malloc.c:3031
3031	in malloc.c
x0             0x400               1024
#0  __GI___libc_malloc (bytes=1024) at malloc.c:3031
#1  0x0000007ff7ea8224 in __GI__IO_file_doallocate (fp=0x7ff7fb3580 <_IO_2_1_stdout_>) at filedoalloc.c:101
#2  0x0000007ff7eb7630 in __GI__IO_doallocbuf (fp=0x7ff7fb3580 <_IO_2_1_stdout_>) at libioP.h:948
#3  __GI__IO_doallocbuf (fp=fp@entry=0x7ff7fb3580 <_IO_2_1_stdout_>) at genops.c:342
#4  0x0000007ff7eb6908 in _IO_new_file_overflow (f=0x7ff7fb3580 <_IO_2_1_stdout_>, ch=-1) at fileops.c:745
#5  0x0000007ff7eb5a38 in _IO_new_file_xsputn (n=9, data=<optimized out>, f=0x7ff7fb3580 <_IO_2_1_stdout_>) at libioP.h:948
#6  _IO_new_file_xsputn (f=0x7ff7fb3580 <_IO_2_1_stdout_>, data=<optimized out>, n=9) at fileops.c:1197
#7  0x0000007ff7e9f2e4 in __vfprintf_internal (s=0x7ff7fb3580 <_IO_2_1_stdout_>, format=0x5555550b80 "\t/pid == %ld/\n", ap=..., mode_flags=mode_flags@entry=0) at ../libio/libioP.h:948
#8  0x0000007ff7e8f454 in __fprintf (stream=<optimized out>, format=<optimized out>) at fprintf.c:32
#9  0x0000005555550a10 in main (argc=1, argv=0x7ffffff2e8) at memory.c:35
	/pid == 7853/
Press any key to continue... 
Breakpoint 2, __GI___libc_malloc (bytes=1024) at malloc.c:3031
3031	in malloc.c
x0             0x400               1024
#0  __GI___libc_malloc (bytes=1024) at malloc.c:3031
#1  0x0000007ff7ea8224 in __GI__IO_file_doallocate (fp=0x7ff7fb28d0 <_IO_2_1_stdin_>) at filedoalloc.c:101
#2  0x0000007ff7eb7630 in __GI__IO_doallocbuf (fp=0x7ff7fb28d0 <_IO_2_1_stdin_>) at libioP.h:948
#3  __GI__IO_doallocbuf (fp=fp@entry=0x7ff7fb28d0 <_IO_2_1_stdin_>) at genops.c:342
#4  0x0000007ff7eb66b4 in _IO_new_file_underflow (fp=0x7ff7fb28d0 <_IO_2_1_stdin_>) at fileops.c:486
#5  0x0000007ff7eb76f0 in __GI__IO_default_uflow (fp=0x7ff7fb28d0 <_IO_2_1_stdin_>) at libioP.h:948
#6  0x0000005555550a48 in main (argc=1, argv=0x7ffffff2e8) at memory.c:39

不过这样的调试方法仅限于“调试”，而不建议用于应用的性能分析。GDB的调试过程会严重干扰应用的正常运行，断点的停止与恢复需要消耗很多的时间，效率低下。这也是uprobe（及eBPF相关的Linux内核跟踪功能）的不同之处：这些应用的状态记录、分析、汇聚是在Linux内核中直接完成的，可以做到低延时(Low Latency)，尽可能地不影响应用的运行。应用的内存泄露问题通常比其他内存问题（如应用访问非法内存而崩溃）解决起来要棘手得多，主要原因在于内存泄露不会影响进程的运行，只会在系统的内存资源消耗殆尽时体现。应用可能会因无法申请到新的内存而崩溃，但崩溃的代码常常不是存在内存泄露问题的代码。一种可行的方案是将所有进行内存分配、释放的大小、指针、调用栈记录下来（并作统计分析），从而进一步确认应用进程何处一直在申请内存却未释放。

笔者编写了以下简单C代码，用以演示使用bpftrace工具跟踪该应用的内存申请操作：

/* file: memory.c ; aarch64-linux-gnu-gcc -Wall -O0 -o memory memory.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define malloc_func(level_x, level_y) \
static void * malloc_ ## level_x(size_t msiz) { \
	return malloc_ ## level_y(msiz); \
}

static void * malloc_3(size_t ms)
{
	return malloc(ms);
}

malloc_func(2, 3)
malloc_func(1, 2)
malloc_func(0, 1)

int main(int argc, char *argv[])
{
	int ret;
	void * mptr;

	fprintf(stdout, "\t/pid == %ld/\n",
		(long) getpid());
	fputs("Press any key to continue... ", stdout);
	fflush(stdout);
	ret = getchar();
	fprintf(stdout, "\nYou pressed %c\n", (char) ret);
	fflush(stdout);

	mptr = malloc_0(0x2022);
	fprintf(stdout, "malloc has returned: %p\n", mptr);
	fflush(stdout);
	free(mptr);
	return 0;
}

在调用malloc函数进行内存分配之前，该简单应用会输出进程的PID，并等待用户按任意键。这样做的目地是为了有时间修改下面的bpftrace脚本（文件名为memory.bt），将XXX替换为被跟踪进程的PID（bpftrace脚本支持$1、$2等用以引入命令行参数，但未能找到用于PID匹配的方法）：

#!/usr/bin/bpftrace
uprobe:/usr/lib/aarch64-linux-gnu/libc-2.31.so:__libc_malloc /pid == XXX/ {
	printf("tid: %d => malloc'ing(%x)...", tid, arg0);
	print(ustack());
}
uretprobe:/usr/lib/aarch64-linux-gnu/libc-2.31.so:__libc_malloc /pid == XXX/ {
	printf("tid: %d => malloced(): %p", tid, retval);
	print(ustack());
}

注意，笔者是在64位ARM系统上进行跟踪调试的。上面的脚本指定了libc.so动态库的绝对路径，并指定了跟踪探测的函数：__libc_malloc（可增加偏移量，此处为函数入口），亦即malloc内存分配函数。通过uretprobe可以获得函数的返回指针，两次探测都会输出相应的线程号tid，以及应用的调用栈。将XXX修改为被调试应用的PID之后（或者修改为/pid == $1/，并在命令行上通过positional argument传递给脚本），以root权限执行该bpftrace脚本，在另一终端按任意键唤醒被跟踪应用，可以得到以下结果：

# cat memory.bt
#!/usr/bin/bpftrace
uprobe:/usr/lib/aarch64-linux-gnu/libc-2.31.so:__libc_malloc /pid == 2712/ {
	printf("tid: %d => malloc'ing(%x)...", tid, arg0);
	print(ustack());
}
uretprobe:/usr/lib/aarch64-linux-gnu/libc-2.31.so:__libc_malloc /pid == 2712/ {
	printf("tid: %d => malloced(): %p", tid, retval);
	print(ustack());
}
# bpftrace ./memory.bt
Attaching 2 probes...
tid: 2712 => malloc'ing(2022)...
        0x7f8f230aa0
        0x5560910994
        0x55609109b0
        0x55609109cc
        0x5560910a8c
        0x7f8f1d8218
        0x5560910884

tid: 2712 => malloced(): 0x5568a7eac0
        0x5560910978
        0x5560910994
        0x55609109b0
        0x55609109cc
        0x5560910a8c
        0x7f8f1d8218
        0x5560910884

可见，malloc的探测uprobe及返回探测uretprobe的调用栈回溯是一致的。若应用每次运行，应用及动态库加载的地址是相同的，根据该栈回溯结合GDB调试工具可以得到应用中分配内存的具体功能模块，从而深入了解应用的行为。相应的，被跟踪应用的输出结果与上面的结果一致，分配的内存指针为0x5568a7eac0：

$ ./memory 
	/pid == 2712/
Press any key to continue... 
You pressed 
malloc has returned: 0x5568a7eac0

若同时对free函数进行跟踪，编写特定的脚本处理bpftrace的输出结果，进行统计分析，就可以确定进程申请后未释放的内存的大小、调用栈回溯了。

Linux

更多推荐

Linux虚拟文件系统之文件系统卸载（sys_umount())

Linux中卸载文件系统由umount系统调用实现，入口函数为sys_umount()。较于文件系统的安装较为简单，下面是具体的实现。1. /*sys_umont系统调用*/2. SYSCALL_DEFINE2(umount, char __user *, name, int, flags)3. {4.struct path path;

Linux

网卡速率和双工模式的配置

http://linux.chinaitlab.com/system/792187.html1、mii-tool 配置网络设备协商方式的工具； 1.1 mii-tool 介绍； mii-tool - view, manipulate media-independent interface status （mii-tool 是查看，管理介质的网络接口的状态）

Linux

Linux系统下超级终端Minicom的使用方法（例如：连接交换机，路由器等）转http://baike.baidu.com/view/2911642.htm?fr=ala0_1

Linux系统下超级终端Minicom的使用方法 　　Linux下的Minicom的功能与下的超级终端功能相似，适于在通过超级终端对设备的管理以及对嵌入操作系统的升级，现写出Minicom的使用手册： 　　1．启动minicom 　　以root权限登录系统 　　使用命令 　　minicom –s 则minicom启动，屏

Linux

所有评论(0)

查看更多评论

塵觴葉

@yeholmes

已为社区贡献2条内容