Trace Series 4 - kprobe Study Notes
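0. Preface
This article records hands-on practice on aarch64, following the 阅码场 video course 《Linux内核tracers的实现原理与应用》 (Implementation Principles and Applications of Linux Kernel Tracers). By observing how the hook function is created and how the instruction is replaced, we can understand how tracing works. As in the earlier parts of this series, blk_update_request is used as the example function to explain how kprobe works. The kprobe discussed here is implemented on top of trace events and also uses the ftrace framework.
First, a rough look at how tracepoints, trace events, and kprobes relate to each other: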
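- A tracepoint is placed in the code and must have a probe function associated with it. When the tracepoint is enabled and execution reaches it, the probe function is called; when the probe function returns, the original function continues. Using a tracepoint takes three steps:
(1) Declare the tracepoint with the DECLARE_TRACE macro in the header include/trace/events/subsys.h.
(2) Create the tracepoint with DEFINE_TRACE in the source file subsys/file.c.
(3) Associate the tracepoint with a probe through register_trace_subsys_eventname.
- A trace event is built on top of a tracepoint. A single macro covers the first two steps above and also defines the register and unregister functions; registration is then triggered with echo 1 > events/subsys_event. From kernel 4.0 onward, trace events are encouraged and direct use of tracepoints is discouraged, which can be inferred from the removal of the tracepoint samples in 3.9. Unlike a raw tracepoint, a trace event does not require you to write your own probe function, which usually has to be defined in a module and then loaded. Instead, the TRACE_EVENT macro generates probe functions with a uniform format; the user only specifies the format in which the trace data is stored in the ring buffer and the format in which it is printed.
- A kprobe can be understood as a dynamic trace event. It can place a trace event on any function except those annotated with __kprobes/nokprobe_inline or marked with NOKPROBE_SYMBOL. The kernel option CONFIG_KPROBE_EVENTS=y must be enabled first.
There are two main ways to use kprobe: loading a module, or using the debugfs interface.
(1) Loading a module: take the kernel's kprobe_example as an example. Declare a struct kprobe, set its key members (symbol_name, pre_handler, post_handler), then register it with register_kprobe. After kprobe_example.ko is loaded with insmod, the pre_handler and post_handler callbacks run every time the system starts a new process, for example when running ls or cat.
(2) The debugfs interface: add a kprobe trace point by writing to /sys/kernel/debug/tracing/kprobe_events, then enable it by writing to /sys/kernel/debug/tracing/events/kprobes/<EVENT>/enable.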
Kernel version: 5.10
Platform: arm64
1. How kprobe works overall
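Note: the following is adapted from "kprobe原理解析(二)".
A kprobe works roughly as follows:
- Register the kprobe. Each registered kprobe corresponds to a struct kprobe, which records the probe location and the instruction originally at that location (original_opcode).
- Replace the original instruction. When the kprobe is enabled, the instruction at the probe location is replaced with an exception (BRK) instruction, so the CPU traps into exception state when it reaches that location.
- Run pre_handler. Once in exception state, pre_handler runs first. Then, using the CPU's single-step facility, the relevant registers are set up so that the next instruction to execute is the original instruction from the probe location, and the handler returns from the exception.
- Trap again. Because single-step was armed in the previous step, the CPU traps a second time as soon as original_opcode has executed. Single-step is then cleared, post_handler runs, and the handler returns safely from the exception.
Steps 2, 3, and 4 make up one kprobe hit. The basic idea is to expand the execution of a single instruction into three phases: kprobe->pre_handler => instruction => kprobe->post_handler.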
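The three-phase idea can be modelled in user space. The following is only a toy sketch under stated assumptions: struct toy_probe and every toy_* function are hypothetical names invented for illustration, and no real exception or single-step is involved. It only shows the order in which the callbacks and the original instruction run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of a probe; not the kernel's struct kprobe. */
struct toy_probe {
    void (*pre_handler)(char *log);
    void (*post_handler)(char *log);
};

/* Each phase appends a marker to the log so the order can be inspected. */
void toy_pre(char *log)  { strcat(log, "pre,"); }
void toy_insn(char *log) { strcat(log, "insn,"); }
void toy_post(char *log) { strcat(log, "post"); }

/*
 * Model one probe hit: the first trap runs pre_handler, the original
 * instruction is then stepped, and the second trap runs post_handler.
 * The caller must supply a log buffer large enough for all markers.
 */
void toy_probe_hit(const struct toy_probe *p, char *log)
{
    assert(p != NULL && log != NULL);
    log[0] = '\0';
    if (p->pre_handler)
        p->pre_handler(log);
    toy_insn(log);
    if (p->post_handler)
        p->post_handler(log);
}
```

Calling toy_probe_hit with both handlers set leaves the markers in the order pre, insn, post; a probe without a post_handler simply skips the last phase.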
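2. kprobe domain model
Once a kprobe has been created, the structures involved form the following domain model:
- trace_kprobe: contains both a kretprobe and a trace_probe.
- kretprobe: describes a kretprobe.
- trace_probe: describes the kprobe; it shares the kprobe structure with kretprobe. Its args array stores the arguments echoed into the node.
- kprobe: the core structure describing a kprobe; it is linked into the global hash table kprobe_table.
- trace_probe_event: describes the kprobe's trace event; it contains the core structures trace_event_class and trace_event_call.
- trace_event_class: describes the class of a trace event.
- trace_event_call: a wrapper around trace_event; it is linked into the global ftrace_events list.
- trace_event: mainly references the trace_event_functions structure, which defines the trace_event callbacks; trace_event is linked into the global ftrace_event_list.
- trace_event_functions: defines the trace_event callbacks.
- trace_array: the top-level structure describing a trace instance. At present ftrace_trace_arrays holds only one global trace_array, global_trace, so each trace_event_call corresponds to one trace_array. trace_array->event_dir points to the /sys/kernel/debug/tracing/events directory.
- trace_event_file: manages all files under the kprobe trace event. It points to trace_event_call through event_call, to trace_subsystem_dir through system, and to trace_array through tr, so trace_event_file, trace_event_call, and trace_array correspond one to one. trace_event_file is linked into the events list of trace_array through list.
- trace_subsystem_dir: manages the directory of the kprobe trace events. It points to the directory node it manages (/sys/kernel/debug/tracing/events/kprobes) through entry, to trace_array through tr, and is linked into the systems list of trace_array through list. In this example trace_subsystem_dir represents the events/kprobes directory.
trace_event_file, trace_array, trace_event_call, and trace_subsystem_dir correspond one to one.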
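3. Creating a kprobe
The way kprobe works differs from the function trace and function graph trace covered earlier, but kprobe still reuses the ftrace framework. The following command ends up calling probes_write: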
ubuntu@VM-0-9-ubuntu:~$ echo 'p:blk_update blk_update_request request=$arg1 status=$arg2:u8 bytes=$arg3:u32' > /sys/kernel/debug/tracing/kprobe_events
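Here p stands for kprobe; for a kretprobe use r instead.
blk_update is the name of this trace event and can be chosen freely.
arg1 is the first function argument to trace; it is stored in the trace_probe.args array together with the other arguments.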
static ssize_t probes_write(struct file *file, const char __user *buffer,size_t count, loff_t *ppos)
\--trace_parse_run_command(file, buffer, count, ppos,create_or_delete_trace_kprobe);
|--char *kbuf, *buf, *tmp
| size_t done = 0
| //Allocate a buffer for the event command, which in this example is:
| //'p:blk_update blk_update_request request=$arg1 status=$arg2:u8 bytes=$arg3:u32'
|--kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL)
| //Copy the event command from user space to kernel space
|--copy_from_user(kbuf, buffer + done, size)
|--buf = kbuf;
| //Parse the event command
\--trace_run_command(buf, create_or_delete_trace_kprobe)
| //Split the event command on spaces into the argv array; each argv entry holds one command argument
| //argv[0]:"p:blk_update", argv[1]:"blk_update_request", argv[2]:"request=$arg1",
| //argv[3]: "status=$arg2:u8",argv[4]:"bytes=$arg3:u32"
|--argv = argv_split(GFP_KERNEL, buf, &argc)
\--create_or_delete_trace_kprobe(argc, argv)
trace_parse_run_command parses the kprobe command. The resulting argv array can be inspected in gdb:
(gdb) p *argv@20
$6 = {0xffff000007467700 "p:blk_update", 0xffff00000746770d "blk_update_request",
0xffff000007467720 "request=$arg1", 0xffff00000746772e "status=$arg2:u8",
0xffff00000746773e "bytes=$arg3:u32", 0x0, 0
.....
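The splitting step can be illustrated in user space. The sketch below is not the kernel's argv_split; toy_argv_split is a hypothetical helper that splits a buffer in place on whitespace, which is enough to reproduce the five argv entries seen in the gdb output.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Split buf in place on whitespace and store a pointer to each word in argv.
 * Returns the number of words found (at most max_args).
 * Hypothetical helper for illustration only.
 */
int toy_argv_split(char *buf, char *argv[], int max_args)
{
    int argc = 0;
    char *p = buf;

    assert(buf != NULL && argv != NULL);
    while (*p != '\0' && argc < max_args) {
        /* Skip leading whitespace before the next word. */
        while (isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;
        argv[argc++] = p;
        /* Advance to the end of the word and terminate it. */
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        if (*p != '\0')
            *p++ = '\0';
    }
    return argc;
}
```

Applied to the echo command used in this article, the helper yields five words, with argv[0] holding "p:blk_update" and argv[1] holding "blk_update_request".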
3.1 create_or_delete_trace_kprobe
create_or_delete_trace_kprobe(argc, argv)
|--trace_kprobe_create(argc, (const char **)argv)
|--struct trace_kprobe *tk = NULL;
| const char *event = NULL
| //Initialize the global trace_probe_log structure: subsystem name, trace arguments, and argument count
|--trace_probe_log_init("trace_kprobe", argc, argv)
| //Get the event name, blk_update in this example
|--event = strchr(&argv[0][1], ':');
| //Try to parse argv[1] as a numeric address; this fails for the symbol name blk_update_request, so the branch below is taken
|--if (kstrtoul(argv[1], 0, (unsigned long *)&addr))
| trace_probe_log_set_index(1)
| //The symbol in this example is blk_update_request
| symbol = kstrdup(argv[1], GFP_KERNEL)
| //Check whether symbol+offset is a function entry; offset is 0 here, so this checks that the blk_update_request symbol exists
| if (kprobe_on_func_entry(NULL, symbol, offset))
| flags |= TPARG_FL_FENTRY
|--trace_probe_log_set_index(0);
| //Parse the event name
|--traceprobe_parse_event_name(&event, &group, buf,event - argv[0])
| //Allocate the trace_kprobe structure
|--tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive,
| argc - 2, is_return);
| |--if (is_return)
| tk->rp.handler = kretprobe_dispatcher;
| else
| tk->rp.kp.pre_handler = kprobe_dispatcher
| //Parse the event's command-line arguments; the results are stored in the tk->tp->args array, where
| //args[i].name is the part left of "=" and args[i].comm is the part right of "="
|--for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++)
| traceprobe_parse_probe_arg(&tk->tp, i, tmp, flags)
| // Set the trace event print format; afterwards tk->tp->event->call->print_fmt is:
| //"(%lx) request=0x%Lx status=%u bytes=%u\", REC->__probe_ip, REC->request, REC->status, REC->bytes
|--traceprobe_set_print_fmt(&tk->tp, is_return);
| //Register the kprobe event
\--register_trace_kprobe(tk)
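In create_or_delete_trace_kprobe, gdb shows that argv is exactly the echoed command. The function registers the kprobe event, the trace_event_call, and the kprobe; for how these three relate, see section 2 (kprobe domain model). The key pre_handler is kprobe_dispatcher. The function also sets the print format and completes the registration of the trace_kprobe.
- alloc_trace_kprobe: allocates the trace_kprobe and sets the kprobe's pre_handler to kprobe_dispatcher (for a kretprobe it sets rp.handler to kretprobe_dispatcher instead).
- traceprobe_set_print_fmt: sets the kprobe's print format.
- register_trace_kprobe: initializes trace_kprobe, trace_probe, trace_probe_event, and trace_event_call, and registers the trace_event, the trace_event_call, and the kprobe.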
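To make the argument format concrete, here is a simplified user-space sketch of splitting one argument such as status=$arg2:u8 into the part left of "=", the part right of "=", and an optional type suffix. It is not traceprobe_parse_probe_arg: struct toy_probe_arg, its fixed buffer sizes, and toy_parse_probe_arg are assumptions made for this example, and the real parser handles many more cases.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, fixed-size stand-in for one parsed probe argument. */
struct toy_probe_arg {
    char name[32];  /* text left of '='                     */
    char comm[64];  /* text right of '=', e.g. "$arg2:u8"   */
    char type[16];  /* text after the last ':', or empty    */
};

/* Returns 0 on success, -1 if the argument is malformed or too long. */
int toy_parse_probe_arg(const char *arg, struct toy_probe_arg *out)
{
    const char *eq, *colon;
    size_t n;

    assert(arg != NULL && out != NULL);
    eq = strchr(arg, '=');
    if (eq == NULL || eq == arg)
        return -1;
    n = (size_t)(eq - arg);
    if (n >= sizeof(out->name) || strlen(eq + 1) >= sizeof(out->comm))
        return -1;
    memcpy(out->name, arg, n);
    out->name[n] = '\0';
    strcpy(out->comm, eq + 1);

    /* An optional ":type" suffix selects how the value is recorded. */
    out->type[0] = '\0';
    colon = strrchr(out->comm, ':');
    if (colon != NULL) {
        if (strlen(colon + 1) >= sizeof(out->type))
            return -1;
        strcpy(out->type, colon + 1);
    }
    return 0;
}
```

For status=$arg2:u8 this gives name "status", comm "$arg2:u8", and type "u8"; for request=$arg1 the type is left empty.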
3.1.1 register_trace_kprobe
register_trace_kprobe(tk)
| //Register the kprobe event
|--register_kprobe_event(tk)
| | //Initialize trace_event_call
| |--init_trace_event_call(tk)
| | |--if (trace_kprobe_is_return(tk))
| | | call->event.funcs = &kretprobe_funcs
| | | call->class->fields_array = kretprobe_fields_array;
| | | else
| | | call->event.funcs = &kprobe_funcs;
| | | call->class->fields_array = kprobe_fields_array
| | |--call->flags = TRACE_EVENT_FL_KPROBE
| | | //This function is the callback used to enable/disable the kprobe trace event
| | \--call->class->reg = kprobe_register
| \--trace_probe_register_event_call(&tk->tp)
| |--struct trace_event_call *call = trace_probe_event_call(tp)
| | //Register the trace_event
| |--register_trace_event(&call->event)
| | |--INIT_LIST_HEAD(&event->list)
| | |--event->funcs->raw = trace_nop_print;
| | | event->funcs->hex = trace_nop_print
| | | event->funcs->binary = trace_nop_print
| | \--hlist_add_head(&event->node, &event_hash[key]);
| | //Register the trace_event_call
| \--trace_add_event_call(call)
| //Register the kprobe
|-- __register_trace_kprobe(tk)
| |--if (trace_kprobe_is_return(tk))
| register_kretprobe(&tk->rp)
| else
| register_kprobe(&tk->rp.kp)
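| | //Look up the kprobe's symbol name (blk_update_request) in the symbol table to get its address; p is the kprobe
| |--addr = kprobe_addr(p)
| | //This addr is the address of the blk_update_request function
| |--p->addr = addr;
| | p->flags &= KPROBE_FLAG_DISABLED;
| | //Enabling the kprobe event replaces the instruction at the probed function's entry with brk, so the original instruction must be saved
| | //On first registration no kprobe is registered at this address yet, so old_p is NULL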
| |--old_p = get_kprobe(p->addr)
| |--prepare_kprobe(p)
| |--INIT_HLIST_NODE(&p->hlist)
| |--hlist_add_head_rcu(&p->hlist, &kprobe_table[hash_ptr(p->addr, KPROBE_HASH_BITS)]);
\--dyn_event_add(&tk->devent)
\--list_add_tail(&ev->list, &dyn_event_list)
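register_trace_kprobe: initializes trace_kprobe, trace_probe, trace_probe_event, and trace_event_call, and registers the trace_event, the trace_event_call, and the kprobe.
- init_trace_event_call: initializes trace_kprobe, trace_probe, trace_probe_event, and trace_event_call.
- trace_probe_register_event_call: calls register_trace_event and trace_add_event_call.
(1) register_trace_event registers the trace_event in the global event_hash hash list.
(2) trace_add_event_call registers the trace_event_call in the global ftrace_events list. It creates the trace_event_file, creates the blk_update directory under /sys/kernel/debug/tracing/events/kprobes, and creates the other file nodes under it.
- __register_trace_kprobe: registers the kprobe, distinguishing kretprobe from kprobe. register_kprobe adds the kprobe to the global kprobe_table hash table.
(1) register_kprobe: adds the kprobe as a hash node to the global kprobe_table, which completes the kprobe registration. The probe address addr is used as the hash key.
(1-1) prepare_kprobe: prepares for returning to the original instructions of blk_update_request after the probe fires. p->opcode saves the original entry instruction of blk_update_request; p->ainsn.api.insn points to the slot that holds a copy of that original entry instruction; p->ainsn.api.restore saves the address of the instruction following the original entry instruction, so that execution can resume from there after the breakpoint returns and continue along the original path.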
3.1.1.1 trace_add_event_call
trace_add_event_call(call)
|--__register_event(call, NULL)
| |--event_init(call)
| \--list_add(&call->list, &ftrace_events)
\--__add_event_to_tracers(call)
| //In this example the ftrace_trace_arrays list holds only the global global_trace, so tr is global_trace
|--list_for_each_entry(tr, &ftrace_trace_arrays, list)
__trace_add_new_event(call, tr);
|--struct trace_event_file *file
| //Create the trace_event_file, initialize it, and link it into the tr->events list
|--file = trace_create_new_event(call, tr)
| |--file->event_call = call;
| | file->tr = tr
| | atomic_set(&file->sm_ref, 0)
| | atomic_set(&file->tm_ref, 0)
| | INIT_LIST_HEAD(&file->triggers)
| |--list_add(&file->list, &tr->events)
\--event_create_dir(tr->event_dir, file)
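trace_add_event_call creates the trace_event_file, creates the blk_update directory under /sys/kernel/debug/tracing/events/kprobes, and creates the other file nodes under it.
- trace_create_new_event: creates the trace_event_file, initializes it, and links it into the tr->events list.
- event_create_dir: creates the "kprobes" directory under "/sys/kernel/debug/tracing/events", creates the blk_update directory under /sys/kernel/debug/tracing/events/kprobes, and creates the other file nodes under it.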
event_create_dir(tr->event_dir, file)
|--struct trace_event_call *call = file->event_call
| struct trace_array *tr = file->tr
| struct dentry *d_events
|--d_events = event_subsystem_dir(tr, call->class->system, file, parent)
| |--struct trace_subsystem_dir *dir;
| | struct event_subsystem *system;
| | struct dentry *entry
| | //Allocate the trace_subsystem_dir
| |--dir = kmalloc(sizeof(*dir), GFP_KERNEL)
| | //Create the "kprobes" directory under "/sys/kernel/debug/tracing/events"
| |--dir->entry = tracefs_create_dir(name, parent)
| |--dir->tr = tr;
| | dir->ref_count = 1;
| | dir->nr_events = 1;
| | dir->subsystem = system;
| | file->system = dir;
| |--tracefs_create_file("filter"...)
| |--trace_create_file("enable"...)
| |--list_add(&dir->list, &tr->systems)
| //name is blk_update
|--name = trace_event_name(call)
| //Create the blk_update directory under /sys/kernel/debug/tracing/events/kprobes
|--file->dir = tracefs_create_dir(name, d_events)
|--trace_create_file("enable", 0644, file->dir, file,&ftrace_enable_fops);
|--trace_create_file("id", 0444, file->dir, ...)
|--event_define_fields(call)
|--trace_create_file("filter", 0644, file->dir, file,&ftrace_event_filter_fops)
|--trace_create_file("trigger", 0644, file->dir, file,&event_trigger_fops);
event_create_dir creates the "kprobes" directory under "/sys/kernel/debug/tracing/events", creates the blk_update directory under /sys/kernel/debug/tracing/events/kprobes, and creates the other file nodes under it.
trace_create_file("enable", 0644, file->dir, file,&ftrace_enable_fops);
|--tracefs_create_file(name, mode, parent, data, fops)
|--struct dentry *dentry
| struct inode *inode;
| //Returns the dentry corresponding to the kprobes directory
|--dentry = start_creating(name, parent)
|--inode = tracefs_get_inode(dentry->d_sb)
|--inode->i_mode = mode;
| inode->i_fop = fops ? fops : &tracefs_file_operations;
| //data is the trace_event_file
| inode->i_private = data
|-- d_instantiate(dentry, inode)
|--fsnotify_create(dentry->d_parent->d_inode, dentry);
trace_create_file creates the enable file under /sys/kernel/debug/tracing/events/kprobes. The key point is inode->i_private: it is initialized to the trace_event_file created in trace_add_event_call and is used later in event_enable_write.
3.1.1.2 prepare_kprobe
|--prepare_kprobe(p)
\--arch_prepare_kprobe(p)
| //Copy the original entry instruction of blk_update_request: sub sp, sp, #0x60
|--unsigned long probe_addr = (unsigned long)p->addr;
|--p->opcode = le32_to_cpu(*p->addr)
|--search_exception_tables(probe_addr)
| //p->ainsn.api.insn points to an out-of-line slot; after arch_prepare_ss_slot it holds
| //a copy of the original entry instruction (sub sp, sp, #0x60) followed by brk #0x6
|--p->ainsn.api.insn = get_insn_slot()
| |--__get_insn_slot(struct kprobe_insn_cache *c)
\--arch_prepare_ss_slot(p)
|--kprobe_opcode_t *addr = p->ainsn.api.insn;
| //addr: receives p->opcode, the copy of the original entry instruction
| //addr+1: receives brk #0x6
|--void *addrs[] = {addr, addr + 1};
| //p->opcode: sub sp, sp, #0x60
|--u32 insns[] = {p->opcode, BRK64_OPCODE_KPROBES_SS};
|--aarch64_insn_patch_text(addrs, insns, 2);
|--flush_icache_range((uintptr_t)addr, (uintptr_t)(addr + MAX_INSN_SIZE))
| //Needs restoring of return address after stepping xol
| //p->addr is the address of sub sp, sp, #0x60: 0xffff8000104ec1f0
| //p->addr+4 is the address of stp x29, x30, [sp,#16]: 0xffff8000104ec1f4
\--p->ainsn.api.restore = (unsigned long) p->addr +sizeof(kprobe_opcode_t);
prepare_kprobe: prepares for returning to the original instructions of blk_update_request after the probe fires. p->opcode saves the original entry instruction of blk_update_request; p->ainsn.api.insn points to the slot that holds a copy of that original entry instruction; p->ainsn.api.restore saves the address of the instruction following the original entry instruction.
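The slot layout and the restore address can be checked with a small user-space model. This is a sketch only: struct toy_ainsn and toy_prepare_ss_slot are hypothetical names, the slot is an ordinary array rather than executable memory, and the two opcodes are passed in as plain numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the insn/restore fields of p->ainsn.api. */
struct toy_ainsn {
    uint32_t *insn;    /* out-of-line slot                       */
    uint64_t restore;  /* address to resume at after the step    */
};

/*
 * Fill a two-word slot the way the call tree describes it:
 * slot[0] = copy of the original instruction, slot[1] = single-step BRK.
 * The resume address is the instruction right after the probe point.
 */
void toy_prepare_ss_slot(uint64_t probe_addr, uint32_t opcode,
                         uint32_t ss_brk, uint32_t slot[2],
                         struct toy_ainsn *ainsn)
{
    assert(slot != NULL && ainsn != NULL);
    slot[0] = opcode;
    slot[1] = ss_brk;
    ainsn->insn = slot;
    ainsn->restore = probe_addr + sizeof(uint32_t);
}
```

With probe_addr 0xffff8000104ec1f0 the computed restore value is 0xffff8000104ec1f4, which is the same number gdb later prints in decimal (18446603336494793204) for ainsn.api.restore.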
4. kprobe brk instruction replacement
First, the disassembly of blk_update_request before the instruction is replaced:
Dump of assembler code for function blk_update_request:
0xffff8000104ec1f0 <+0>: sub sp, sp, #0x60
0xffff8000104ec1f4 <+4>: stp x29, x30, [sp,#16]
0xffff8000104ec1f8 <+8>: add x29, sp, #0x10
0xffff8000104ec1fc <+12>: stp x19, x20, [sp,#32]
0xffff8000104ec200 <+16>: stp x21, x22, [sp,#48]
0xffff8000104ec204 <+20>: stp x23, x24, [sp,#64]
0xffff8000104ec208 <+24>: str x25, [sp,#80]
0xffff8000104ec20c <+28>: mov x22, x0
0xffff8000104ec210 <+32>: uxtb w24, w1
0xffff8000104ec214 <+36>: mov w21, w2
0xffff8000104ec218 <+40>: mov x0, x30
0xffff8000104ec21c <+44>: nop
......
After running the following command:
ubuntu@VM-0-9-ubuntu:~$ echo 1 > /sys/kernel/debug/tracing/events/kprobes/blk_update/enable
we can see that the instruction at the entry of blk_update_request,
sub sp, sp, #0x60
has been replaced with:
0xffff8000104ec1f0 <+0>: brk #0x4
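For reference, the A64 BRK instruction carries its 16-bit immediate in bits [20:5] on top of the base encoding 0xd4200000. The sketch below encodes and decodes that immediate in user space; toy_brk_encode, toy_brk_decode, and TOY_BRK_BASE are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BRK_BASE 0xd4200000u  /* encoding of BRK #0 */

/* Build the 32-bit encoding of BRK #imm16. */
uint32_t toy_brk_encode(uint16_t imm16)
{
    return TOY_BRK_BASE | ((uint32_t)imm16 << 5);
}

/*
 * Extract imm16 from a BRK instruction.
 * Returns 0 on success, -1 if insn is not a BRK encoding.
 */
int toy_brk_decode(uint32_t insn, uint16_t *imm16)
{
    assert(imm16 != NULL);
    /* Bits [31:21] and [4:0] are fixed for BRK. */
    if ((insn & 0xffe0001fu) != TOY_BRK_BASE)
        return -1;
    *imm16 = (uint16_t)((insn >> 5) & 0xffff);
    return 0;
}
```

Under this encoding brk #0x4 is 0xd4200080 and brk #0x6 is 0xd42000c0, while the original entry instruction 0xd10183ff is rejected by the decoder because it is not a BRK.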
So how is this replacement carried out? The process can be traced with gdb:
event_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,loff_t *ppos)
|--struct trace_event_file *file
|--kstrtoul_from_user(ubuf, cnt, 10, &val)
|--tracing_update_buffers()
| //Earlier, trace_create_file stored the trace_event_file in inode->i_private
|--file = event_file_data(filp)
|--ftrace_event_enable_disable(file, val)
| //Here enable is 1, as an example
|--__ftrace_event_enable_disable(file, enable, 0)
|--clear_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags)
|--trace_buffered_event_enable()
|--struct trace_event_call *call = file->event_call
| //call->class->reg(call, TRACE_REG_REGISTER, file); see register_kprobe_event
|--kprobe_register(struct trace_event_call *event,enum trace_reg type, void *data)
|--enable_trace_kprobe(event, file)
|--struct trace_probe *pos, *tp
|--tp = trace_probe_primary_from_call(call)
|-- trace_probe_add_file(tp, file);
| |--struct event_file_link *link
| |--link = kmalloc(sizeof(*link), GFP_KERNEL)
| |--list_add_tail_rcu(&link->list, &tp->event->files)
|--list_for_each_entry(pos, trace_probe_probe_list(tp), list)
tk = container_of(pos, struct trace_kprobe, tp)
if (trace_kprobe_has_gone(tk))
continue;
ret = __enable_trace_kprobe(tk)
if (ret) break;
enabled = true;
static inline int __enable_trace_kprobe(struct trace_kprobe *tk)
{
int ret = 0;
if (trace_kprobe_is_registered(tk) && !trace_kprobe_has_gone(tk)) {
if (trace_kprobe_is_return(tk))
ret = enable_kretprobe(&tk->rp);
else
ret = enable_kprobe(&tk->rp.kp);
}
return ret;
}
int enable_kprobe(struct kprobe *kp)
|--struct kprobe *p
|--p = __get_valid_kprobe(kp)
|--arm_kprobe(p)
| //Put a breakpoint for a probe
|--__arm_kprobe(kp);
| //arm kprobe: install breakpoint in text
|--arch_arm_kprobe(p)
|--void *addr = p->addr
|--u32 insn = BRK64_OPCODE_KPROBES;
|--aarch64_insn_patch_text(&addr, &insn, 1);
int __kprobes aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt)
{
struct aarch64_insn_patch patch = {
.text_addrs = addrs,
.new_insns = insns,
.insn_cnt = cnt,
.cpu_count = ATOMIC_INIT(0),
};
if (cnt <= 0)
return -EINVAL;
//stop_machine_cpuslocked invokes the aarch64_insn_patch_text_cb callback with &patch as its argument
return stop_machine_cpuslocked(aarch64_insn_patch_text_cb, &patch,
cpu_online_mask);
}
static int __kprobes aarch64_insn_patch_text_cb(void *arg)
{
int i, ret = 0;
struct aarch64_insn_patch *pp = arg;
/* The first CPU becomes master */
if (atomic_inc_return(&pp->cpu_count) == 1) {
for (i = 0; ret == 0 && i < pp->insn_cnt; i++)
ret = aarch64_insn_patch_text_nosync(pp->text_addrs[i],
pp->new_insns[i]);
/* Notify other processors with an additional increment. */
atomic_inc(&pp->cpu_count);
} else {
while (atomic_read(&pp->cpu_count) <= num_online_cpus())
cpu_relax();
isb();
}
return ret;
}
int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
{
u32 *tp = addr;
int ret;
/* A64 instructions must be word aligned */
if ((uintptr_t)tp & 0x3)
return -EINVAL;
ret = aarch64_insn_write(tp, insn);
if (ret == 0)
__flush_icache_range((uintptr_t)tp,
(uintptr_t)tp + AARCH64_INSN_SIZE);
return ret;
}
int __kprobes aarch64_insn_write(void *addr, u32 insn)
{
return __aarch64_insn_write(addr, cpu_to_le32(insn));
}
static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
{
void *waddr = addr;
unsigned long flags = 0;
int ret;
raw_spin_lock_irqsave(&patch_lock, flags);
waddr = patch_map(addr, FIX_TEXT_POKE0);
ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);
patch_unmap(FIX_TEXT_POKE0);
raw_spin_unlock_irqrestore(&patch_lock, flags);
return ret;
}
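5. Executing the kprobe hook function
Note: sections 5.2~5.8 below form a chain of nested calls; they are not sibling functions.
5.1 Initializing the breakpoint exception callbacks =>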
# arch/arm64/include/asm/debug-monitors.h
/* AArch64 */
#define DBG_ESR_EVT_HWBP 0x0
#define DBG_ESR_EVT_HWSS 0x1
#define DBG_ESR_EVT_HWWP 0x2
#define DBG_ESR_EVT_BRK 0x6
# arch/arm64/mm/fault.c
/*
* __refdata because early_brk64 is __init, but the reference to it is
* clobbered at arch_initcall time.
* See traps.c and debug-monitors.c:debug_traps_init().
*/
static struct fault_info __refdata debug_fault_info[] = {
{ do_bad, SIGTRAP, TRAP_HWBKPT, "hardware breakpoint" },
{ do_bad, SIGTRAP, TRAP_HWBKPT, "hardware single-step" },
{ do_bad, SIGTRAP, TRAP_HWBKPT, "hardware watchpoint" },
{ do_bad, SIGKILL, SI_KERNEL, "unknown 3" },
{ do_bad, SIGTRAP, TRAP_BRKPT, "aarch32 BKPT" },
{ do_bad, SIGKILL, SI_KERNEL, "aarch32 vector catch" },
{ early_brk64, SIGTRAP, TRAP_BRKPT, "aarch64 BRK" },
{ do_bad, SIGKILL, SI_KERNEL, "unknown 7" },
};
void __init hook_debug_fault_code(int nr,
int (*fn)(unsigned long, unsigned int, struct pt_regs *),
int sig, int code, const char *name)
{
BUG_ON(nr < 0 || nr >= ARRAY_SIZE(debug_fault_info));
debug_fault_info[nr].fn = fn;
debug_fault_info[nr].sig = sig;
debug_fault_info[nr].code = code;
debug_fault_info[nr].name = name;
}
#arch/arm64/kernel/debug-monitors.c
void __init debug_traps_init(void)
{
hook_debug_fault_code(DBG_ESR_EVT_HWSS, single_step_handler, SIGTRAP,
TRAP_TRACE, "single-step handler");
hook_debug_fault_code(DBG_ESR_EVT_BRK, brk_handler, SIGTRAP,
TRAP_BRKPT, "BRK handler");
}
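hook_debug_fault_code installs brk_handler at run time as the hook for this exception; it is called from the breakpoint exception handler.
Now let us follow the kprobe execution flow. brk #0x4 jumps to the sync exception handling in arch/arm64/kernel/entry.S.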
5.2 brk #0x4 =>
//Grow the stack by 0x150 bytes; sp now holds the address of the top of the new frame
0xffff800010010a00 <vectors+512> sub sp, sp, #0x150
0xffff800010010a04 <vectors+516> add sp, sp, x0
0xffff800010010a08 <vectors+520> sub x0, sp, x0
0xffff800010010a0c <vectors+524> tbnz w0, #14, 0xffff800010010a1c <vectors+540>
0xffff800010010a10 <vectors+528> sub x0, sp, x0
0xffff800010010a14 <vectors+532> sub sp, sp, x0
0xffff800010010a18 <vectors+536> b 0xffff800010011940 <el1_sync>
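Once the brk exception is raised, different callbacks may run; the immediate #0x4 after brk decides which callback of the breakpoint exception handler is called.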
5.3 el1_sync =>
# arch/arm64/kernel/entry.S
SYM_CODE_START(vectors)
kernel_ventry 1, sync_invalid // Synchronous EL1t
kernel_ventry 1, irq_invalid // IRQ EL1t
kernel_ventry 1, fiq_invalid // FIQ EL1t
kernel_ventry 1, error_invalid // Error EL1t
//The kprobe breakpoint instruction jumps here
kernel_ventry 1, sync // Synchronous EL1h
kernel_ventry 1, irq // IRQ EL1h
kernel_ventry 1, fiq_invalid // FIQ EL1h
kernel_ventry 1, error // Error EL1h
kernel_ventry 0, sync // Synchronous 64-bit EL0
kernel_ventry 0, irq // IRQ 64-bit EL0
kernel_ventry 0, fiq_invalid // FIQ 64-bit EL0
kernel_ventry 0, error // Error 64-bit EL0
#ifdef CONFIG_COMPAT
kernel_ventry 0, sync_compat, 32 // Synchronous 32-bit EL0
kernel_ventry 0, irq_compat, 32 // IRQ 32-bit EL0
kernel_ventry 0, fiq_invalid_compat, 32 // FIQ 32-bit EL0
kernel_ventry 0, error_compat, 32 // Error 32-bit EL0
#else
kernel_ventry 0, sync_invalid, 32 // Synchronous 32-bit EL0
kernel_ventry 0, irq_invalid, 32 // IRQ 32-bit EL0
kernel_ventry 0, fiq_invalid, 32 // FIQ 32-bit EL0
kernel_ventry 0, error_invalid, 32 // Error 32-bit EL0
#endif
SYM_CODE_END(vectors)
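brk #0x4 triggers arm64 exception handling. On exception entry, control jumps to the sync exception handling in arch/arm64/kernel/entry.S, which in this case branches to el1_sync.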
SYM_CODE_START_LOCAL_NOALIGN(el1_sync)
kernel_entry 1
//As kernel_entry shows, x0 points to the saved general-purpose registers
mov x0, sp
bl el1_sync_handler
kernel_exit 1
SYM_CODE_END(el1_sync)
The disassembly of kernel_entry looks like this:
//Save general-purpose registers x0~x29
0xffff800010011940 <el1_sync>: stp x0, x1, [sp]
0xffff800010011944 <el1_sync+4>: stp x2, x3, [sp,#16]
0xffff800010011948 <el1_sync+8>: stp x4, x5, [sp,#32]
0xffff80001001194c <el1_sync+12>: stp x6, x7, [sp,#48]
0xffff800010011950 <el1_sync+16>: stp x8, x9, [sp,#64]
0xffff800010011954 <el1_sync+20>: stp x10, x11, [sp,#80]
0xffff800010011958 <el1_sync+24>: stp x12, x13, [sp,#96]
0xffff80001001195c <el1_sync+28>: stp x14, x15, [sp,#112]
0xffff800010011960 <el1_sync+32>: stp x16, x17, [sp,#128]
0xffff800010011964 <el1_sync+36>: stp x18, x19, [sp,#144]
0xffff800010011968 <el1_sync+40>: stp x20, x21, [sp,#160]
0xffff80001001196c <el1_sync+44>: stp x22, x23, [sp,#176]
0xffff800010011970 <el1_sync+48>: stp x24, x25, [sp,#192]
0xffff800010011974 <el1_sync+52>: stp x26, x27, [sp,#208]
0xffff800010011978 <el1_sync+56>: stp x28, x29, [sp,#224]
0xffff80001001197c <el1_sync+60>: add x21, sp, #0x150
//x28 holds the pointer to the current task descriptor
0xffff800010011980 <el1_sync+64>: mrs x28, sp_el0
0xffff800010011984 <el1_sync+68>: ldr x20, [x28,#8]
0xffff800010011988 <el1_sync+72>: str x20, [sp,#288]
0xffff80001001198c <el1_sync+76>: mov x20, #0xffffffffffff // #281474976710655
0xffff800010011990 <el1_sync+80>: str x20, [x28,#8]
0xffff800010011994 <el1_sync+84>: mrs x22, elr_el1
0xffff800010011998 <el1_sync+88>: mrs x23, spsr_el1
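//Push lr (x30) together with the pre-exception sp (x21)
0xffff80001001199c <el1_sync+92>: stp x30, x21, [sp,#240]
//Push elr (x22) together with x29 as a frame record
0xffff8000100119a0 <el1_sync+96>: stp x29, x22, [sp,#304]
//x29 points to that frame record
0xffff8000100119a4 <el1_sync+100>: add x29, sp, #0x130
//Push elr (x22) and spsr (x23)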
0xffff8000100119a8 <el1_sync+104>: stp x22, x23, [sp,#256]
0xffff8000100119ac <el1_sync+108>: nop
0xffff8000100119b0 <el1_sync+112>: nop
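A kernel_entry macro argument of 1 means the context of an exception taken from EL1 is saved; 0 means the context of an exception taken from EL0 is saved. The analysis of kernel_entry above shows that the context of the task that took the exception is saved on that task's kernel stack. This context mainly consists of the stack frame, PSTATE, LR, SP, and the general-purpose registers X0~X29. Control then jumps to el1_sync_handler.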
5.4 el1_sync_handler =>
asmlinkage void noinstr el1_sync_handler(struct pt_regs *regs)
{
unsigned long esr = read_sysreg(esr_el1);
//The exception class can be determined from esr
switch (ESR_ELx_EC(esr)) {
case ESR_ELx_EC_DABT_CUR:
case ESR_ELx_EC_IABT_CUR:
el1_abort(regs, esr);
break;
/*
* We don't handle ESR_ELx_EC_SP_ALIGN, since we will have hit a
* recursive exception when trying to push the initial pt_regs.
*/
case ESR_ELx_EC_PC_ALIGN:
el1_pc(regs, esr);
break;
case ESR_ELx_EC_SYS64:
case ESR_ELx_EC_UNKNOWN:
el1_undef(regs);
break;
case ESR_ELx_EC_BREAKPT_CUR:
case ESR_ELx_EC_SOFTSTP_CUR:
case ESR_ELx_EC_WATCHPT_CUR:
case ESR_ELx_EC_BRK64:
el1_dbg(regs, esr);
break;
case ESR_ELx_EC_FPAC:
el1_fpac(regs, esr);
break;
default:
el1_inv(regs, esr);
}
}
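esr_el1 is the arm64 Exception Syndrome Register. Bits 31-26 hold the exception class (EC) and bits 24-0 hold the instruction specific syndrome (ISS); each exception class has its own ISS layout. Here, when the kprobe's brk instruction raises the exception, ESR_EL1 is 0xf2000004. Because kprobe inserted a brk instruction, EC is 0x3c (ESR_ELx_EC_BRK64), so execution continues in el1_dbg.
For the ESR_EL1 register definition, see the ARMv8 ARM.
Each exception class EC has its own ISS layout. For the BRK debug exception, the low 16 bits of the ISS hold the Comment field, and different Comment values further distinguish different debug exceptions.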
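The field extraction for the value 0xf2000004 can be verified with a few shifts and masks. The helpers below are a user-space sketch with hypothetical toy_* names; the bit positions follow the description above (EC in bits 31-26, ISS in bits 24-0, Comment in the low 16 bits of the ISS for a BRK exception), plus bit 25 as the instruction length flag.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_EC_BRK64 0x3cu  /* exception class of an AArch64 BRK */

/* Exception class: bits [31:26]. */
uint32_t toy_esr_ec(uint32_t esr)  { return (esr >> 26) & 0x3f; }

/* Instruction length flag: bit [25]. */
uint32_t toy_esr_il(uint32_t esr)  { return (esr >> 25) & 0x1; }

/* Instruction specific syndrome: bits [24:0]. */
uint32_t toy_esr_iss(uint32_t esr) { return esr & 0x1ffffff; }

/* Comment field of a BRK exception: bits [15:0] of the ISS. */
uint32_t toy_esr_brk_comment(uint32_t esr)
{
    /* Only meaningful when the exception class is BRK64. */
    assert(toy_esr_ec(esr) == TOY_EC_BRK64);
    return esr & 0xffff;
}
```

For 0xf2000004 these helpers give EC = 0x3c, IL = 1, ISS = 0x4, and Comment = 0x4, which matches the values discussed in this section.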
5.5 el1_dbg=>
//esr Holds syndrome information for an exception taken to EL1
static void noinstr el1_dbg(struct pt_regs *regs, unsigned long esr)
{
//far is the Fault Address Register; it is passed on as addr_if_watchpoint, and brk_handler ignores it
unsigned long far = read_sysreg(far_el1);
/*
* The CPU masked interrupts, and we are leaving them masked during
* do_debug_exception(). Update PMR as if we had called
* local_daif_mask().
*/
if (system_uses_irq_prio_masking())
gic_write_pmr(GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET);
arm64_enter_el1_dbg(regs);
do_debug_exception(far, esr, regs);
arm64_exit_el1_dbg(regs);
}
el1_dbg calls do_debug_exception to handle the debug exception.
void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
struct pt_regs *regs)
{
const struct fault_info *inf = esr_to_debug_fault_info(esr);
unsigned long pc = instruction_pointer(regs);
if (cortex_a76_erratum_1463225_debug_handler(regs))
return;
debug_exception_enter(regs);
if (user_mode(regs) && !is_ttbr0_addr(pc))
arm64_apply_bp_hardening();
//From debug_traps_init at initialization and the value of esr_el1, inf->fn is brk_handler
if (inf->fn(addr_if_watchpoint, esr, regs)) {
arm64_notify_die(inf->name, regs,
inf->sig, inf->code, (void __user *)pc, esr);
}
debug_exception_exit(regs);
}
NOKPROBE_SYMBOL(do_debug_exception);
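Bits 27~29 of esr_el1 indicate the debug exception type and serve as the index into the debug_fault_info array. Here the debug exception type is 0x6, i.e. DBG_ESR_EVT_BRK, and the initialization function debug_traps_init shows that inf->fn is brk_handler. The addr_if_watchpoint argument comes from far_el1 and is not used by brk_handler. Inspecting the pt_regs structure in gdb shows that the pc variable in this function is the address of the brk 0x4 instruction inserted into blk_update_request.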
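The index computation can be reproduced in user space. In the sketch below, toy_debug_fault_index and toy_debug_fault_name are hypothetical helpers; the name table simply repeats the strings from the debug_fault_info initializer shown in section 5.1, before debug_traps_init overrides two of the entries.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bits [29:27] of the syndrome select the debug_fault_info entry. */
unsigned int toy_debug_fault_index(uint32_t esr)
{
    return (esr >> 27) & 0x7;
}

/* Initial names of the eight entries, copied from the table in 5.1. */
const char *toy_debug_fault_name(uint32_t esr)
{
    static const char *const names[8] = {
        "hardware breakpoint", "hardware single-step",
        "hardware watchpoint", "unknown 3",
        "aarch32 BKPT",        "aarch32 vector catch",
        "aarch64 BRK",         "unknown 7",
    };
    return names[toy_debug_fault_index(esr)];
}
```

For 0xf2000004 the index is 6, the "aarch64 BRK" slot, which is the slot whose handler debug_traps_init replaces with brk_handler.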
static int brk_handler(unsigned long unused, unsigned int esr,
struct pt_regs *regs)
{
if (call_break_hook(regs, esr) == DBG_HOOK_HANDLED)
return 0;
if (user_mode(regs)) {
send_user_sigtrap(TRAP_BRKPT);
} else {
pr_warn("Unexpected kernel BRK exception at EL1\n");
return -EFAULT;
}
return 0;
}
NOKPROBE_SYMBOL(brk_handler);
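brk_handler calls call_break_hook, which invokes a different hook depending on the specific kind of breakpoint exception. The distinction is made by ESR_EL1.ISS.Comment, so each ESR_EL1.ISS.Comment value corresponds to its own hook.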
5.6 call_break_hook=>
static int call_break_hook(struct pt_regs *regs, unsigned int esr)
{
struct break_hook *hook;
struct list_head *list;
int (*fn)(struct pt_regs *regs, unsigned int esr) = NULL;
//user_mode(regs) shows the exception was taken in EL1, so list is kernel_break_hook
list = user_mode(regs) ? &user_break_hook : &kernel_break_hook;
/*
* Since brk exception disables interrupt, this function is
* entirely not preemptible, and we can use rcu list safely here.
*/
list_for_each_entry_rcu(hook, list, node) {
//#define ESR_ELx_BRK64_ISS_COMMENT_MASK 0xffff
unsigned int comment = esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
if ((comment & ~hook->mask) == hook->imm)
fn = hook->fn;
}
return fn ? fn(regs, esr) : DBG_HOOK_ERROR;
}
NOKPROBE_SYMBOL(call_break_hook);
Because the breakpoint exception was taken in EL1, list is set to kernel_break_hook. The initialization code below shows how kernel_break_hook is populated: at init time, register_kernel_break_hook adds hooks to the kernel_break_hook list, including kprobes_break_hook and kprobes_break_ss_hook. list_for_each_entry_rcu(hook, list, node) walks the kernel_break_hook list and finds the hook matching the debug breakpoint type, which is determined by esr_el1.ISS.Comment. Here esr_el1 is 0xf2000004, so esr_el1.ISS is 0x004 and Comment is 0x004, and the kprobes_break_hook.fn callback kprobe_breakpoint_handler is called.
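The matching rule (comment & ~mask) == imm can be tried out with a small user-space table. This is a sketch under assumptions: struct toy_break_hook and toy_match_hook are hypothetical, an array replaces the RCU list, a name string replaces the fn pointer, and the masked range entry in the usage is invented purely to show how mask lets one hook cover a range of immediates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct break_hook. */
struct toy_break_hook {
    uint16_t imm;      /* immediate the hook is registered for */
    uint16_t mask;     /* bits ignored when comparing with imm */
    const char *name;  /* stands in for the fn pointer         */
};

/*
 * Walk all hooks and return the name of the last one whose immediate
 * matches the Comment field of esr, or NULL if none matches.
 */
const char *toy_match_hook(const struct toy_break_hook *hooks, size_t n,
                           uint32_t esr)
{
    uint32_t comment = esr & 0xffff;
    const char *found = NULL;
    size_t i;

    assert(hooks != NULL);
    for (i = 0; i < n; i++) {
        if ((comment & ~(uint32_t)hooks[i].mask) == hooks[i].imm)
            found = hooks[i].name;
    }
    return found;
}
```

With hooks for 0x004 and 0x006 registered, a syndrome of 0xf2000004 selects the first and 0xf2000006 the second, mirroring how brk #0x4 and brk #0x6 reach different kprobe handlers.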
# arch/arm64/include/asm/brk-imm.h
/*
* #imm16 values used for BRK instruction generation
* 0x004: for installing kprobes
* 0x005: for installing uprobes
* 0x006: for kprobe software single-step
* Allowed values for kgdb are 0x400 - 0x7ff
* 0x100: for triggering a fault on purpose (reserved)
* 0x400: for dynamic BRK instruction
* 0x401: for compile time BRK instruction
* 0x800: kernel-mode BUG() and WARN() traps
* 0x9xx: tag-based KASAN trap (allowed values 0x900 - 0x9ff)
*/
#define KPROBES_BRK_IMM 0x004
#define UPROBES_BRK_IMM 0x005
#define KPROBES_BRK_SS_IMM 0x006
#define FAULT_BRK_IMM 0x100
#define KGDB_DYN_DBG_BRK_IMM 0x400
#define KGDB_COMPILED_DBG_BRK_IMM 0x401
#define BUG_BRK_IMM 0x800
#define KASAN_BRK_IMM 0x900
#define KASAN_BRK_MASK
# arch/arm64/kernel/debug-monitors.c
static LIST_HEAD(kernel_break_hook);
static struct break_hook kprobes_break_hook = {
.imm = KPROBES_BRK_IMM,
.fn = kprobe_breakpoint_handler,
};
void register_kernel_break_hook(struct break_hook *hook)
{
register_debug_hook(&hook->node, &kernel_break_hook);
}
int __init arch_init_kprobes(void)
{
register_kernel_break_hook(&kprobes_break_hook);
register_kernel_break_hook(&kprobes_break_ss_hook);
return 0;
}
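To summarize briefly:
- ESR_EL1.EC indicates the exception class. Here ESR_EL1.EC is 0x3c, which is ESR_ELx_EC_BRK64, the breakpoint exception class.
- Bits 27~29 of ESR_EL1 (within the EC field) further indicate the debug exception type, one of:
DBG_ESR_EVT_HWBP
DBG_ESR_EVT_HWSS
DBG_ESR_EVT_HWWP
DBG_ESR_EVT_BRK
The debug_fault_info array holds one entry per debug exception type, and bits 27~29 of ESR_EL1 are the index into that array. Here the type is 0x6, i.e. DBG_ESR_EVT_BRK, and the initialization function debug_traps_init shows that inf->fn is brk_handler.
- ESR_EL1.ISS.Comment further selects the specific breakpoint hook. Here esr_el1.ISS is 0x004 and Comment is 0x004, so the kprobes_break_hook.fn callback, kprobe_breakpoint_handler, is called. kprobe_breakpoint_handler looks like this: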
kprobe_breakpoint_handler(struct pt_regs *regs, unsigned int esr)
{
kprobe_handler(regs);
return DBG_HOOK_HANDLED;
}
5.7 kprobe_handler
static void __kprobes kprobe_handler(struct pt_regs *regs)
|--struct kprobe *p, *cur_kprobe;
| struct kprobe_ctlblk *kcb;
| //Get the pc of the probed location
|--unsigned long addr = instruction_pointer(regs);
| //addr is the entry address of blk_update_request and serves as the hash table key
|--p = get_kprobe((kprobe_opcode_t *) addr);
\--if (p)
if (!p->pre_handler || !p->pre_handler(p, regs))
setup_singlestep(p, regs, kcb, 0);
On entering kprobe_handler, the probe address is used as the hash key and get_kprobe returns the kprobe shown below. This is the kprobe registered by the following command:
ubuntu@VM-0-9-ubuntu:~$ echo 'p:blk_update blk_update_request request=$arg1 status=$arg2:u8 bytes=$arg3:u32' > /sys/kernel/debug/tracing/kprobe_events
We can see that p->pre_handler is kprobe_dispatcher:
(gdb) p *(struct kprobe *)0xffff0000070b5618
$7 = {
hlist = {
next = 0x0,
pprev = 0xffff80001203a410 <kprobe_table+208>
},
list = {
next = 0xffff0000070b5628,
prev = 0xffff0000070b5628
},
nmissed = 0,
addr = 0xffff8000104ec1f0 <blk_update_request>,
symbol_name = 0xffff00000758b200 "blk_update_request",
offset = 0,
pre_handler = 0xffff8000101b1354 <kprobe_dispatcher>,
post_handler = 0x0,
fault_handler = 0x0,
opcode = 3506537471,
ainsn = {
api = {
insn = 0xffff800012533000,
pstate_cc = 0x0,
handler = 0x0,
restore = 18446603336494793204
}
},
flags = 0
}
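The decimal opcode value in this dump can be checked against the disassembly. The following user-space sketch decodes an A64 ADD/SUB (immediate) instruction; struct toy_addsub_imm and toy_decode_addsub_imm are hypothetical names, and only the fields needed for this check are extracted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the decoded fields. */
struct toy_addsub_imm {
    int is_64bit;     /* sf bit [31]                        */
    int is_sub;       /* op bit [30]: 1 = SUB, 0 = ADD      */
    unsigned int rn;  /* source register number, bits [9:5] */
    unsigned int rd;  /* destination register, bits [4:0]   */
    uint32_t imm;     /* imm12, shifted left by 12 if sh=1  */
};

/* Returns 0 on success, -1 if insn is not ADD/SUB (immediate). */
int toy_decode_addsub_imm(uint32_t insn, struct toy_addsub_imm *out)
{
    assert(out != NULL);
    /* Bits [28:23] == 100010 identify ADD/SUB (immediate). */
    if (((insn >> 23) & 0x3f) != 0x22)
        return -1;
    out->is_64bit = (int)((insn >> 31) & 1);
    out->is_sub = (int)((insn >> 30) & 1);
    out->imm = (insn >> 10) & 0xfff;
    if ((insn >> 22) & 1)
        out->imm <<= 12;
    out->rn = (insn >> 5) & 0x1f;
    out->rd = insn & 0x1f;
    return 0;
}
```

The value 3506537471 is 0xd10183ff, which decodes to a 64-bit SUB with immediate 0x60 and register number 31 (SP in this encoding) for both source and destination, i.e. sub sp, sp, #0x60, the original entry instruction saved in p->opcode.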
5.7.1 kprobe_dispatcher
kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs)
| //Get the trace_kprobe from the current kprobe
|--struct trace_kprobe *tk = container_of(kp, struct trace_kprobe, rp.kp);
|--if (trace_probe_test_flag(&tk->tp, TP_FLAG_TRACE))
| kprobe_trace_func(tk, regs)
#ifdef CONFIG_PERF_EVENTS
\--if (trace_probe_test_flag(&tk->tp, TP_FLAG_PROFILE))
ret = kprobe_perf_func(tk, regs);
#endif
kprobe_trace_func(tk, regs)
|--struct event_file_link *link;
\--trace_probe_for_each_link_rcu(link, &tk->tp)
__kprobe_trace_func(tk, regs, link->file)
|--struct kprobe_trace_entry_head *entry;
| struct trace_event_call *call = trace_probe_event_call(&tk->tp);
| struct trace_event_buffer fbuffer
|--if (trace_trigger_soft_disabled(trace_file))
| return;
|--fbuffer.pc = preempt_count();
|--fbuffer.trace_file = trace_file;
| fbuffer.event = trace_event_buffer_lock_reserve(&fbuffer.buffer, trace_file,
| call->event.type,
| sizeof(*entry) + tk->tp.size + dsize,
| fbuffer.flags, fbuffer.pc);
| fbuffer.regs = regs;
| entry = fbuffer.entry = ring_buffer_event_data(fbuffer.event);
| entry->ip = (unsigned long)tk->rp.kp.addr;
|--store_trace_args(&entry[1], &tk->tp, regs, sizeof(*entry), dsize);
| //Commit the kprobe record to the trace buffer
\--trace_event_buffer_commit(&fbuffer);
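- trace_probe_for_each_link_rcu: as described for enable_trace_kprobe above, an event_file_link is created for the enable file node and linked into the files list of trace_probe_event. Here trace_probe_for_each_link_rcu walks that list and calls __kprobe_trace_func(tk, regs, link->file). What __kprobe_trace_func does is exactly what a trace event's probe does. This shows where kprobe and trace event differ: the way the probe callback is triggered. A kprobe triggers its trace event probe callback from the breakpoint exception, while a trace event triggers the probe callback at a fixed location in a function. In addition, a kprobe's argument output format is set and parsed dynamically, while a trace event's format is defined statically.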
5.7.2 setup_singlestep
setup_singlestep(p, regs, kcb, 0)
|--unsigned long slot;
|--kcb->kprobe_status = KPROBE_HIT_SS;
|--if (p->ainsn.api.insn)
// slot holds a copy of blk_update_request's entry instruction: sub sp, sp, #0x60
slot = (unsigned long)p->ainsn.api.insn;
set_ss_context(kcb, slot);
|--kcb->ss_ctx.ss_pending = true;
| // match_addr is the address of the second word of the slot, which holds brk #0x6
|--kcb->ss_ctx.match_addr = addr + sizeof(kprobe_opcode_t);
kprobes_save_local_irqflag(kcb, regs);
instruction_pointer_set(regs, slot);
| // sets regs->pc to val; here val is slot, whose instruction is sub sp, sp, #0x60
|--regs->pc = val
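instruction_pointer_set sets the pc at which execution resumes when the breakpoint exception returns: the original entry instruction of blk_update_request. So once the breakpoint exception returns, the original entry instruction runs (note that it runs from a different address, p->ainsn.api.insn, not from its original location). Because the slot also holds a breakpoint instruction, brk #0x6, right after it, execution then hits brk #0x6. The slot area looks like this: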
0xffff800012533000 sub sp, sp, #0x60
0xffff800012533004 brk #0x6
0xffff800012533008 stp x29, x30, [sp,#-32]!
0xffff80001253300c brk #0x6
0xffff800012533010 .inst 0x00000000 ; undefined
......
5.8 brk #0x6=>
#arch/arm64/kernel/probes/kprobes.c
static struct break_hook kprobes_break_ss_hook = {
.imm = KPROBES_BRK_SS_IMM,
.fn = kprobe_breakpoint_ss_handler,
};
int __init arch_init_kprobes(void)
{
register_kernel_break_hook(&kprobes_break_hook);
register_kernel_break_hook(&kprobes_break_ss_hook);
return 0;
}
As with brk #0x4 earlier, arch_init_kprobes registers kprobes_break_ss_hook, which defines kprobe_breakpoint_ss_handler as the callback for the single-step breakpoint:
static int __kprobes
kprobe_breakpoint_ss_handler(struct pt_regs *regs, unsigned int esr)
{
struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
int retval;
/* return error if this is not our step */
retval = kprobe_ss_hit(kcb, instruction_pointer(regs));
if (retval == DBG_HOOK_HANDLED) {
kprobes_restore_local_irqflag(kcb, regs);
post_kprobe_handler(kcb, regs);
}
return retval;
}
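When the brk #0x6 breakpoint exception fires, it follows this exception handling path:
brk #0x6 => el1_sync => el1_sync_handler => el1_dbg => call_break_hook
call_break_hook walks the registered breakpoint hooks and picks the one whose imm matches. Here the imm is 0x6 (KPROBES_BRK_SS_IMM), so the single-step breakpoint callback kprobe_breakpoint_ss_handler runs, and it calls post_kprobe_handler: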
post_kprobe_handler(struct kprobe_ctlblk *kcb, struct pt_regs *regs)
|--struct kprobe *cur = kprobe_running();
|--if (cur->ainsn.api.restore != 0)
| // restore pc from cur->ainsn.api.restore
| instruction_pointer_set(regs, cur->ainsn.api.restore)
| |--regs->pc = val;
|--if (kcb->kprobe_status == KPROBE_REENTER)
restore_previous_kprobe(kcb);
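instruction_pointer_set restores pc from cur->ainsn.api.restore, which was initialized in advance during register_kprobe. It is the address of the instruction that follows blk_update_request's entry instruction: stp x29, x30, [sp,#16]. In the struct kprobe dump, restore is 18446603336494793204, that is 0xffff8000104ec1f4, which equals addr + 4. So when the brk exception returns, execution continues from the second instruction of blk_update_request.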
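6. Summary
To recap, the kprobe workflow is:
- Register the kprobe
This is done mainly by writing a command to the /sys/kernel/debug/tracing/kprobe_events node. This step:
(1) registers the kprobe; the most important part is initializing the pre_handler callback, which is called from the brk #0x4 breakpoint handler and does the kprobe's main work;
(2) saves the original instruction at the probe point of the probed function, followed by a brk #0x6 breakpoint instruction, into a slot; when the brk #0x4 that replaced the probe point returns, the code in this slot runs first;
(3) records the address of the instruction after the probe point; execution resumes there on return from brk #0x6, which restores the original instruction flow.
- Insert the breakpoint instruction
This is done mainly by echo 1 > /sys/kernel/debug/tracing/events/kprobes/blk_update/enable. It replaces the instruction at the probe point of the probed function with brk #0x4.
Note: brk #0x4 and brk #0x6 are served by different breakpoint handler callbacks.
- Run the kprobe callback
When execution reaches the probe point, the brk instruction raises a breakpoint exception. Based on the immediate 0x4, the regular breakpoint callback runs and eventually calls pre_handler, which does the kprobe's work. Execution then continues with the instructions in the slot prepared in step 1. The first instruction in the slot is the original instruction of the probed function; after it comes brk #0x6, which traps into the breakpoint exception again. This time, based on the immediate 0x6, the single-step breakpoint handler runs and sets PC to the address recorded in step 1 (3). When brk #0x6 returns, execution continues with the instructions after the probe point, and the normal instruction flow is restored.
From the analysis above, kprobe is built on trace event. The difference is that a kprobe triggers its trace event probe callback from a breakpoint exception, whereas a trace event triggers the probe callback from a fixed location in the function. In addition, a kprobe's argument output format is set and parsed dynamically, while a trace event's format is defined statically.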
The resulting trace output:
/ # cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 6/6 #P:2
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
ksoftirqd/0-9 [000] d.s1 37.846957: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006ce2140 status=0 bytes=1024
ksoftirqd/0-9 [000] d.s1 41.417047: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006c44280 status=0 bytes=3072
<idle>-0 [000] d.s2 41.419396: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006ad8000 status=0 bytes=0
<idle>-0 [000] d.s3 41.419896: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006c463c0 status=0 bytes=1024
<idle>-0 [000] d.s2 41.421701: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006ad8000 status=0 bytes=0
<idle>-0 [000] d.s3 41.421729: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006c463c0 status=0 bytes=0
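References
Kernel Debugging and Tracing Techniques: Kprobe on ARM64 (Kernel调试追踪技术之 Kprobe on ARM64)
https://blog.csdn.net/whatday/article/details/100511447
Linux TraceEvent: The Longest Macro Definition I Have Ever Seen (Linux TraceEvent - 我见过的史上最长宏定义)
kprobe Internals Explained, Part 1 (kprobe原理解析(一))
kprobe Internals Explained, Part 2 (kprobe原理解析(二))
Appendix
Main data structures: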
struct kprobe {
struct hlist_node hlist;
/* list of kprobes for multi-handler support */
struct list_head list;
/*count the number of times this probe was temporarily disarmed */
unsigned long nmissed;
/* location of the probe point */
kprobe_opcode_t *addr;
/* Allow user to indicate symbol name of the probe point */
const char *symbol_name;
/* Offset into the symbol */
unsigned int offset;
/* Called before addr is executed. */
kprobe_pre_handler_t pre_handler;
/* Called after addr is executed, unless... */
kprobe_post_handler_t post_handler;
/*
* ... called if executing addr causes a fault (eg. page fault).
* Return 1 if it handled fault, otherwise kernel will see it.
*/
kprobe_fault_handler_t fault_handler;
/* Saved opcode (which has been replaced with breakpoint) */
kprobe_opcode_t opcode;
/* copy of the original instruction */
struct arch_specific_insn ainsn;
/*
* Indicates various status flags.
* Protected by kprobe_mutex after this kprobe is registered.
*/
u32 flags;
};