I. Introduction
1. Android has fully supported eBPF since version 9.0, using it mainly for network traffic statistics. Beyond that, eBPF can be combined with kernel facilities such as kprobes/tracepoints/socket filters to hook kernel events and monitor the corresponding system state.
II. bpf service startup and program loading
1. Android provides several wrapper libraries for eBPF and a dedicated loader, bpfloader. The main modules are:
(1) bpfloader: [/system/bin/bpfloader] loads the eBPF object files under /system/etc/bpf at system startup.
(2) libbpf_android: [system/bpf/libbpf_android] builds libbpf_android.so, which provides the interfaces for creating bpf map containers and loading bpf object files.
(3) libbpf: [external/libbpf] builds libbpf_minimal.so, which wraps the bpf system call and provides bpf-operation APIs such as attach/detach.
2. The overall flow: after init finishes filesystem initialization during boot, it starts the bpfloader service, which scans and parses the ELF-format bpf object files /system/etc/bpf/*.o, reading and validating the critical, license, and bpfloader* sections;
it then reads all bpf prog/map data from the progs and maps sections and pins each one to a file node under /sys/fs/bpf/ named after the file plus the symbol, completing the bpf load.
//system/core/rootdir/init.rc
on late-init
trigger load_bpf_programs
//system/bpf/bpfloader/bpfloader.rc
on load_bpf_programs
write /proc/sys/kernel/unprivileged_bpf_disabled 0
write /proc/sys/net/core/bpf_jit_enable 1 //enable JIT compilation
write /proc/sys/net/core/bpf_jit_kallsyms 1
exec_start bpfloader //start bpfloader
service bpfloader /system/bin/bpfloader
capabilities CHOWN SYS_ADMIN NET_ADMIN
rlimit memlock 1073741824 1073741824
oneshot
reboot_on_failure reboot,bpfloader-failed
updatable
bpfloader is implemented in system/bpf/bpfloader/BpfLoader.cpp; its main logic:
const Location locations[] = {
...
{
.dir = "/apex/com.android.tethering/etc/bpf/net_private/", //load bpf .o ELF files from this path
.prefix = "net_private/", //pinned prog/map file nodes created from them go under /sys/fs/bpf/net_private/
.allowedDomainBitmask = kTetheringApexDomainBitmask,
},
// Core operating system
{
.dir = "/system/etc/bpf/",
.prefix = "",
.allowedDomainBitmask = domainToBitmask(domain::platform),
},
...
};
int main(int argc, char** argv) {
/* Create every subdirectory needed under /sys/fs/bpf */
for (const auto& location : locations) {
createSysFsBpfSubDir(location.prefix);
}
// Load all ELF objects, create programs and maps, and pin them
/*
 * Load all .o bpf ELF files: read the "critical" and "license" sections,
 * check the bpfloader version number, and verify that the recorded map and
 * prog struct sizes match. Then read the bpf code from the "progs" section
 * and load it into the kernel via the bpf() system call. Each program is
 * finally pinned to /sys/fs/bpf/<prefix/>prog_<filename>_<progname>.
 */
for (const auto& location : locations) {
loadAllElfObjects(location);
}
/*
 * Each map defined in the code is likewise loaded into the kernel via the
 * bpf() system call, and the kernel pins it to
 * /sys/fs/bpf/<prefix/>map_<filename>_<mapname>.
 */
int key = 1, value = 123;
android::base::unique_fd map(android::bpf::createMap(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 2, 0));
android::bpf::writeToMapEntry(map, &key, &value, BPF_ANY);
android::base::SetProperty("bpf.progs_loaded", "1");
return 0;
}
III. bpf programs
1. As an example, see the source behind the ELF file gpu_mem.o,
frameworks/native/services/gpuservice/bpfprogs/gpu_mem.c:
#include <bpf_helpers.h>
#define GPU_MEM_TOTAL_MAP_SIZE 1024
/*
 * The map is named gpu_mem_total_map and is a HASH-type map that can hold up to
 * GPU_MEM_TOTAL_MAP_SIZE entries. Next come the key and value types; the group is
 * set to GRAPHICS, so only graphics-related services can access it. Userspace then
 * reads or modifies the kernel-side data through the pinned file bound to this map,
 * /sys/fs/bpf/map_gpu_mem_gpu_mem_total_map.
 */
DEFINE_BPF_MAP_GRO(gpu_mem_total_map, HASH, uint64_t, uint64_t, GPU_MEM_TOTAL_MAP_SIZE, AID_GRAPHICS);
/* This struct must match the layout of the gpu_mem_total tracepoint exactly */
struct gpu_mem_total_args {
/* Common tracepoint fields, not user-defined */
uint64_t ignore;
/* User-defined fields start at offset 8 bytes */
uint32_t gpu_id;
uint32_t pid;
uint64_t size;
};
/*
 * This macro declares the prog as a tracepoint program and names the tracepoint
 * it targets; the string is also the name of the ELF section the generated code
 * is placed in. The remaining arguments are the owning uid and gid, then the
 * function name. The macro expands to a function that writes into the map
 * whenever the tracepoint fires. The kernel binds this code to
 * /sys/fs/bpf/prog_gpu_mem_tracepoint_gpu_mem_gpu_mem_total, through which it
 * can later be retrieved.
 */
DEFINE_BPF_PROG("tracepoint/gpu_mem/gpu_mem_total", AID_ROOT, AID_GRAPHICS, tp_gpu_mem_total)
(struct gpu_mem_total_args* args) {
uint64_t key = 0;
uint64_t cur_val = 0;
uint64_t* prev_val = NULL;
/* args is the tracepoint context, laid out as struct gpu_mem_total_args */
/* The upper 32 bits are for gpu_id while the lower is the pid */
key = ((uint64_t)args->gpu_id << 32) | args->pid;
/* The size reported when the tracepoint fires: how much memory this pid uses on this GPU */
cur_val = args->size;
if (!cur_val) {
bpf_gpu_mem_total_map_delete_elem(&key); //generated by the DEFINE_BPF_MAP_GRO macro
return 0;
}
/* Look the key up first: update the entry if it exists, otherwise create one */
prev_val = bpf_gpu_mem_total_map_lookup_elem(&key); //generated by the DEFINE_BPF_MAP_GRO macro
if (prev_val) {
*prev_val = cur_val;
} else {
bpf_gpu_mem_total_map_update_elem(&key, &cur_val, BPF_NOEXIST); //generated by the DEFINE_BPF_MAP_GRO macro
}
return 0;
}
LICENSE("Apache 2.0");
The build configuration, frameworks/native/services/gpuservice/bpfprogs/Android.bp:
package {
default_applicable_licenses: ["frameworks_native_license"],
}
bpf {
name: "gpu_mem.o", //output object file
srcs: ["gpu_mem.c"], //source file
cflags: [
"-Wall",
"-Werror",
],
}
In outline:
(1) DEFINE_BPF_MAP, an Android wrapper macro, defines the type of the BPF data container and generates its access functions.
(2) DEFINE_BPF_PROG declares and defines the hook function.
(3) LICENSE specifies the license the program uses.
2. Using the bpf program
(1) Activating the program code behind the prog handle
For this example, the program is attached in native/services/gpuservice/gpumem/GpuMem.cpp:
static constexpr char kGpuMemTotalProgPath[] = "/sys/fs/bpf/prog_gpu_mem_tracepoint_gpu_mem_gpu_mem_total";
static constexpr char kGpuMemTotalMapPath[] = "/sys/fs/bpf/map_gpu_mem_gpu_mem_total_map";
void GpuMem::initialize() {
/* Make sure bpf programs are loaded */
bpf::waitForProgsLoaded();
int fd = bpf::retrieveProgram(kGpuMemTotalProgPath);
int count = 0;
/* Attach the program to the tracepoint; attaching also enables the tracepoint */
while (bpf_attach_tracepoint(fd, "gpu_mem", "gpu_mem_total") < 0) {
if (++count > kGpuWaitTimeout) {
return;
}
/* Retry until GPU driver loaded or timeout */
sleep(1);
}
/* Only a read-only mapping of the map is made here */
auto map = bpf::BpfMapRO<uint64_t, uint64_t>(kGpuMemTotalMapPath);
setGpuMemTotalMap(map);
}
Once the attach succeeds, data appears behind the map handle whenever the gpu_mem_total tracepoint fires.
(2) Reading data through the map handle's file
Using this bpf program amounts to reading /sys/fs/bpf/map_gpu_mem_gpu_mem_total_map directly, for example:
# cat /sys/fs/bpf/map_gpu_mem_gpu_mem_total_map
4205: 14106624
0: 425660416
10341: 16977920
...
The same data can also be viewed with bpftool:
root@localhost:# bpftool map list | grep gpu_mem //list all maps to find this map's id
17: hash name gpu_mem_total_m flags 0x0
root@localhost:# bpftool map dump id 17 //dump the map's contents by id; they match the cat output above
[{
"key": 4205,
"value": 14778368
},{
"key": 10341,
"value": 16977920
},{
"key": 2992,
"value": 2686976
},
...
]
The gpu service in Android reads its memory statistics from this same bpf program:
# dumpsys gpu --gpumem
Memory snapshot for GPU 0:
Global total: 358850560
Proc 1655 total: 184938496
Proc 2174 total: 2658304
Proc 2992 total: 2686976
Proc 3956 total: 10371072
Proc 4205 total: 14778368
Proc 5729 total: 26066944
Proc 6110 total: 2654208
Proc 8168 total: 112107520
Proc 10341 total: 16977920
In code, system/memory/libmeminfo/sysmeminfo.cpp, for example, reads the map file like this:
bool ReadPerProcessGpuMem([[maybe_unused]] std::unordered_map<uint32_t, uint64_t>* out) {
static constexpr const char kBpfGpuMemTotalMap[] = "/sys/fs/bpf/map_gpu_mem_gpu_mem_total_map";
/* Use the read-only wrapper BpfMapRO to properly retrieve the read-only map. */
auto map = bpf::BpfMapRO<uint64_t, uint64_t>(kBpfGpuMemTotalMap);
out->clear();
auto map_key = map.getFirstKey();
do {
uint64_t key = map_key.value();
uint32_t pid = key; // BPF Key [32-bits GPU ID | 32-bits PID]
auto gpu_mem = map.readValue(key);
...
map_key = map.getNextKey(key);
} while (map_key.ok());
return true;
}
IV. The bpf ELF file format
1. objdump can display a bpf ELF file's bytecode.
A compiled bpf program contains multiple sections; every map definition is stored in the maps section.
root@localhost:/# llvm-objdump-11 -h -d /system/etc/bpf/gpu_mem.o
/system/etc/bpf/gpu_mem.o: file format elf64-bpf
Sections:
Idx Name Size VMA Type
0 00000000 0000000000000000
1 .strtab 00000110 0000000000000000
2 .text 00000000 0000000000000000 TEXT
3 tracepoint/gpu_mem/gpu_mem_total 00000100 0000000000000000 TEXT //the prog's section in the ELF file
4 .reltracepoint/gpu_mem/gpu_mem_total 00000030 0000000000000000
5 maps 00000074 0000000000000000 DATA
6 .maps.gpu_mem_total_map 00000010 0000000000000000 DATA //the map's section in the ELF file
7 progs 0000005c 0000000000000000 DATA
8 bpfloader_min_ver 00000004 0000000000000000 DATA
9 bpfloader_max_ver 00000004 0000000000000000 DATA
10 size_of_bpf_map_def 00000008 0000000000000000 DATA
11 size_of_bpf_prog_def 00000008 0000000000000000 DATA
12 license 0000000b 0000000000000000 DATA
13 .BTF 00000c1b 0000000000000000
14 .llvm_addrsig 00000009 0000000000000000
15 .symtab 00000108 0000000000000000
Disassembly of section tracepoint/gpu_mem/gpu_mem_total:
0000000000000000 <tp_gpu_mem_total>:
0: 61 12 08 00 00 00 00 00 r2 = *(u32 *)(r1 + 8) //r1 points to struct gpu_mem_total_args *args; skip the common header and load gpu_id
1: 67 02 00 00 20 00 00 00 r2 <<= 32
2: 61 13 0c 00 00 00 00 00 r3 = *(u32 *)(r1 + 12) //load pid
3: 4f 32 00 00 00 00 00 00 r2 |= r3 //combine gpu_id|pid into the hash key
4: 7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 8) = r2
5: 79 16 10 00 00 00 00 00 r6 = *(u64 *)(r1 + 16) //load size
6: 7b 6a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r6
7: 55 06 06 00 00 00 00 00 if r6 != 0 goto +6 <tp_gpu_mem_total+0x70>
8: bf a2 00 00 00 00 00 00 r2 = r10
9: 07 02 00 00 f8 ff ff ff r2 += -8
10: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
12: 85 00 00 00 03 00 00 00 call 3 //helper 3 = bpf_map_delete_elem
13: 05 00 10 00 00 00 00 00 goto +16 <tp_gpu_mem_total+0xf0>
14: bf a2 00 00 00 00 00 00 r2 = r10
15: 07 02 00 00 f8 ff ff ff r2 += -8
16: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
18: 85 00 00 00 01 00 00 00 call 1 //helper 1 = bpf_map_lookup_elem
19: 15 00 02 00 00 00 00 00 if r0 == 0 goto +2 <tp_gpu_mem_total+0xb0>
20: 7b 60 00 00 00 00 00 00 *(u64 *)(r0 + 0) = r6
21: 05 00 08 00 00 00 00 00 goto +8 <tp_gpu_mem_total+0xf0>
22: bf a2 00 00 00 00 00 00 r2 = r10
23: 07 02 00 00 f8 ff ff ff r2 += -8
24: bf a3 00 00 00 00 00 00 r3 = r10
25: 07 03 00 00 f0 ff ff ff r3 += -16
26: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
28: b7 04 00 00 01 00 00 00 r4 = 1
29: 85 00 00 00 02 00 00 00 call 2 //helper 2 = bpf_map_update_elem
30: b7 00 00 00 00 00 00 00 r0 = 0
31: 95 00 00 00 00 00 00 00 exit
The same program can be inspected with bpftool:
root@localhost:/# bpftool prog | grep gpu
15: tracepoint name tracepoint_gpu_ tag 37955a3ec8581e93
root@localhost:/#
root@localhost:/# bpftool prog dump xlated id 15
0: (61) r2 = *(u32 *)(r1 +8)
1: (67) r2 <<= 32
2: (61) r3 = *(u32 *)(r1 +12)
3: (4f) r2 |= r3
4: (7b) *(u64 *)(r10 -8) = r2
5: (79) r6 = *(u64 *)(r1 +16)
6: (7b) *(u64 *)(r10 -16) = r6
7: (55) if r6 != 0x0 goto pc+6
8: (bf) r2 = r10
9: (07) r2 += -8
10: (18) r1 = map[id:17]
12: (85) call 0xffffffe0a837f6c8#89744
13: (05) goto pc+18
14: (bf) r2 = r10
15: (07) r2 += -8
16: (18) r1 = map[id:17]
18: (85) call 0xffffffe0a837f590#89432
19: (15) if r0 == 0x0 goto pc+1
20: (07) r0 += 56
21: (15) if r0 == 0x0 goto pc+2
22: (7b) *(u64 *)(r0 +0) = r6
23: (05) goto pc+8
24: (bf) r2 = r10
25: (07) r2 += -8
26: (bf) r3 = r10
27: (07) r3 += -16
28: (18) r1 = map[id:17]
30: (b7) r4 = 1
31: (85) call 0xffffffe0a837f648#89616
32: (b7) r0 = 0
33: (95) exit
As the disassembly shows, tp_gpu_mem_total parses its argument according to the tracepoint's format file:
# cat /sys/kernel/tracing/events/gpu_mem/gpu_mem_total/format
name: gpu_mem_total
ID: 671
format:
field:unsigned short common_type; offset:0; size:2; signed:0; //the first 8 bytes are common to all tracepoints
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:uint32_t gpu_id; offset:8; size:4; signed:0; //the user-defined fields follow
field:uint32_t pid; offset:12; size:4; signed:0;
field:uint64_t size; offset:16; size:8; signed:0;
print fmt: "gpu_id=%u pid=%u size=%llu", REC->gpu_id, REC->pid, REC->size
readelf can also show each section's information, including its file offset:
root@localhost:/system/etc/bpf# llvm-readelf-11 -s -S gpu_mem.o
There are 16 section headers, starting at offset 0x10c0:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .strtab STRTAB 0000000000000000 000fa9 000110 00 0 0 1
[ 2] .text PROGBITS 0000000000000000 000040 000000 00 AX 0 0 4
[ 3] tracepoint/gpu_mem/gpu_mem_total PROGBITS 0000000000000000 000040 000100 00 AX 0 0 8
[ 4] .reltracepoint/gpu_mem/gpu_mem_total REL 0000000000000000 000f70 000030 10 I 15 3 8
[ 5] maps PROGBITS 0000000000000000 000140 000074 00 A 0 0 4
[ 6] .maps.gpu_mem_total_map PROGBITS 0000000000000000 0001b8 000010 00 WA 0 0 8
[ 7] progs PROGBITS 0000000000000000 0001c8 00005c 00 A 0 0 4
[ 8] bpfloader_min_ver PROGBITS 0000000000000000 000224 000004 00 WA 0 0 4
[ 9] bpfloader_max_ver PROGBITS 0000000000000000 000228 000004 00 WA 0 0 4
[10] size_of_bpf_map_def PROGBITS 0000000000000000 000230 000008 00 WA 0 0 8
[11] size_of_bpf_prog_def PROGBITS 0000000000000000 000238 000008 00 WA 0 0 8
[12] license PROGBITS 0000000000000000 000240 00000b 00 WA 0 0 1
[13] .BTF PROGBITS 0000000000000000 00024c 000c1b 00 0 0 4
[14] .llvm_addrsig LLVM_ADDRSIG 0000000000000000 000fa0 000009 00 E 0 0 1
[15] .symtab SYMTAB 0000000000000000 000e68 000108 18 1 2 8
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
p (processor specific)
Symbol table '.symtab' contains 11 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 SECTION LOCAL DEFAULT 3 tracepoint/gpu_mem/gpu_mem_total
2: 0000000000000000 256 FUNC GLOBAL DEFAULT 3 tp_gpu_mem_total
3: 0000000000000000 116 OBJECT GLOBAL DEFAULT 5 gpu_mem_total_map
4: 0000000000000000 16 OBJECT GLOBAL DEFAULT 6 ____btf_map_gpu_mem_total_map
5: 0000000000000000 92 OBJECT GLOBAL DEFAULT 7 tp_gpu_mem_total_def
6: 0000000000000000 4 OBJECT GLOBAL DEFAULT 8 _bpfloader_min_ver
7: 0000000000000000 4 OBJECT GLOBAL DEFAULT 9 _bpfloader_max_ver
8: 0000000000000000 8 OBJECT GLOBAL DEFAULT 10 _size_of_bpf_map_def
9: 0000000000000000 8 OBJECT GLOBAL DEFAULT 11 _size_of_bpf_prog_def
10: 0000000000000000 11 OBJECT GLOBAL DEFAULT 12 _license
V. Summary
1. When the bpfloader service starts, it loads every .o bpf ELF file from the paths listed in BpfLoader.cpp and exports prog and map handles under /sys/fs/bpf: a prog handle corresponds to an ELF code segment, and a map handle
corresponds to the data read/write interface. Some program must then attach each prog to its hook point so that data is collected when the hook fires; the collected data is exported to other processes through the map handle file.
VI. Related material
1. eBPF architecture diagram