1. Problem scenario
A mongos process on one of our production machines was hit by an OOM event, so I spent some time studying how the Linux kernel's OOM mechanism works, to make future analysis easier.
$ sudo grep mongos /var/log/messages
Apr 10 15:35:38 localhost sz[32066]: [xxxx] check_mongos.sh/ZMODEM: 211 Bytes, 229 BPS
Apr 23 14:50:18 localhost sz[5794]: [xxxxx] mongos/ZMODEM: 297 Bytes, 151 BPS
Apr 23 15:01:55 localhost kernel: [20387]   497 20387   694326   427932   0   0   0 mongos
Apr 23 15:01:55 localhost kernel: Out of memory: Kill process 20387 (mongos) score 890 or sacrifice child
Apr 23 15:01:55 localhost kernel: Killed process 20387, UID 497, (mongos) total-vm:2777304kB, anon-rss:1711700kB, file-rss:28kB
The machine ran short of memory, the kernel's OOM mechanism was triggered, and it killed the mongos process.
2. Understanding how OOM works
Download the Linux kernel source and look at the OOM-related code, which lives in oom_kill.c:
linux-2.6.32.65/mm/oom_kill.c
/*
 *  linux/mm/oom_kill.c
 *
 *  Copyright (C)  1998,2000  Rik van Riel
 *      Thanks go out to Claus Fischer for some serious inspiration and
 *      for goading me into coding this file...
 *
 *  The routines in this file are used to kill a process when
 *  we're seriously out of memory. This gets called from __alloc_pages()
 *  in mm/page_alloc.c when we really run out of memory.
 *
 *  Since we won't call these routines often (on a well-configured
 *  machine) this file will double as a 'coding guide' and a signpost
 *  for newbie kernel hackers. It features several pointers to major
 *  kernel subsystems and hints as to where to find out what things do.
 */
The routines in this file select and kill a process when memory is seriously exhausted. On a well-configured machine they are rarely called.
/**
 * badness - calculate a numeric value for how bad this task has been
 * @p: task struct of which task we should calculate
 * @uptime: current uptime in seconds
 *
 * The formula used is relatively simple and documented inline in the
 * function. The main rationale is that we want to select a good task
 * to kill when we run out of memory.
 *
 * Good in this context means that:
 * 1) we lose the minimum amount of work done
 * 2) we recover a large amount of memory
 * 3) we don't kill anything innocent of eating tons of memory
 * 4) we want to kill the minimum amount of processes (one)
 * 5) we try to kill the process the user expects us to kill, this
 *    algorithm has been meticulously tuned to meet the principle
 *    of least surprise ... (be careful when you change it)
 */
unsigned long badness(struct task_struct *p, unsigned long uptime)
badness() computes a score for each process describing how "bad" a kill candidate it is.
The victim chosen should have the following properties:
1) killing it loses the minimum amount of work done
2) killing it recovers a large amount of memory
3) no process innocent of eating tons of memory is killed
4) as few processes as possible are killed (ideally just one)
5) the process killed is the one the user would expect to be killed
	unsigned long points, cpu_time, run_time;
	struct mm_struct *mm;
	struct task_struct *child;
	int oom_adj = p->signal->oom_adj;
	struct task_cputime task_time;
	unsigned long utime;
	unsigned long stime;
	/*
	 * The memory size of the process is the basis for the badness.
	 */
	points = mm->total_vm;
The memory footprint of the process (total_vm) is the starting point for its badness score.
	/*
	 * swapoff can easily use up all memory, so kill those first.
	 */
	if (p->flags & PF_OOM_ORIGIN)
		return ULONG_MAX;
A process running swapoff can easily use up all memory, so such processes are killed first: they get the maximum possible score.
	/*
	 * Processes which fork a lot of child processes are likely
	 * a good choice. We add half the vmsize of the children if they
	 * have an own mm. This prevents forking servers to flood the
	 * machine with an endless amount of children. In case a single
	 * child is eating the vast majority of memory, adding only half
	 * to the parents will make the child our kill candidate of choice.
	 */
	list_for_each_entry(child, &p->children, sibling) {
		task_lock(child);
		if (child->mm != mm && child->mm)
			points += child->mm->total_vm/2 + 1;
		task_unlock(child);
	}
Processes that fork many children are a good choice: half the vmsize of each child with its own mm is added to the parent's score. This prevents forking servers from flooding the machine with children, while a single child eating most of the memory still ends up as the preferred victim itself.
	/*
	 * CPU time is in tens of seconds and run time is in thousands
	 * of seconds. There is no particular reason for this other than
	 * that it turned out to work very well in practice.
	 */
	thread_group_cputime(p, &task_time);
	utime = cputime_to_jiffies(task_time.utime);
	stime = cputime_to_jiffies(task_time.stime);
	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
	/*
	 * Niced processes are most likely less important, so double
	 * their badness points.
	 */
	if (task_nice(p) > 0)
		points *= 2;
Niced processes are most likely less important, so their badness score is doubled.
	/*
	 * Superuser processes are usually more important, so we make it
	 * less likely that we kill those.
	 */
	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
		points /= 4;
Processes run by the superuser are usually more important, so their score is divided by 4, making them less likely to be killed.
	/*
	 * We don't want to kill a process with direct hardware access.
	 * Not only could that mess up the hardware, but usually users
	 * tend to only have this flag set on applications they think
	 * of as important.
	 */
	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
		points /= 4;
Processes with direct hardware access are less likely to be killed, both because killing them could mess up the hardware and because users usually grant this capability only to applications they consider important.
	/*
	 * If p's nodes don't overlap ours, it may still help to kill p
	 * because p may have allocated or otherwise mapped memory on
	 * this node before. However it will be less likely.
	 */
	if (!has_intersects_mems_allowed(p))
		points /= 8;
	/*
	 * Adjust the score by oom_adj.
	 */
	if (oom_adj) {
		if (oom_adj > 0) {
			if (!points)
				points = 1;
			points <<= oom_adj;
		} else
			points >>= -(oom_adj);
	}
Finally the score is adjusted by oom_adj: a positive value left-shifts the score (more likely to be killed), a negative value right-shifts it.
#ifdef DEBUG
	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
	p->pid, p->comm, points);
#endif
	return points;
}
With DEBUG enabled, the computed score is printed; the function then returns it.
/*
 * Simple selection loop. We chose the process with the highest
 * number of 'points'. We expect the caller will lock the tasklist.
 *
 * (not docbooked, we don't want this one cluttering up the manual)
 */
A simple selection loop compares all processes and picks the one with the highest score.
		/*
		 * skip kernel threads and tasks which have already
		 * released their mm.
		 */
		if (!p->mm)
			continue;
		/* skip the init task */
		if (is_global_init(p))
			continue;
		if (mem && !task_in_mem_cgroup(p, mem))
			continue;
Kernel threads and tasks that have already released their mm are skipped, as is the init task.
/**
 * dump_tasks - dump current memory state of all system tasks
 * @mem: target memory controller
 *
 * Dumps the current memory state of all system tasks, excluding kernel threads.
 * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
 * score, and name.
 *
 * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
 * shown.
 *
 * Call with tasklist_lock read-locked.
 */
static void dump_tasks(const struct mem_cgroup *mem)
/*
 * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO
 * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
 * set.
 */
static void __oom_kill_task(struct task_struct *p, int verbose)
/*
 * If the task is already exiting, don't alarm the sysadmin or kill
 * its children or threads, just set TIF_MEMDIE so it can die quickly
 */
/* Try to kill a child first */
/**
 * out_of_memory - kill the "best" process when we run out of memory
 * @zonelist: zonelist pointer
 * @gfp_mask: memory allocation flags
 * @order: amount of memory being requested as a power of 2
 *
 * If we run out of memory, we have the choice between either
 * killing a random task (bad), letting the system crash (worse)
 * OR try to be smart about which process to kill. Note that we
 * don't have to be perfect here, we just have to be good.
 */
out_of_memory() kills the "best" process when memory runs out. At that point the choices are killing a random task (bad), letting the system crash (worse), or trying to be smart about which process to kill. The selection does not have to be perfect, just good.
On Linux, a non-NULL return from malloc() does not mean the memory is actually backed by anything. The system allows programs to request more memory than is available; this feature is called overcommit. The rationale is that not every program immediately uses all the memory it requests, and by the time it does, some memory may already have been reclaimed elsewhere. But when a program finally touches memory the system cannot provide, the OOM killer steps in and forces some process to exit.
The Linux kernel supports three overcommit handling modes, configured via /proc/sys/vm/overcommit_memory (the default is 0):
0  Heuristic overcommit handling. Obvious overcommits of address space are refused. The root user is allowed to allocate somewhat more than an ordinary user.
1  Always overcommit. Suitable for some scientific computing applications.
2  Don't overcommit. The total address space handed out to applications cannot exceed swap + physical RAM * overcommit_ratio. In this mode, when the system cannot allocate more memory to an application, nothing is killed; the allocation simply fails with an error. The ratio is set via /proc/sys/vm/overcommit_ratio and defaults to 50. For example, with 512 MB of swap and 2 GB of physical RAM, the most the mongos process above could be granted is 512 MB + 2 GB * 50% = 1.5 GB.
The current commit state can be inspected through /proc/meminfo:
$ cat /proc/meminfo | grep Commit
CommitLimit:     1518820 kB
Committed_AS:     693100 kB
As mentioned earlier, a process's oom_adj can be tuned to lower the probability of it being killed by the OOM killer. See include/linux/oom.h:
#ifndef __INCLUDE_LINUX_OOM_H
#define __INCLUDE_LINUX_OOM_H

/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15
A process whose oom_adj is set to -17 is never killed by the OOM killer. The adjustable range is -16 to 15; the higher the value, the more likely the process is to be killed. The final decision is made on the resulting oom_score.
To keep the mongos process from being killed by the OOM killer again, its oom_adj is set to -17:
$ sudo pidof mongos
7553
$ cat /proc/7553/oom_adj
-17
But we usually reach servers over SSH; if the sshd process were killed by the OOM killer, we would be locked out. Check the sshd process:
$ sudo pidof sshd
2160
[gintama@gintama-qa-server vm]$ ps -ef | grep sshd
root      2160     1  0  2014 ?        00:09:52 /usr/sbin/sshd
[gintama@gintama-qa-server vm]$ cat /proc/2160/oom_adj
-17
As shown, sshd's oom_adj already defaults to -17, so it will never be killed by the OOM killer.
A few other OOM-related kernel parameters:
/proc/sys/vm/panic_on_oom
/proc/sys/vm/oom_dump_tasks
/proc/sys/vm/oom_kill_allocating_task
panic_on_oom makes the kernel panic when an OOM occurs; use with caution. Default is 0.
oom_dump_tasks dumps the current memory state of every system task when the OOM killer is invoked. Default is 1.
oom_kill_allocating_task makes the OOM Killer directly kill the task that triggered the allocation, instead of scanning for the highest-scoring process.
static void __out_of_memory(gfp_t gfp_mask, int order)
{
	struct task_struct *p;
	unsigned long points;

	if (sysctl_oom_kill_allocating_task)
		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
				"Out of memory (oom_kill_allocating_task)"))
			return;
References:
http://blog.chinaunix.net/uid-20788636-id-4308527.html
http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html