What is a softlockup
Softlockups are bugs that cause the kernel to loop in kernel mode for more than 20 seconds without giving other tasks a chance to run. The current stack trace is displayed upon detection and, by default, the system stays locked up.
What is a hardlockup
Hardlockups are bugs that cause the CPU to loop in kernel mode for more than 10 seconds without letting other interrupts have a chance to run. The current stack trace is displayed upon detection and, by default, the system stays locked up.
How softlockup detection works
Watchdog thread initialization
static void watchdog_enable(unsigned int cpu)
{
	struct hrtimer *hrtimer = this_cpu_ptr(&watchdog_hrtimer);

	/*
	 * Start the timer first to prevent the NMI watchdog triggering
	 * before the timer has a chance to fire.
	 */
	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	hrtimer->function = watchdog_timer_fn;	/* timer callback */
	hrtimer_start(hrtimer, ns_to_ktime(sample_period),	/* 4 s */
		      HRTIMER_MODE_REL_PINNED);

	/* Initialize timestamp */
	__touch_watchdog();
	/* Enable the perf event */
	if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
		watchdog_nmi_enable(cpu);

	watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
}
By default the hrtimer fires every 4 seconds. Its expiry callback is watchdog_timer_fn.
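The 4-second default is not an independent setting: it follows from watchdog_thresh (10 seconds by default). The softlockup threshold is twice watchdog_thresh, and the timer is given five chances to fire within that threshold. Below is a minimal userspace sketch of that arithmetic. The helper names are illustrative stand-ins, not the kernel's own functions.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Softlockup threshold in seconds: twice watchdog_thresh. */
static unsigned int softlockup_thresh(unsigned int watchdog_thresh)
{
	return watchdog_thresh * 2;
}

/* hrtimer period in nanoseconds: five timer ticks per threshold. */
static uint64_t sample_period_ns(unsigned int watchdog_thresh)
{
	assert(watchdog_thresh > 0);
	return softlockup_thresh(watchdog_thresh) * (NSEC_PER_SEC / 5);
}
```

With watchdog_thresh = 10 this gives a 20-second softlockup threshold and a 4-second timer period, matching the values quoted in this note.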
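How a softlockup is identified
The hrtimer path checks whether the watchdog_touch_ts timestamp has gone more than 20 seconds without an update. If it has, the watchdog thread has not been woken and run for 20 seconds, so the hrtimer path prints the stack of the soft-locked CPU and, if so configured, also panics.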
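The check itself is a simple timestamp comparison. The following is a simplified userspace sketch, assuming timestamps in whole seconds and a fixed 20-second threshold. The kernel version also checks whether the soft watchdog is enabled and reads the current time itself, both of which are omitted here.

```c
#include <assert.h>

/* Assumed default; in the kernel this is derived from watchdog_thresh. */
#define SOFTLOCKUP_THRESH_SEC 20UL

/*
 * Return how long (in seconds) the CPU has gone without the watchdog
 * timestamp being refreshed, or 0 if it is still within the threshold.
 */
static unsigned long is_softlockup(unsigned long touch_ts, unsigned long now)
{
	assert(now >= touch_ts);
	if (now - touch_ts > SOFTLOCKUP_THRESH_SEC)
		return now - touch_ts;
	return 0;
}
```

A nonzero return value is the stall duration, which is what ends up in the "soft lockup - CPU#N stuck for Ns!" message.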
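The timer callback wakes the watchdog thread; if the thread gets to run, it updates the timestamp.
Checking the watchdog thread's scheduling class on a 4.18 kernel
As shown, the watchdog thread's scheduling class is rt.
Checking the migration thread's scheduling class
As shown, the migration thread's scheduling class is stop.
The stop scheduling class has the highest priority. To see why, look at the implementation of schedule.
stop scheduling class
dl scheduling class
cfs scheduling class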
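As shown, the scheduling classes in descending priority are stop, dl, rt, cfs. The watchdog thread therefore has a lower priority than the migration thread.
The dl class also ranks above rt, which opens up the following possibility: while dl tasks keep running periodically, the rt watchdog thread may not get scheduled, and a false softlockup report can result.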
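The effect of this ordering can be sketched with a toy class picker. This is a userspace illustration only: the enum and function names are hypothetical, and the real scheduler walks linked sched_class structures rather than an array of counters.

```c
#include <assert.h>
#include <stddef.h>

/* The four classes discussed here, in descending priority. */
enum sched_class_id { CLASS_STOP, CLASS_DL, CLASS_RT, CLASS_CFS, CLASS_COUNT };

/*
 * Return the highest-priority class that has a runnable task, or -1
 * if every class is idle. nr_running[i] is the number of runnable
 * tasks in class i.
 */
static int pick_next_class(const unsigned int nr_running[CLASS_COUNT])
{
	assert(nr_running != NULL);
	for (int i = 0; i < CLASS_COUNT; i++)
		if (nr_running[i] > 0)
			return i;
	return -1;
}
```

As long as a dl task is runnable, the rt watchdog thread is never picked, whereas a runnable stop-class task always wins.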
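Looking at the upstream code, there is a patch that appears to address this problem, commit id 9cf57731b63e37ed995b46690adc604891a9a28f
As shown, the kernel moves the timestamp update into the migration thread, so the update can preempt dl-class tasks and the false positive is avoided.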
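To make the difference concrete, here is a toy simulation, not kernel code. It assumes a 4-second timer, a 20-second threshold, and a dl task that either hogs the CPU completely or not at all. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Who refreshes the watchdog timestamp. */
enum updater { UPDATER_RT_THREAD, UPDATER_STOP_WORK };

/*
 * Simulate `seconds` seconds of timer ticks and report whether a
 * softlockup would be flagged. Simplification: the updater either
 * runs right after every tick or never runs at all.
 */
static bool softlockup_reported(enum updater who, bool dl_hogs_cpu,
				unsigned long seconds)
{
	const unsigned long period = 4, thresh = 20;
	unsigned long touch_ts = 0;

	for (unsigned long now = period; now <= seconds; now += period) {
		if (now - touch_ts > thresh)
			return true;
		/* stop-class work preempts dl; an rt thread does not. */
		if (who == UPDATER_STOP_WORK || !dl_hogs_cpu)
			touch_ts = now;
	}
	return false;
}
```

With a dl task hogging the CPU, the rt-thread variant reports a lockup at the 24-second tick, while the stop-work variant never does.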
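Related kernel configuration
migration thread creation flow
The cpu_stopper_thread flow is fairly simple: it takes the first work item off the work queue and executes it.
Once that work item has finished, it takes the next one and continues.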
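The loop described above can be sketched as a plain FIFO of work items. This is a userspace approximation: struct stop_work and struct stopper are hypothetical stand-ins for the kernel's structures, and the locking and the sleeping on an empty queue that the real thread needs are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one queued stop work. */
struct stop_work {
	void (*fn)(void *arg);
	void *arg;
	struct stop_work *next;
};

/* Hypothetical stand-in for the per-CPU stopper queue. */
struct stopper {
	struct stop_work *head;	/* oldest queued work */
	struct stop_work *tail;	/* newest queued work */
};

/* Append a work item to the tail of the queue. */
static void stopper_queue(struct stopper *s, struct stop_work *w)
{
	assert(s != NULL && w != NULL && w->fn != NULL);
	w->next = NULL;
	if (s->tail)
		s->tail->next = w;
	else
		s->head = w;
	s->tail = w;
}

/* Run queued works in FIFO order until the queue is empty. */
static int stopper_run(struct stopper *s)
{
	int ran = 0;

	while (s->head) {
		struct stop_work *w = s->head;

		s->head = w->next;
		if (!s->head)
			s->tail = NULL;
		w->fn(w->arg);
		ran++;
	}
	return ran;
}

/* Example work: bump the counter passed through arg. */
static void count_work(void *arg)
{
	(*(int *)arg)++;
}
```

Queuing two count_work items and calling stopper_run executes both in order and leaves the queue empty.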