
Better failure message on hitting resource limitations. #5

Open
michaeljs1990 opened this issue Apr 2, 2019 · 7 comments

@michaeljs1990

michaeljs1990 commented Apr 2, 2019

Currently, when someone schedules a job that may use all of the resources allocated to its cgroup, the UI reports REASON_COMMAND_EXECUTOR_FAILED. From looking at the host where this happens, it seems like peloton/mesos knows that the task is failing because it hit this limit...

Task in /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b killed as a result of limit of /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b

Would it be possible to bubble up in the UI that the job was killed due to a resource constraint, and not due to any issue with the code that was running?
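
For reference, the kernel keeps both the limit and a counter of how many times it was hit in the container's memory cgroup, so the condition is visible on the host independently of the UI. Here is a minimal Go sketch of reading those values; the cgroup v1 mount point and the cgroup path built from the container ID above are assumptions, not Peloton's actual layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readInt reads a single integer value from a cgroup control file.
func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// Hypothetical cgroup path, using the Mesos container ID from the message above.
	cgroup := filepath.Join("/sys/fs/cgroup/memory/mesos", "a824e4e4-8c46-49b0-b2a3-73fa0f7af93b")

	limit, err := readInt(filepath.Join(cgroup, "memory.limit_in_bytes"))
	if err != nil {
		panic(err)
	}
	failcnt, err := readInt(filepath.Join(cgroup, "memory.failcnt"))
	if err != nil {
		panic(err)
	}

	// A non-zero failcnt means the cgroup hit its memory limit at least once,
	// which is the condition the "killed as a result of limit of ..." line reports.
	fmt.Printf("limit=%d bytes, failcnt=%d\n", limit, failcnt)
	if failcnt > 0 {
		fmt.Println("container has hit its memory limit; OOM kills are likely")
	}
}
```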

@zhixinwen
Contributor

Could you point me to where you see the "killed as a result of limit..." message?

@michaeljs1990
Author

michaeljs1990 commented Apr 2, 2019

I saw that in dmesg on the host. Here is the full output with some stuff removed.

[1226766.998728] Task in /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b killed as a result of limit of /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b
[1226767.011413] memory: usage 5275648kB, limit 5275648kB, failcnt 246819
[1226767.017969] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[1226767.024779] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[1226767.030975] Memory cgroup stats for /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b: cache:67620KB rss:5208028KB rss_huge:0KB mapped_file:67584KB dirty:0KB writeback:0KB inactive_anon:67584KB active_anon:5208028KB inactive_file:8KB active_file:8KB unevictable:0KB
[1226767.054381] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[1226767.063398] [26342]     0 26342    43688    10963      84       3        0             0 mesos-container
[1226767.073045] [26400]     0 26400   654224    11477     152       5        0             0 mesos-executor
[1226767.082643] [26468] 16451 26468     1111      213       7       3        0             0 sh
[1226767.091187] [26472] 16451 26472   709790    89013     659       5        0             0 python2.7
[1226767.100347] [26595] 16451 26595   318312     7020      77       4        0             0 dbh_clone
[1226767.109477] [26788] 16451 26788 10221696  1452438    4080      22        0             0 python2.7
[1226767.118639] [26814] 16451 26814    81541     4277     125       3        0             0 dbh_clone
[1226767.127788] Memory cgroup out of memory: Kill process 26788 (python2.7) score 1104 or sacrifice child
[1226767.137218] Killed process 26814 (dbh_clone) total-vm:326164kB, anon-rss:11204kB, file-rss:5904kB
[1226831.939895] python2.7 invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
[1226831.947980] python2.7 cpuset=/ mems_allowed=0-1
[1226831.952737] CPU: 21 PID: 26774 Comm: python2.7 Tainted: P           OE   4.4.92 #1
[1226831.960489] Hardware name: Redacted, BIOS Redacted
[1226831.968563]  0000000000000286 25203cfc5af37d64 ffffffff812f97a5 ffff883fcccf3e20
[1226831.976208]  ffff881fcee8cc00 ffffffff811db195 ffff883ff22e6a00 ffffffff810a1630
[1226831.983853]  ffff883ff22e6a00 ffffffff8116dd26 ffff880ffcbc4f80 ffff883f368762b8
[1226831.991480] Call Trace:
[1226831.994102]  [<ffffffff812f97a5>] ? dump_stack+0x5c/0x77
[1226831.999585]  [<ffffffff811db195>] ? dump_header+0x62/0x1d7
[1226832.005240]  [<ffffffff810a1630>] ? check_preempt_curr+0x50/0x90
[1226832.011422]  [<ffffffff8116dd26>] ? find_lock_task_mm+0x36/0x80
[1226832.017516]  [<ffffffff8116e2b1>] ? oom_kill_process+0x211/0x3d0
[1226832.023696]  [<ffffffff811d385f>] ? mem_cgroup_iter+0x1cf/0x360
[1226832.029798]  [<ffffffff811d56f3>] ? mem_cgroup_out_of_memory+0x283/0x2c0
[1226832.036671]  [<ffffffff811d63cd>] ? mem_cgroup_oom_synchronize+0x32d/0x340
[1226832.043714]  [<ffffffff811d1a80>] ? mem_cgroup_begin_page_stat+0x90/0x90
[1226832.050589]  [<ffffffff8116e994>] ? pagefault_out_of_memory+0x44/0xc0
[1226832.057214]  [<ffffffff815a98b8>] ? page_fault+0x28/0x30

Thinking about it more, I'm guessing Mesos might not actually know that this is getting killed for OOMing, but I thought it was worth looking into.
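
For what it's worth, the cgroup v1 memory controller can notify userspace of OOM events through an eventfd registered via cgroup.event_control against memory.oom_control, which is the kind of signal an agent could use to distinguish an OOM kill from an ordinary executor failure. A rough Go sketch of that mechanism, assuming the same cgroup path as above (not Mesos' actual implementation):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical cgroup path for the container seen in the dmesg output above.
	cgroup := "/sys/fs/cgroup/memory/mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b"

	// Open the memory controller's OOM control file.
	oomControl, err := os.Open(filepath.Join(cgroup, "memory.oom_control"))
	if err != nil {
		panic(err)
	}
	defer oomControl.Close()

	// Create an eventfd the kernel will signal on each OOM event in the cgroup.
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(efd)

	// Register the eventfd with the cgroup: "<eventfd> <oom_control fd>".
	reg := fmt.Sprintf("%d %d", efd, oomControl.Fd())
	if err := os.WriteFile(filepath.Join(cgroup, "cgroup.event_control"), []byte(reg), 0o600); err != nil {
		panic(err)
	}

	// Block until the kernel reports an OOM event; the 8-byte read is the event counter.
	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil {
		panic(err)
	}
	fmt.Printf("OOM events in %s: %d\n", cgroup, binary.LittleEndian.Uint64(buf))
}
```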

@varungup90
Contributor

@michaeljs1990 Thanks for raising the concern. I looked into it more, and Mesos exposes detailed reasons for why the container was terminated, which include the memory limit (ln 2607). I will fix this.
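
For context, the Mesos TaskStatus protobuf carries a Reason enum with values such as REASON_CONTAINER_LIMITATION_MEMORY alongside REASON_COMMAND_EXECUTOR_FAILED, so surfacing the right message is largely a matter of propagating that reason to the UI. A rough Go sketch of what such a mapping could look like; the function name and message strings are illustrative, not Peloton's actual code:

```go
package main

import "fmt"

// failureMessage translates a Mesos task-status reason string into a
// message suitable for the UI. The reason names below come from the
// Mesos TaskStatus.Reason enum; the wording of the messages is illustrative.
func failureMessage(reason string) string {
	switch reason {
	case "REASON_CONTAINER_LIMITATION_MEMORY":
		return "Task killed: container exceeded its memory limit (OOM)"
	case "REASON_CONTAINER_LIMITATION_DISK":
		return "Task killed: container exceeded its disk quota"
	case "REASON_CONTAINER_LIMITATION":
		return "Task killed: container exceeded a resource limit"
	case "REASON_COMMAND_EXECUTOR_FAILED":
		return "Task failed: command executor exited with an error"
	default:
		return fmt.Sprintf("Task failed: %s", reason)
	}
}

func main() {
	// With the reason propagated, the OOM case from this issue would render as:
	fmt.Println(failureMessage("REASON_CONTAINER_LIMITATION_MEMORY"))
}
```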

@michaeljs1990
Author

Awesome to hear! Thanks.

@michaeljs1990
Author

michaeljs1990 commented Jun 6, 2019

Was this added? I believe I've been seeing better error messages in the UI around this now, or possibly I'm imagining things.

@mabansal
Collaborator

@vargup did you add the change?

@talaniz

talaniz commented Jul 1, 2019

@vargup Bump, can you please advise whether this was changed?
