
Better failure message on hitting resource limitations. #5

Open
michaeljs1990 opened this issue Apr 2, 2019 · 7 comments

@michaeljs1990

michaeljs1990 commented Apr 2, 2019

Currently, when someone schedules a job that may use all of the resources allocated to its cgroup, the UI reports REASON_COMMAND_EXECUTOR_FAILED. From looking at the host where this happens, it seems like peloton/mesos knows that the task is failing because it hit this limit...

Task in /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b killed as a result of limit of /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b

Would it be possible to bubble up in the UI that the job was killed due to a resource constraint, and not due to any issue with the code that was running?
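
For reference, the kernel keeps both the limit and a counter of how many times it was hit in the container's memory cgroup, so the condition is visible on the host independently of the UI. Here is a minimal Go sketch of reading those values; the cgroup v1 mount point and the cgroup path built from the container ID above are assumptions, not Peloton's actual layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readInt reads a single integer value from a cgroup control file.
func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// Hypothetical cgroup path, using the Mesos container ID from the message above.
	cgroup := filepath.Join("/sys/fs/cgroup/memory/mesos", "a824e4e4-8c46-49b0-b2a3-73fa0f7af93b")

	limit, err := readInt(filepath.Join(cgroup, "memory.limit_in_bytes"))
	if err != nil {
		panic(err)
	}
	failcnt, err := readInt(filepath.Join(cgroup, "memory.failcnt"))
	if err != nil {
		panic(err)
	}

	// A non-zero failcnt means the cgroup hit its memory limit at least once,
	// which is the condition the "killed as a result of limit of ..." line reports.
	fmt.Printf("limit=%d bytes, failcnt=%d\n", limit, failcnt)
	if failcnt > 0 {
		fmt.Println("container has hit its memory limit; OOM kills are likely")
	}
}
```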

@zhixinwen
Contributor

Could you point me to where you see the "killed as a result of limit..." message?

@michaeljs1990
Author

michaeljs1990 commented Apr 2, 2019

I saw that in dmesg on the host. Here is the full output with some stuff removed.

[1226766.998728] Task in /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b killed as a result of limit of /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b
[1226767.011413] memory: usage 5275648kB, limit 5275648kB, failcnt 246819
[1226767.017969] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[1226767.024779] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[1226767.030975] Memory cgroup stats for /mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b: cache:67620KB rss:5208028KB rss_huge:0KB mapped_file:67584KB dirty:0KB writeback:0KB inactive_anon:67584KB active_anon:5208028KB inactive_file:8KB active_file:8KB unevictable:0KB
[1226767.054381] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[1226767.063398] [26342]     0 26342    43688    10963      84       3        0             0 mesos-container
[1226767.073045] [26400]     0 26400   654224    11477     152       5        0             0 mesos-executor
[1226767.082643] [26468] 16451 26468     1111      213       7       3        0             0 sh
[1226767.091187] [26472] 16451 26472   709790    89013     659       5        0             0 python2.7
[1226767.100347] [26595] 16451 26595   318312     7020      77       4        0             0 dbh_clone
[1226767.109477] [26788] 16451 26788 10221696  1452438    4080      22        0             0 python2.7
[1226767.118639] [26814] 16451 26814    81541     4277     125       3        0             0 dbh_clone
[1226767.127788] Memory cgroup out of memory: Kill process 26788 (python2.7) score 1104 or sacrifice child
[1226767.137218] Killed process 26814 (dbh_clone) total-vm:326164kB, anon-rss:11204kB, file-rss:5904kB
[1226831.939895] python2.7 invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
[1226831.947980] python2.7 cpuset=/ mems_allowed=0-1
[1226831.952737] CPU: 21 PID: 26774 Comm: python2.7 Tainted: P           OE   4.4.92 #1
[1226831.960489] Hardware name: Redacted, BIOS Redacted
[1226831.968563]  0000000000000286 25203cfc5af37d64 ffffffff812f97a5 ffff883fcccf3e20
[1226831.976208]  ffff881fcee8cc00 ffffffff811db195 ffff883ff22e6a00 ffffffff810a1630
[1226831.983853]  ffff883ff22e6a00 ffffffff8116dd26 ffff880ffcbc4f80 ffff883f368762b8
[1226831.991480] Call Trace:
[1226831.994102]  [<ffffffff812f97a5>] ? dump_stack+0x5c/0x77
[1226831.999585]  [<ffffffff811db195>] ? dump_header+0x62/0x1d7
[1226832.005240]  [<ffffffff810a1630>] ? check_preempt_curr+0x50/0x90
[1226832.011422]  [<ffffffff8116dd26>] ? find_lock_task_mm+0x36/0x80
[1226832.017516]  [<ffffffff8116e2b1>] ? oom_kill_process+0x211/0x3d0
[1226832.023696]  [<ffffffff811d385f>] ? mem_cgroup_iter+0x1cf/0x360
[1226832.029798]  [<ffffffff811d56f3>] ? mem_cgroup_out_of_memory+0x283/0x2c0
[1226832.036671]  [<ffffffff811d63cd>] ? mem_cgroup_oom_synchronize+0x32d/0x340
[1226832.043714]  [<ffffffff811d1a80>] ? mem_cgroup_begin_page_stat+0x90/0x90
[1226832.050589]  [<ffffffff8116e994>] ? pagefault_out_of_memory+0x44/0xc0
[1226832.057214]  [<ffffffff815a98b8>] ? page_fault+0x28/0x30

Thinking about it more, I'm guessing Mesos might not actually know that this is getting killed for OOMing, but I thought it was worth looking into.
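
For what it's worth, the cgroup v1 memory controller can notify userspace of OOM events through an eventfd registered via cgroup.event_control against memory.oom_control, which is the kind of signal an agent could use to distinguish an OOM kill from an ordinary executor failure. A rough Go sketch of that mechanism, assuming the same cgroup path as above (not Mesos' actual implementation):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical cgroup path for the container seen in the dmesg output above.
	cgroup := "/sys/fs/cgroup/memory/mesos/a824e4e4-8c46-49b0-b2a3-73fa0f7af93b"

	// Open the memory controller's OOM control file.
	oomControl, err := os.Open(filepath.Join(cgroup, "memory.oom_control"))
	if err != nil {
		panic(err)
	}
	defer oomControl.Close()

	// Create an eventfd the kernel will signal on each OOM event in the cgroup.
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(efd)

	// Register the eventfd with the cgroup: "<eventfd> <oom_control fd>".
	reg := fmt.Sprintf("%d %d", efd, oomControl.Fd())
	if err := os.WriteFile(filepath.Join(cgroup, "cgroup.event_control"), []byte(reg), 0o600); err != nil {
		panic(err)
	}

	// Block until the kernel reports an OOM event; the 8-byte read is the event counter.
	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil {
		panic(err)
	}
	fmt.Printf("OOM events in %s: %d\n", cgroup, binary.LittleEndian.Uint64(buf))
}
```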

@varungup90
Contributor

@michaeljs1990 Thanks for raising the concern. I looked into it more, and Mesos exposes detailed reasons for why the container was terminated, which include the memory limit (ln 2607). I will fix this.
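
For context, the Mesos TaskStatus protobuf carries a Reason enum with values such as REASON_CONTAINER_LIMITATION_MEMORY alongside REASON_COMMAND_EXECUTOR_FAILED, so surfacing the right message is largely a matter of propagating that reason to the UI. A rough Go sketch of what such a mapping could look like; the function name and message strings are illustrative, not Peloton's actual code:

```go
package main

import "fmt"

// failureMessage translates a Mesos task-status reason string into a
// message suitable for the UI. The reason names below come from the
// Mesos TaskStatus.Reason enum; the wording of the messages is illustrative.
func failureMessage(reason string) string {
	switch reason {
	case "REASON_CONTAINER_LIMITATION_MEMORY":
		return "Task killed: container exceeded its memory limit (OOM)"
	case "REASON_CONTAINER_LIMITATION_DISK":
		return "Task killed: container exceeded its disk quota"
	case "REASON_CONTAINER_LIMITATION":
		return "Task killed: container exceeded a resource limit"
	case "REASON_COMMAND_EXECUTOR_FAILED":
		return "Task failed: command executor exited with an error"
	default:
		return fmt.Sprintf("Task failed: %s", reason)
	}
}

func main() {
	// With the reason propagated, the OOM case from this issue would render as:
	fmt.Println(failureMessage("REASON_CONTAINER_LIMITATION_MEMORY"))
}
```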

@michaeljs1990
Author

Awesome to hear! Thanks.

@michaeljs1990
Author

michaeljs1990 commented Jun 6, 2019

Was this added? I believe I've been seeing better error messages in the UI around this now, or possibly I'm imagining things.

@mabansal
Collaborator

@vargup did you add the change?

@talaniz

talaniz commented Jul 1, 2019

@vargup Bump, can you please advise whether this was changed?
