-
Notifications
You must be signed in to change notification settings - Fork 29
/
morphing.html
146 lines (130 loc) · 6.46 KB
/
morphing.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Morphing — Varuna documentation</title>
<link rel="stylesheet" href="_static/alabaster.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: './',
VERSION: '',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true,
SOURCELINK_SUFFIX: '.txt'
};
</script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="prev" title="Profiling for Varuna" href="profiler.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
<meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" />
</head>
<body>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="morphing">
<h1>Morphing<a class="headerlink" href="#morphing" title="Permalink to this headline">¶</a></h1>
<p>Varuna enables distributed training on a changing set of resources, as the list
of machines available may grow or shrink. This is done by “morphing” - reconfiguring the
training job to process the total effective batch size over the new resources. Varuna performs
morphing by checkpointing and restarting efficiently, which requires that the training job has access to
a long-living ‘manager’ machine and a global storage for all workers.</p>
<p>The manager launches the <cite>run_varuna</cite> command, detects changes in the available resource set, slow GPUs
or transient errors in the job, and cooridinates checkpoint/restarts. If desirable, the manager
can be notified of an upcoming preemption (loss of a machine) through the function <cite>notify_manager_preempt</cite>.
For example in Azure, a ‘preempt’ signal is issued with preemption time.</p>
<p>To enable morphing, the user must make some modifications to their script:</p>
<ul class="simple">
<li><dl class="first docutils">
<dt>An additional <cite>resume_step</cite> argument is passed to each worker for restarts. (So that there</dt>
<dd>are no race conditions while checking this step from the global storage)</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt>A simple signal handler for <cite>SIGUSR1</cite> in the workers to call varuna’s <cite>on_demand_checkpoint</cite> (<a class="reference internal" href="varuna.html"><span class="doc">The Varuna class</span></a>)</dt>
<dd>and exit. The checkpointing may fail if workers are lost during the call.</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt>(recommended) With morphing, <cite>Varuna</cite> checkpointing should be enabled with background copying and sharding flags for</dt>
<dd>faster checkpointing. The checkpoint frequency should be high to avoid loss of compute on checkpoint/restarts
(in case on demand checkpoints fail).</dd>
</dl>
</li>
</ul>
<p>These changes are illustrated in the megatron example.</p>
<p>The key idea behind morphing is to re-distribute the total <cite>batch_size</cite> specified by the user accross
pipeline parallel stages and data parallel replicas. To do this efficiently, it is recommended to use
auto-configuration of the dimensions of pipeline and data parallelism as well as the micro-batch size.
<cite>AutoConfig</cite> by varuna is enabled if these arguments (<cite>nstages</cite> and <cite>chunk_size</cite>) are not specified
while launching <cite>run_varuna</cite>. This estimates the best varuna configuration at each point and requires
the user to run profiling before training and specify the location of stored profiles to the
launcher. (see <a class="reference internal" href="profiler.html"><span class="doc">Profiling for Varuna</span></a>)</p>
<div class="section" id="slow-gpu-detection">
<h2>Slow GPU detection<a class="headerlink" href="#slow-gpu-detection" title="Permalink to this headline">¶</a></h2>
<p>With low-priority VMs, a user might see faulty “straggler” GPUs that have significantly longer compute
times than the others. These are detected by varuna when morphing is enabled by the manager.
The IPs with the slow GPUs are written to a file “slow_machines.out”. The user may listen on this file
to remove machines with faulty GPUs.</p>
</div>
</div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<h3><a href="index.html">Table Of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">Morphing</a><ul>
<li><a class="reference internal" href="#slow-gpu-detection">Slow GPU detection</a></li>
</ul>
</li>
</ul>
<div class="relations">
<h3>Related Topics</h3>
<ul>
<li><a href="index.html">Documentation overview</a><ul>
<li>Previous: <a href="profiler.html" title="previous chapter">Profiling for Varuna</a></li>
</ul></li>
</ul>
</div>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="_sources/morphing.rst.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<div id="searchbox" style="display: none" role="search">
<h3>Quick search</h3>
<form class="search" action="search.html" method="get">
<div><input type="text" name="q" /></div>
<div><input type="submit" value="Go" /></div>
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer">
©2021, Nitika Saran.
|
Powered by <a href="http://sphinx-doc.org/">Sphinx 1.6.7</a>
& <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.8</a>
|
<a href="_sources/morphing.rst.txt"
rel="nofollow">Page source</a>
</div>
</body>
</html>