forked from microsoft/varuna
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlaunching.html
129 lines (113 loc) · 5.8 KB
/
launching.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Launching Varuna — Varuna documentation</title>
<link rel="stylesheet" href="_static/alabaster.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: './',
VERSION: '',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true,
SOURCELINK_SUFFIX: '.txt'
};
</script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="CutPoints" href="cutpoint.html" />
<link rel="prev" title="Varuna documentation" href="index.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
<meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" />
</head>
<body>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="launching-varuna">
<h1>Launching Varuna<a class="headerlink" href="#launching-varuna" title="Permalink to this headline">¶</a></h1>
<p>Varuna distributed training is run as a set of processes for each GPU on each machine. It uses PyTorch’s
distributed framework and must be used with the gloo backend.
This distributed process is triggered from a single machine with a list of reachable machines (IPs) as
<cite>machine_list</cite> and <cite>gpus_per_node</cite> GPUs on each node. This triggering machine is usually the ‘manager’
(as explained in <a class="reference internal" href="morphing.html"><span class="doc">Morphing</span></a>). Morphing is enabled by default, to disable it use the <cite>no_morphing</cite> flag.
Training with varuna can be run with the <cite>run_varuna</cite> module as follows:</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>python -m varuna.run_varuna --machine_list <file_with_ips> --gpus_per_node <num_gpus_per_node>
--batch-size <total_effective_batch_size> --nstages <number_of_pipeline_stages>
--chunk_size <micro_batch_size_for_pipeline>
--code_dir <working_dir_for_training> user_training_script.py <...user args...>
</pre></div>
</div>
<p>This expects all machines in the <cite>machine_list</cite> to be reachable and to be
set up with necessary code/libraries in <cite>code_dir</cite>. The user’s code should also
be modified to add <cite>CutPoint`s and use the `Varuna</cite> training class.
The job is launched with all workers (<cite>gpus_per_node x <num-servers></cite> in total)
running the <cite>user_training_script</cite> with user args and arguments passed by varuna’s launcher.
Any environment variables that the user wishes to pass to each worker may be specified
in an <cite>env_file</cite> passed to the launcher.</p>
<p>These arguments passed by the launcher to the user training script for Varuna
must be parsed by user’s training script and passed during <cite>Varuna</cite> initialisation:</p>
<ul class="simple">
<li>rank: process rank in overall distributed job</li>
<li>local_rank: process rank in the local node</li>
<li>stage_to_rank_map: varuna config info about stage placement</li>
<li>chunk_size: micro batch size for Varuna pipeline</li>
<li>batch-size: per process batch size</li>
</ul>
<p>The arguments for number of pipeline stages <cite>nstages</cite> and micro-batch size <cite>chunk_size</cite> can be
omitted if the user wishes Varuna to determine the most optimal configuration for these.
This requires the user to run profiling before training and pass the location of stored
profiles to the launcher. (see <a class="reference internal" href="profiler.html"><span class="doc">Profiling for Varuna</span></a>)</p>
</div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper"><div class="relations">
<h3>Related Topics</h3>
<ul>
<li><a href="index.html">Documentation overview</a><ul>
<li>Previous: <a href="index.html" title="previous chapter">Varuna documentation</a></li>
<li>Next: <a href="cutpoint.html" title="next chapter">CutPoints</a></li>
</ul></li>
</ul>
</div>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="_sources/launching.rst.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<div id="searchbox" style="display: none" role="search">
<h3>Quick search</h3>
<form class="search" action="search.html" method="get">
<div><input type="text" name="q" /></div>
<div><input type="submit" value="Go" /></div>
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer">
©2021, Nitika Saran.
|
Powered by <a href="http://sphinx-doc.org/">Sphinx 1.6.7</a>
& <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.8</a>
|
<a href="_sources/launching.rst.txt"
rel="nofollow">Page source</a>
</div>
</body>
</html>