The sched-pipeline scheduling application is compiled and installed when the Python package is installed.
Running the application requires inputs that describe: (1) models, (2) device types, and (3) available devices. For simplicity, we represent this information in YAML files, which are essentially hierarchies of maps, lists, and primitive types.
The information in these files is derived from model and device profiling; see README_Profiler.md. The profiler helpers can produce files (1) and (2). File (3) maps device type names to concrete hosts and is straightforward to create by hand (or by automated means, depending on the deployment context).
By default, sched-pipeline produces a schedule based on the input YAML files described below.
For detailed usage instructions, see the help output:
sched-pipeline -h
The schedule is reported in YAML format as a list.
Each list entry is a single-key map: the key is a host, and the value is a two-element list with the first and last layers (both inclusive) scheduled for that host.
The list is sorted in stage order, i.e., the layers in stage n precede the layers in stage n+1.
Only hosts that are assigned layers are included in the output.
For example:
- mb-0: [1, 6]
- mb-1: [7, 12]
- mb-2: [13, 18]
- mb-3: [19, 24]
- rcc-0: [25, 30]
- rcc-1: [31, 36]
- rcc-2: [37, 42]
- rcc-3: [43, 48]
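The output format above can be consumed programmatically. The following is a minimal sketch, not part of sched-pipeline itself: it works on the data structure a YAML parser would return for the schedule, and the helper names `schedule_stages` and `is_contiguous` are hypothetical. It assumes, based on the example, that layer ranges are inclusive and that consecutive stages cover consecutive layers.

```python
from typing import Dict, List, Tuple

# A parsed schedule: a list of single-key maps, e.g. [{"mb-0": [1, 6]}, ...].
Schedule = List[Dict[str, List[int]]]
Stage = Tuple[str, int, int]


def schedule_stages(schedule: Schedule) -> List[Stage]:
    """Flatten a parsed schedule into (host, first_layer, last_layer) tuples."""
    stages: List[Stage] = []
    for entry in schedule:
        if len(entry) != 1:
            raise ValueError(f"expected exactly one host per entry: {entry!r}")
        (host, layers), = entry.items()
        if len(layers) != 2:
            raise ValueError(f"expected [start, end] for host {host}: {layers!r}")
        stages.append((host, layers[0], layers[1]))
    return stages


def is_contiguous(stages: List[Stage]) -> bool:
    """Return True if every stage starts right after the previous stage ends."""
    return all(cur[1] == prev[2] + 1 for prev, cur in zip(stages, stages[1:]))


# The example schedule from above, as it would look after parsing.
schedule = [
    {"mb-0": [1, 6]}, {"mb-1": [7, 12]}, {"mb-2": [13, 18]}, {"mb-3": [19, 24]},
    {"rcc-0": [25, 30]}, {"rcc-1": [31, 36]}, {"rcc-2": [37, 42]}, {"rcc-3": [43, 48]},
]
stages = schedule_stages(schedule)
print(is_contiguous(stages))                                # True
print(sum(last - first + 1 for _, first, last in stages))   # 48
```

A YAML library (for example PyYAML's `yaml.safe_load`) could be used to obtain the `schedule` structure from the application's output.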
Default name: models.yml.
This file is a mapping of model names to model properties. Each unique model name entry is a map with keys and values:
- layers (int): number of layers in the model.
- mem_MB (List[float]): memory requirements for each layer. Length must match the layers value.
- parameters_in (int): total number of per-microbatch input parameters for the first layer.
- parameters_out (List[int]): total number of per-microbatch output parameters for each layer. Length must match the layers value.
Note: When using a microbatch size > 1, the actual parameter counts are a multiple of the values specified. The scheduling application accounts for this.
For example:
DummyModel:
  layers: 0
  mem_MB: []
  parameters_in: 0
  parameters_out: []
google/vit-base-patch16-224:
  layers: 48
  mem_MB: [26.808319999999995, 18.927615999999986, 26.488831999999988, 25.927679999999995,
    24.289280000000005, 18.919423999999992, 26.550271999999993, 25.858047999999997,
    24.428544000000002, 19.128320000000002, 26.763263999999992, 26.071039999999996,
    24.506367999999995, 19.20204799999999, 26.939391999999998, 26.210303999999994,
    24.760319999999993, 19.275775999999993, 26.886144, 26.218496000000002, 24.694784,
    19.456000000000003, 27.02131200000001, 26.333184000000003, 24.829952000000006,
    19.460096000000007, 27.02131200000001, 26.402816, 24.83404800000001, 19.468288,
    27.033600000000007, 26.41100800000001, 24.842240000000004, 19.47648000000001,
    27.041792, 26.419200000000004, 24.850432000000012, 19.484672000000003, 27.045888000000005,
    26.42739200000001, 24.854528000000002, 19.488768000000007, 27.04998400000001,
    26.431488, 24.866816, 19.49286400000001, 27.058176000000003, 33.062912]
  parameters_in: 150528
  parameters_out: [302592, 151296, 756480, 151296, 302592, 151296, 756480, 151296,
    302592, 151296, 756480, 151296, 302592, 151296, 756480, 151296, 302592, 151296,
    756480, 151296, 302592, 151296, 756480, 151296, 302592, 151296, 756480, 151296,
    302592, 151296, 756480, 151296, 302592, 151296, 756480, 151296, 302592, 151296,
    756480, 151296, 302592, 151296, 756480, 151296, 302592, 151296, 756480, 1000]
google/vit-large-patch16-224:
  layers: 96
  mem_MB: [34.287615999999986, 21.860351999999992, 34.873344, 34.11148800000001, 30.98214400000002,
    21.860352000000006, 34.87334400000002, 34.103296000000014, 30.978047999999987,
    21.852159999999998, 34.86515199999998, 34.103295999999986, 30.978047999999987,
    21.856255999999988, 34.869247999999985, 34.09100799999999, 30.973951999999997,
    21.852159999999998, 34.869248, 34.103296, 30.982144000000005, 21.852159999999998,
    34.045952000000014, 34.095104000000006, 30.978047999999987, 21.856256000000002,
    34.869247999999985, 34.099199999999996, 30.150655999999998, 21.028864, 34.041855999999996,
    33.28, 30.154752000000002, 21.028864000000013, 34.04185600000001, 33.27180800000001,
    30.146560000000008, 21.028864, 34.045952000000014, 33.27590400000001, 30.150656000000012,
    21.028864, 34.045951999999986, 33.27180799999999, 30.146559999999994, 21.028864,
    34.03775999999999, 33.27180799999999, 30.146559999999994, 21.028864, 34.041855999999996,
    33.275904, 30.146560000000008, 21.028864, 34.04185600000001, 33.27590400000001,
    30.146560000000008, 21.032960000000003, 34.05004799999999, 33.28409599999999,
    30.146560000000008, 21.577727999999993, 34.59071999999999, 33.759232, 30.691328000000013,
    21.577727999999993, 34.525183999999996, 33.96812799999999, 30.633983999999998,
    21.671936000000002, 34.684928, 33.91488, 30.752768000000003, 21.671936000000002,
    34.582527999999996, 34.02547200000001, 30.793728, 21.774336000000005, 34.856960000000015,
    33.95584000000001, 30.90841599999999, 21.839872, 34.787328, 34.08691200000001,
    30.904319999999984, 21.839872, 34.85286400000001, 34.037760000000006, 30.973951999999997,
    21.848064000000008, 34.86515200000001, 34.095104000000006, 30.97395200000001,
    21.852159999999998, 34.86515200000001, 43.02847999999999]
  parameters_in: 150528
  parameters_out: [403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728,
    1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728,
    1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728,
    1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728,
    1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 1000]
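The length constraints described for the models file are easy to check before running the scheduler. The following is a minimal sketch, assuming the file has already been parsed into a Python dict; the function names `check_model` and `scaled_parameters_out` are hypothetical and not part of the package. The scaling helper additionally assumes that the multiple mentioned in the note is the microbatch size itself.

```python
from typing import Any, Dict, List


def check_model(name: str, props: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one model entry (empty if none)."""
    errors: List[str] = []
    layers = props["layers"]
    # Both per-layer lists must have exactly one value per layer.
    for key in ("mem_MB", "parameters_out"):
        if len(props[key]) != layers:
            errors.append(
                f"{name}: {key} has {len(props[key])} values, expected {layers}"
            )
    return errors


def scaled_parameters_out(props: Dict[str, Any], microbatch_size: int) -> List[int]:
    """Per-layer output parameter counts for a microbatch size (assumed linear)."""
    return [count * microbatch_size for count in props["parameters_out"]]


# A small hypothetical entry in the same shape as the models file.
tiny_model = {
    "layers": 2,
    "mem_MB": [1.5, 2.5],
    "parameters_in": 4,
    "parameters_out": [8, 2],
}
print(check_model("TinyModel", tiny_model))      # []
print(scaled_parameters_out(tiny_model, 8))      # [64, 16]
```

Note that the DummyModel entry above passes this check as well, since its lists are empty and its layers value is 0.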
Default name: device_types.yml.
This file is a mapping of device type names to their properties and model profiling results. Each unique device type name entry is a map with keys and values:
- bw_Mbps (int): device bandwidth capability in Megabits per second.
- mem_MB (int): device memory capacity in Megabytes. Users should consider using a value less than the actual physical memory to allow for OS, application, and other runtime overheads.
- model_profiles (Map[string, List[Map[string, List]]]): maps model names to a list of maps, where each map in the list contains profiling configuration properties and device-specific profiling results with keys and values:
  - batch_size (int): microbatch size.
  - dtype (string): PyTorch datatype.
  - time_s (List[float]): time in seconds to process each model layer. Length must match the layers value for the matching model in the Models YAML file.
For example:
DummyDeviceType:
  bw_Mbps: 0
  mem_MB: 0
  model_profiles:
Minnowboard-E3845:
  bw_Mbps: 1000
  mem_MB: 2048
  model_profiles:
    google/vit-base-patch16-224:
    - batch_size: 8
      dtype: torch.float32
      time_s: [0.46270322799682617, 0.08051528930664062, 0.35246331691741944, 0.3039305925369263,
        0.3491475820541382, 0.08227810859680176, 0.351016902923584, 0.2980557680130005,
        0.33620967864990237, 0.08834941387176513, 0.3524341106414795, 0.2971898794174194,
        0.33597075939178467, 0.0820955753326416, 0.35834517478942873, 0.2971163272857666,
        0.3482602596282959, 0.08181514739990234, 0.35093743801116944, 0.3065039157867432,
        0.33689374923706056, 0.0821951150894165, 0.35376605987548826, 0.29595146179199217,
        0.3442469596862793, 0.08107240200042724, 0.35219473838806153, 0.3041703701019287,
        0.3344552040100098, 0.08967757225036621, 0.3515634059906006, 0.3086558818817139,
        0.3359591245651245, 0.08169562816619873, 0.35871760845184325, 0.30067503452301025,
        0.3408169746398926, 0.08164689540863038, 0.35212528705596924, 0.3067019462585449,
        0.33519535064697265, 0.08180253505706787, 0.3570704936981201, 0.2972059965133667,
        0.3341459035873413, 0.08241617679595947, 0.3605750560760498, 0.3015031576156616]
    google/vit-large-patch16-224:
    - batch_size: 8
      dtype: torch.float32
      time_s: [0.7370278835296631, 0.14219119548797607, 0.6064619302749634, 0.5298462390899659,
        0.5576925992965698, 0.14267463684082032, 0.6010080099105835, 0.5382300615310669,
        0.5722937822341919, 0.13994674682617186, 0.607061219215393, 0.5289065122604371,
        0.5596141338348388, 0.14043643474578857, 0.5998957395553589, 0.5349064111709595,
        0.5698701858520507, 0.13965086936950682, 0.6048509359359742, 0.5292475461959839,
        0.5626368761062622, 0.14027750492095947, 0.6055158615112305, 0.5379684448242188,
        0.5682665348052979, 0.14262542724609376, 0.6071671485900879, 0.5317425727844238,
        0.5657262086868287, 0.14081220626831054, 0.6041595935821533, 0.545221495628357,
        0.5646415233612061, 0.14060497283935547, 0.6098709821701049, 0.5317424774169922,
        0.5644603729248047, 0.14234611988067628, 0.6127593755722046, 0.538119101524353,
        0.5603784799575806, 0.14067206382751465, 0.6014557123184204, 0.5267079830169678,
        0.56812584400177, 0.14128355979919432, 0.6111029148101806, 0.5405333042144775,
        0.5594105243682861, 0.14134061336517334, 0.6002681255340576, 0.5325106620788574,
        0.5693668365478516, 0.1424561023712158, 0.6041918992996216, 0.5380329847335815,
        0.5647742986679077, 0.1489267349243164, 0.5975847721099854, 0.5356852293014527,
        0.570663332939148, 0.140134859085083, 0.6132462024688721, 0.5475344896316529,
        0.5567568778991699, 0.1498802423477173, 0.6009885787963867, 0.5272369384765625,
        0.5705178022384644, 0.14125945568084716, 0.6114472866058349, 0.5361553430557251,
        0.5657816648483276, 0.14820642471313478, 0.6000842809677124, 0.5332064628601074,
        0.5666147947311402, 0.13970253467559815, 0.6074207782745361, 0.5382495403289795,
        0.558089303970337, 0.1400979995727539, 0.60003662109375, 0.5357809782028198,
        0.5718890190124511, 0.14127144813537598, 0.6088854551315308, 0.5300276517868042,
        0.5606921911239624, 0.1417999744415283, 0.6010080099105835, 0.5390705347061158,
        0.577080512046814, 0.141274356842041, 0.6099730253219604, 0.5324676752090454]
RCC-VE-C2000:
  bw_Mbps: 1000
  mem_MB: 8192
  model_profiles:
    google/vit-base-patch16-224:
    - batch_size: 8
      dtype: torch.float32
      time_s: [0.37087414264678953, 0.0655491828918457, 0.2822323560714722, 0.23820888996124268,
        0.26973614692687986, 0.06453275680541992, 0.281805157661438, 0.2390226364135742,
        0.2694431781768799, 0.06397416591644287, 0.2822160243988037, 0.23875000476837158,
        0.27022428512573243, 0.06376738548278808, 0.2834264039993286, 0.2382676601409912,
        0.2690746307373047, 0.06357581615447998, 0.28319170475006106, 0.2380734920501709,
        0.26836357116699217, 0.06478390693664551, 0.2812931060791016, 0.2382678508758545,
        0.2691538333892822, 0.06470081806182862, 0.2814581871032715, 0.23822240829467772,
        0.2723070621490479, 0.06475505828857422, 0.2830230951309204, 0.23807191848754883,
        0.2696284055709839, 0.06475844383239746, 0.28170797824859617, 0.24239492416381836,
        0.2679957628250122, 0.06474094390869141, 0.2829035043716431, 0.23935911655426026,
        0.27569262981414794, 0.06489593982696533, 0.2889456033706665, 0.2413492202758789,
        0.2704929828643799, 0.06354374885559082, 0.2837244749069214, 0.24390978813171388]
    google/vit-large-patch16-224:
    - batch_size: 8
      dtype: torch.float32
      time_s: [0.5827305793762207, 0.11447184085845948, 0.4874413251876831, 0.43192110061645506,
        0.45414621829986573, 0.11448442935943604, 0.488907790184021, 0.43297643661499025,
        0.45159590244293213, 0.11437134742736817, 0.4854918956756592, 0.434900164604187,
        0.44939398765563965, 0.11359941959381104, 0.48420276641845705, 0.4322864770889282,
        0.4463037014007568, 0.1139305591583252, 0.4835365295410156, 0.43552446365356445,
        0.45094101428985595, 0.11440064907073974, 0.4878371238708496, 0.43801352977752683,
        0.4539456367492676, 0.11601903438568115, 0.4887869834899902, 0.43662848472595217,
        0.4527653694152832, 0.11409034729003906, 0.48731725215911864, 0.4371277570724487,
        0.4497154474258423, 0.11410114765167237, 0.4903388500213623, 0.4329125165939331,
        0.4473557472229004, 0.11284692287445068, 0.48941740989685056, 0.4334478139877319,
        0.4493544578552246, 0.11308140754699707, 0.4854674577713013, 0.43252930641174314,
        0.4533524751663208, 0.11461765766143799, 0.4840975046157837, 0.4328861474990845,
        0.4519708871841431, 0.11398425102233886, 0.48682806491851804, 0.43622119426727296,
        0.45392632484436035, 0.114825439453125, 0.4881617546081543, 0.4374098777770996,
        0.4514464855194092, 0.11506931781768799, 0.4865990161895752, 0.43652758598327634,
        0.4544330358505249, 0.11377334594726562, 0.48940234184265136, 0.4335435390472412,
        0.4524759531021118, 0.11278202533721923, 0.48830137252807615, 0.43258821964263916,
        0.4586972713470459, 0.11466395854949951, 0.4877305030822754, 0.4355809211730957,
        0.448706316947937, 0.11413455009460449, 0.48580987453460694, 0.43746769428253174,
        0.45195415019989016, 0.11435890197753906, 0.4878537654876709, 0.4340193748474121,
        0.4511066436767578, 0.11468629837036133, 0.4865649461746216, 0.43074941635131836,
        0.4491808176040649, 0.11199946403503418, 0.4876744747161865, 0.43846864700317384,
        0.4477973461151123, 0.114066743850708, 0.486585807800293, 0.43675496578216555,
        0.4586408376693726, 0.1126516342163086, 0.48655402660369873, 0.43992033004760744]
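Because each time_s list must line up with the layers value in the Models YAML file, a cross-file check can catch mismatched profiles early. The following is a minimal sketch, assuming both files have already been parsed into Python dicts; `check_profiles` is a hypothetical helper, not part of the package. It treats a model_profiles key with no value (as in DummyDeviceType above, which a YAML parser reads as null) as having no profiles.

```python
from typing import Any, Dict, List


def check_profiles(models: Dict[str, Any], device_types: Dict[str, Any]) -> List[str]:
    """Return problems found when matching device profiles against models."""
    errors: List[str] = []
    for dev_name, dev in device_types.items():
        # An empty `model_profiles:` key parses as None, so fall back to {}.
        profiles = dev.get("model_profiles") or {}
        for model_name, entries in profiles.items():
            if model_name not in models:
                errors.append(f"{dev_name}: no model named {model_name}")
                continue
            layers = models[model_name]["layers"]
            for entry in entries:
                if len(entry["time_s"]) != layers:
                    errors.append(
                        f"{dev_name}/{model_name} (batch_size={entry['batch_size']}): "
                        f"time_s has {len(entry['time_s'])} values, expected {layers}"
                    )
    return errors


# Small hypothetical inputs in the same shape as the two YAML files.
models = {"TinyModel": {"layers": 2}}
device_types = {
    "EmptyType": {"bw_Mbps": 0, "mem_MB": 0, "model_profiles": None},
    "TypeA": {
        "bw_Mbps": 1000,
        "mem_MB": 2048,
        "model_profiles": {
            "TinyModel": [
                {"batch_size": 8, "dtype": "torch.float32", "time_s": [0.1, 0.2]},
            ],
        },
    },
}
print(check_profiles(models, device_types))  # []
```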
Default name: devices.yml.
This file is a mapping of device type names to lists of hosts. Each unique device type name entry is a list of hosts (e.g., host names or IP addresses that the runtime can resolve).
For example:
DummyDeviceType: []
Minnowboard-E3845:
- mb-0
- mb-1
- mb-2
- mb-3
RCC-VE-C2000:
- rcc-0
- rcc-1
- rcc-2
- rcc-3
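Since this file is keyed by device type name, its keys presumably need to match entries in the device types file. The following is a minimal sketch of that consistency check, assuming both files have already been parsed into Python dicts; the helper names `unknown_device_types` and `host_counts` and the sample type names are hypothetical.

```python
from typing import Any, Dict, List


def unknown_device_types(
    devices: Dict[str, List[str]], device_types: Dict[str, Any]
) -> List[str]:
    """Device type names used in the devices file but not defined as types."""
    return sorted(name for name in devices if name not in device_types)


def host_counts(devices: Dict[str, List[str]]) -> Dict[str, int]:
    """Number of available hosts per device type."""
    return {name: len(hosts) for name, hosts in devices.items()}


# Small hypothetical inputs in the same shape as the two YAML files.
device_types = {"TypeA": {"bw_Mbps": 1000, "mem_MB": 2048, "model_profiles": None}}
devices = {
    "TypeA": ["host-0", "host-1"],
    "TypeB": ["host-2"],  # not defined in device_types
}
print(unknown_device_types(devices, device_types))  # ['TypeB']
print(host_counts(devices))                         # {'TypeA': 2, 'TypeB': 1}
```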
You may build the scheduler application manually, e.g., to experiment with it in isolation, but it will not be found by the runtime scheduler until the Python package is rebuilt/reinstalled.
To build manually:
mkdir src-native/build
cd src-native/build
cmake ..
cmake --build .