Scheduler

The sched-pipeline scheduling application is compiled and installed when the Python package is installed.

Running the application requires inputs that describe: (1) models, (2) device types, and (3) available devices. For simplicity, we represent this information in YAML files, which are essentially hierarchies of maps, lists, and primitive types.

The information in these files is derived from model and device profiling - see README_Profiler.md. The profiler helpers can produce files (1) and (2). File (3) maps device type names to concrete hosts and is straightforward to create by hand (or other automated means, depending on the deployment context).

Usage

By default, sched-pipeline produces a schedule based on the input YAML files described below.

For detailed usage instructions, see the help output:

sched-pipeline -h

Example Schedule

The schedule is reported in YAML format as a list of single-entry maps. In each entry, the key is a host and the value is a length-2 list holding the first and last layers scheduled on that host (inclusive). The list is sorted in stage order, i.e., the layers in stage n precede the layers in stage n+1. Only hosts that are assigned layers are included in the output.

For example:

- mb-0: [1, 6]
- mb-1: [7, 12]
- mb-2: [13, 18]
- mb-3: [19, 24]
- rcc-0: [25, 30]
- rcc-1: [31, 36]
- rcc-2: [37, 42]
- rcc-3: [43, 48]
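
The structure described above can be sanity-checked with a few lines of Python. The following is a minimal sketch, not part of sched-pipeline: the function name `check_schedule` is hypothetical, and it assumes layers are numbered from 1 with inclusive ranges, as in the example. The schedule is written here as a Python literal; in practice it could be loaded from the YAML output with a parser such as PyYAML's `yaml.safe_load`.

```python
def check_schedule(schedule, num_layers):
    """Return a list of problems found in a schedule (empty if none).

    `schedule` is a list of single-entry dicts: {host: [start, end]}.
    Assumes 1-indexed, inclusive layer ranges (as in the example output).
    """
    problems = []
    next_layer = 1  # first layer the next stage is expected to start at
    for stage, entry in enumerate(schedule):
        if len(entry) != 1:
            problems.append(f"stage {stage}: expected one host, got {len(entry)}")
            continue
        host, (start, end) = next(iter(entry.items()))
        if start != next_layer:
            problems.append(
                f"stage {stage} ({host}): starts at {start}, expected {next_layer}"
            )
        if end < start:
            problems.append(f"stage {stage} ({host}): end {end} precedes start {start}")
        next_layer = end + 1
    if next_layer != num_layers + 1:
        problems.append(
            f"schedule ends at layer {next_layer - 1}, expected {num_layers}"
        )
    return problems


# The example schedule above, for a 48-layer model.
example = [
    {"mb-0": [1, 6]}, {"mb-1": [7, 12]}, {"mb-2": [13, 18]}, {"mb-3": [19, 24]},
    {"rcc-0": [25, 30]}, {"rcc-1": [31, 36]}, {"rcc-2": [37, 42]}, {"rcc-3": [43, 48]},
]
print(check_schedule(example, 48))
```

An empty list means every stage starts where the previous one ended and the final stage ends at the last layer.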

Models

Default name: models.yml.

This file is a mapping of model names to model properties. Each unique model name entry is a map with keys and values:

  • layers: int: number of layers in the model.
  • mem_MB: List[float]: memory requirements for each layer. Length must match layers value.
  • parameters_in: int: total number of per-microbatch input parameters for the first layer.
  • parameters_out: List[int]: total number of per-microbatch output parameters for each layer. Length must match layers value.

Note: When the microbatch size is greater than 1, the actual parameter counts are a multiple of the values specified. The scheduling application accounts for this.

For example:

DummyModel:
  layers: 0
  mem_MB: []
  parameters_in: 0
  parameters_out: []
google/vit-base-patch16-224:
  layers: 48
  mem_MB: [26.808319999999995, 18.927615999999986, 26.488831999999988, 25.927679999999995,
    24.289280000000005, 18.919423999999992, 26.550271999999993, 25.858047999999997,
    24.428544000000002, 19.128320000000002, 26.763263999999992, 26.071039999999996,
    24.506367999999995, 19.20204799999999, 26.939391999999998, 26.210303999999994,
    24.760319999999993, 19.275775999999993, 26.886144, 26.218496000000002, 24.694784,
    19.456000000000003, 27.02131200000001, 26.333184000000003, 24.829952000000006,
    19.460096000000007, 27.02131200000001, 26.402816, 24.83404800000001, 19.468288,
    27.033600000000007, 26.41100800000001, 24.842240000000004, 19.47648000000001,
    27.041792, 26.419200000000004, 24.850432000000012, 19.484672000000003, 27.045888000000005,
    26.42739200000001, 24.854528000000002, 19.488768000000007, 27.04998400000001,
    26.431488, 24.866816, 19.49286400000001, 27.058176000000003, 33.062912]
  parameters_in: 150528
  parameters_out: [302592, 151296, 756480, 151296, 302592, 151296, 756480, 151296,
    302592, 151296, 756480, 151296, 302592, 151296, 756480, 151296, 302592, 151296,
    756480, 151296, 302592, 151296, 756480, 151296, 302592, 151296, 756480, 151296,
    302592, 151296, 756480, 151296, 302592, 151296, 756480, 151296, 302592, 151296,
    756480, 151296, 302592, 151296, 756480, 151296, 302592, 151296, 756480, 1000]
google/vit-large-patch16-224:
  layers: 96
  mem_MB: [34.287615999999986, 21.860351999999992, 34.873344, 34.11148800000001, 30.98214400000002,
    21.860352000000006, 34.87334400000002, 34.103296000000014, 30.978047999999987,
    21.852159999999998, 34.86515199999998, 34.103295999999986, 30.978047999999987,
    21.856255999999988, 34.869247999999985, 34.09100799999999, 30.973951999999997,
    21.852159999999998, 34.869248, 34.103296, 30.982144000000005, 21.852159999999998,
    34.045952000000014, 34.095104000000006, 30.978047999999987, 21.856256000000002,
    34.869247999999985, 34.099199999999996, 30.150655999999998, 21.028864, 34.041855999999996,
    33.28, 30.154752000000002, 21.028864000000013, 34.04185600000001, 33.27180800000001,
    30.146560000000008, 21.028864, 34.045952000000014, 33.27590400000001, 30.150656000000012,
    21.028864, 34.045951999999986, 33.27180799999999, 30.146559999999994, 21.028864,
    34.03775999999999, 33.27180799999999, 30.146559999999994, 21.028864, 34.041855999999996,
    33.275904, 30.146560000000008, 21.028864, 34.04185600000001, 33.27590400000001,
    30.146560000000008, 21.032960000000003, 34.05004799999999, 33.28409599999999,
    30.146560000000008, 21.577727999999993, 34.59071999999999, 33.759232, 30.691328000000013,
    21.577727999999993, 34.525183999999996, 33.96812799999999, 30.633983999999998,
    21.671936000000002, 34.684928, 33.91488, 30.752768000000003, 21.671936000000002,
    34.582527999999996, 34.02547200000001, 30.793728, 21.774336000000005, 34.856960000000015,
    33.95584000000001, 30.90841599999999, 21.839872, 34.787328, 34.08691200000001,
    30.904319999999984, 21.839872, 34.85286400000001, 34.037760000000006, 30.973951999999997,
    21.848064000000008, 34.86515200000001, 34.095104000000006, 30.97395200000001,
    21.852159999999998, 34.86515200000001, 43.02847999999999]
  parameters_in: 150528
  parameters_out: [403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728,
    1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728,
    1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728,
    1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728,
    1008640, 201728, 403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 201728,
    403456, 201728, 1008640, 201728, 403456, 201728, 1008640, 1000]
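
The length constraints and the microbatch note above can be illustrated with a short Python sketch. The helper names `validate_model` and `scaled_parameters` are hypothetical and not part of the scheduler. The scaling helper assumes the values in the file correspond to a microbatch size of 1 and that the multiplier is the microbatch size itself. Entries are shown as Python dicts, as a YAML parser would return them.

```python
def validate_model(name, props):
    """Return error strings for one entry of the models file."""
    errors = []
    layers = props["layers"]
    # Both per-layer lists must have exactly `layers` entries.
    for key in ("mem_MB", "parameters_out"):
        actual = len(props[key])
        if actual != layers:
            errors.append(f"{name}: {key} has {actual} entries, expected {layers}")
    return errors


def scaled_parameters(per_microbatch_value, microbatch_size):
    """Scale a parameter count by the microbatch size (assumed multiplier)."""
    return per_microbatch_value * microbatch_size


dummy = {"layers": 0, "mem_MB": [], "parameters_in": 0, "parameters_out": []}
print(validate_model("DummyModel", dummy))

# Input parameter count of the ViT examples at a microbatch size of 8.
print(scaled_parameters(150528, 8))
```

Under that assumption, the `parameters_in` value of 150528 corresponds to 1204224 input parameters at a microbatch size of 8.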

Device Types

Default name: device_types.yml.

This file is a mapping of device type names to their properties and model profiling results. Each unique device type name entry is a map with keys and values:

  • bw_Mbps: int: device bandwidth capability in Megabits per second.
  • mem_MB: int: device memory capacity in Megabytes. Users should consider using a value less than the actual physical memory to allow for OS, application, and other runtime overheads.
  • model_profiles: Map[string, List[Map[string, List]]]: maps model names to a list of maps, where each map in the list contains profiling configuration properties and device-specific profiling results with keys and values:
    • batch_size: int: microbatch size.
    • dtype: string: PyTorch datatype.
    • time_s: List[float]: time in seconds to process each model layer. Length must match layers value for the matching model in the Models YAML file.

For example:

DummyDeviceType:
  bw_Mbps: 0
  mem_MB: 0
  model_profiles:
Minnowboard-E3845:
  bw_Mbps: 1000
  mem_MB: 2048
  model_profiles:
    google/vit-base-patch16-224:
    - batch_size: 8
      dtype: torch.float32
      time_s: [0.46270322799682617, 0.08051528930664062, 0.35246331691741944, 0.3039305925369263,
        0.3491475820541382, 0.08227810859680176, 0.351016902923584, 0.2980557680130005,
        0.33620967864990237, 0.08834941387176513, 0.3524341106414795, 0.2971898794174194,
        0.33597075939178467, 0.0820955753326416, 0.35834517478942873, 0.2971163272857666,
        0.3482602596282959, 0.08181514739990234, 0.35093743801116944, 0.3065039157867432,
        0.33689374923706056, 0.0821951150894165, 0.35376605987548826, 0.29595146179199217,
        0.3442469596862793, 0.08107240200042724, 0.35219473838806153, 0.3041703701019287,
        0.3344552040100098, 0.08967757225036621, 0.3515634059906006, 0.3086558818817139,
        0.3359591245651245, 0.08169562816619873, 0.35871760845184325, 0.30067503452301025,
        0.3408169746398926, 0.08164689540863038, 0.35212528705596924, 0.3067019462585449,
        0.33519535064697265, 0.08180253505706787, 0.3570704936981201, 0.2972059965133667,
        0.3341459035873413, 0.08241617679595947, 0.3605750560760498, 0.3015031576156616]
    google/vit-large-patch16-224:
    - batch_size: 8
      dtype: torch.float32
      time_s: [0.7370278835296631, 0.14219119548797607, 0.6064619302749634, 0.5298462390899659,
        0.5576925992965698, 0.14267463684082032, 0.6010080099105835, 0.5382300615310669,
        0.5722937822341919, 0.13994674682617186, 0.607061219215393, 0.5289065122604371,
        0.5596141338348388, 0.14043643474578857, 0.5998957395553589, 0.5349064111709595,
        0.5698701858520507, 0.13965086936950682, 0.6048509359359742, 0.5292475461959839,
        0.5626368761062622, 0.14027750492095947, 0.6055158615112305, 0.5379684448242188,
        0.5682665348052979, 0.14262542724609376, 0.6071671485900879, 0.5317425727844238,
        0.5657262086868287, 0.14081220626831054, 0.6041595935821533, 0.545221495628357,
        0.5646415233612061, 0.14060497283935547, 0.6098709821701049, 0.5317424774169922,
        0.5644603729248047, 0.14234611988067628, 0.6127593755722046, 0.538119101524353,
        0.5603784799575806, 0.14067206382751465, 0.6014557123184204, 0.5267079830169678,
        0.56812584400177, 0.14128355979919432, 0.6111029148101806, 0.5405333042144775,
        0.5594105243682861, 0.14134061336517334, 0.6002681255340576, 0.5325106620788574,
        0.5693668365478516, 0.1424561023712158, 0.6041918992996216, 0.5380329847335815,
        0.5647742986679077, 0.1489267349243164, 0.5975847721099854, 0.5356852293014527,
        0.570663332939148, 0.140134859085083, 0.6132462024688721, 0.5475344896316529,
        0.5567568778991699, 0.1498802423477173, 0.6009885787963867, 0.5272369384765625,
        0.5705178022384644, 0.14125945568084716, 0.6114472866058349, 0.5361553430557251,
        0.5657816648483276, 0.14820642471313478, 0.6000842809677124, 0.5332064628601074,
        0.5666147947311402, 0.13970253467559815, 0.6074207782745361, 0.5382495403289795,
        0.558089303970337, 0.1400979995727539, 0.60003662109375, 0.5357809782028198,
        0.5718890190124511, 0.14127144813537598, 0.6088854551315308, 0.5300276517868042,
        0.5606921911239624, 0.1417999744415283, 0.6010080099105835, 0.5390705347061158,
        0.577080512046814, 0.141274356842041, 0.6099730253219604, 0.5324676752090454]
RCC-VE-C2000:
  bw_Mbps: 1000
  mem_MB: 8192
  model_profiles:
    google/vit-base-patch16-224:
    - batch_size: 8
      dtype: torch.float32
      time_s: [0.37087414264678953, 0.0655491828918457, 0.2822323560714722, 0.23820888996124268,
        0.26973614692687986, 0.06453275680541992, 0.281805157661438, 0.2390226364135742,
        0.2694431781768799, 0.06397416591644287, 0.2822160243988037, 0.23875000476837158,
        0.27022428512573243, 0.06376738548278808, 0.2834264039993286, 0.2382676601409912,
        0.2690746307373047, 0.06357581615447998, 0.28319170475006106, 0.2380734920501709,
        0.26836357116699217, 0.06478390693664551, 0.2812931060791016, 0.2382678508758545,
        0.2691538333892822, 0.06470081806182862, 0.2814581871032715, 0.23822240829467772,
        0.2723070621490479, 0.06475505828857422, 0.2830230951309204, 0.23807191848754883,
        0.2696284055709839, 0.06475844383239746, 0.28170797824859617, 0.24239492416381836,
        0.2679957628250122, 0.06474094390869141, 0.2829035043716431, 0.23935911655426026,
        0.27569262981414794, 0.06489593982696533, 0.2889456033706665, 0.2413492202758789,
        0.2704929828643799, 0.06354374885559082, 0.2837244749069214, 0.24390978813171388]
    google/vit-large-patch16-224:
    - batch_size: 8
      dtype: torch.float32
      time_s: [0.5827305793762207, 0.11447184085845948, 0.4874413251876831, 0.43192110061645506,
        0.45414621829986573, 0.11448442935943604, 0.488907790184021, 0.43297643661499025,
        0.45159590244293213, 0.11437134742736817, 0.4854918956756592, 0.434900164604187,
        0.44939398765563965, 0.11359941959381104, 0.48420276641845705, 0.4322864770889282,
        0.4463037014007568, 0.1139305591583252, 0.4835365295410156, 0.43552446365356445,
        0.45094101428985595, 0.11440064907073974, 0.4878371238708496, 0.43801352977752683,
        0.4539456367492676, 0.11601903438568115, 0.4887869834899902, 0.43662848472595217,
        0.4527653694152832, 0.11409034729003906, 0.48731725215911864, 0.4371277570724487,
        0.4497154474258423, 0.11410114765167237, 0.4903388500213623, 0.4329125165939331,
        0.4473557472229004, 0.11284692287445068, 0.48941740989685056, 0.4334478139877319,
        0.4493544578552246, 0.11308140754699707, 0.4854674577713013, 0.43252930641174314,
        0.4533524751663208, 0.11461765766143799, 0.4840975046157837, 0.4328861474990845,
        0.4519708871841431, 0.11398425102233886, 0.48682806491851804, 0.43622119426727296,
        0.45392632484436035, 0.114825439453125, 0.4881617546081543, 0.4374098777770996,
        0.4514464855194092, 0.11506931781768799, 0.4865990161895752, 0.43652758598327634,
        0.4544330358505249, 0.11377334594726562, 0.48940234184265136, 0.4335435390472412,
        0.4524759531021118, 0.11278202533721923, 0.48830137252807615, 0.43258821964263916,
        0.4586972713470459, 0.11466395854949951, 0.4877305030822754, 0.4355809211730957,
        0.448706316947937, 0.11413455009460449, 0.48580987453460694, 0.43746769428253174,
        0.45195415019989016, 0.11435890197753906, 0.4878537654876709, 0.4340193748474121,
        0.4511066436767578, 0.11468629837036133, 0.4865649461746216, 0.43074941635131836,
        0.4491808176040649, 0.11199946403503418, 0.4876744747161865, 0.43846864700317384,
        0.4477973461151123, 0.114066743850708, 0.486585807800293, 0.43675496578216555,
        0.4586408376693726, 0.1126516342163086, 0.48655402660369873, 0.43992033004760744]
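
To show how the device type and model files relate, the sketch below checks the `time_s` length constraint and combines per-layer values for a candidate range of layers. All function names are hypothetical. The transfer-time estimate is an illustrative assumption (bits sent divided by link bandwidth) and is not necessarily the cost model that sched-pipeline uses. An empty `model_profiles:` key, as in `DummyDeviceType`, is loaded by YAML parsers as a null value, so the code treats `None` as an empty map.

```python
def check_profiles(device_types, models):
    """Check that each time_s list matches the model's layer count."""
    errors = []
    for dev_name, dev in device_types.items():
        profiles = dev.get("model_profiles") or {}  # tolerate a null value
        for model_name, entries in profiles.items():
            if model_name not in models:
                errors.append(f"{dev_name}: unknown model {model_name!r}")
                continue
            layers = models[model_name]["layers"]
            for entry in entries:
                n = len(entry["time_s"])
                if n != layers:
                    errors.append(
                        f"{dev_name}/{model_name}: time_s has {n} entries, "
                        f"expected {layers}"
                    )
    return errors


def stage_cost(model, profile, start, end):
    """Total memory (MB) and compute time (s) for layers start..end.

    Layers are 1-indexed and the range is inclusive (an assumption that
    mirrors the example schedule output).
    """
    mem_mb = sum(model["mem_MB"][start - 1:end])
    time_s = sum(profile["time_s"][start - 1:end])
    return mem_mb, time_s


def transfer_time_s(parameters, batch_size, bytes_per_param, bw_mbps):
    """Rough time to send one microbatch of layer output over a link."""
    bits = parameters * batch_size * bytes_per_param * 8
    return bits / (bw_mbps * 1e6)  # 1 Mbps = 1e6 bits per second


# Hypothetical data, kept small for readability.
models = {
    "TinyModel": {"layers": 3, "mem_MB": [10.0, 20.0, 30.0],
                  "parameters_in": 4, "parameters_out": [4, 4, 2]},
}
device_types = {
    "DummyDeviceType": {"bw_Mbps": 0, "mem_MB": 0, "model_profiles": None},
    "TypeA": {"bw_Mbps": 1000, "mem_MB": 64, "model_profiles": {
        "TinyModel": [{"batch_size": 8, "dtype": "torch.float32",
                       "time_s": [0.5, 0.25, 0.25]}],
    }},
}

print(check_profiles(device_types, models))
profile = device_types["TypeA"]["model_profiles"]["TinyModel"][0]
mem_mb, time_s = stage_cost(models["TinyModel"], profile, 2, 3)
print(mem_mb <= device_types["TypeA"]["mem_MB"], time_s)
# 4 bytes per parameter for torch.float32.
print(transfer_time_s(151296, 8, 4, 1000))
```

A candidate stage is only feasible on a device type if its summed `mem_MB` fits within the device's `mem_MB`, which is why the sketch compares the two.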

Available Devices

Default name: devices.yml.

This file is a mapping of device type names to lists of hosts. Each unique device type name entry is a list of hosts (e.g., host names or IP addresses that the runtime can resolve).

For example:

DummyDeviceType: []
Minnowboard-E3845:
- mb-0
- mb-1
- mb-2
- mb-3
RCC-VE-C2000:
- rcc-0
- rcc-1
- rcc-2
- rcc-3
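
Because this file refers to device types by name, a quick cross-check against the device types file can catch mismatched names before scheduling. The following is a minimal sketch with a hypothetical helper, `hosts_by_type`. It assumes every device type name listed here should also appear in the device types file and that a host belongs to only one type. The type and host names in the usage lines are made up.

```python
def hosts_by_type(devices, device_types):
    """Build a host -> device type map and collect consistency errors."""
    host_types = {}
    errors = []
    for type_name, hosts in devices.items():
        if type_name not in device_types:
            errors.append(f"device type {type_name!r} has no profile entry")
            continue  # skip hosts whose type is unknown
        for host in hosts or []:  # tolerate an empty or null host list
            if host in host_types:
                errors.append(f"host {host!r} is listed more than once")
            host_types[host] = type_name
    return host_types, errors


# Hypothetical inputs: "TypeB" is missing from the device types file.
devices = {"DummyDeviceType": [], "TypeA": ["a-0", "a-1"], "TypeB": ["b-0"]}
device_types = {"DummyDeviceType": {}, "TypeA": {}}

host_types, errors = hosts_by_type(devices, device_types)
print(host_types)
print(errors)
```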

Development Notes

You may build the scheduler application manually, e.g., to experiment with it in isolation, but it will not be found by the runtime scheduler until the Python package is rebuilt/reinstalled.

To build manually:

mkdir src-native/build
cd src-native/build
cmake ..
cmake --build .