Experiments and Results

The following is mostly outdated and should be recycled/updated:

Model

  • Basically GPT, but with slightly different attention/loss mask (see agenda above)
  • Targeting 175-200B parameters, sequence length 512 or 1024, Adam optimizer
  • Note: our nodes have 128 GB of total GPU memory each -> we need sharding (the model weights alone are 350 GB in fp16!); see the back-of-envelope sketch after this list
  • Note 2: may need to adjust expectations based on estimated training time (pending)
  • Subsequent runs: consider an encoder-decoder architecture? Longer sequences? Adafactor?
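
A quick back-of-envelope check of the memory note above (a sketch: the per-parameter byte counts assume fp16 weights/gradients plus fp32 Adam state, the usual mixed-precision setup, not something measured on our nodes):

```python
# Rough memory estimate for a 175B-parameter GPT trained with Adam in
# mixed precision. All byte counts are standard assumptions, not measurements.
params = 175e9

bytes_per_param = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

total_gb = sum(bytes_per_param.values()) * params / 1e9
print(f"fp16 weights alone: ~{2 * params / 1e9:.0f} GB")   # ~350 GB
print(f"full model state:   ~{total_gb:.0f} GB")            # ~2800 GB

# One JZ node = 4x V100-32GB = 128 GB of GPU memory, so even the fp16
# weights do not fit on a single node -> sharding is unavoidable.
node_gb = 4 * 32
print(f"nodes needed just to hold the model state: ~{total_gb / node_gb:.0f}")
```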

Making it train:

  • Candidates for run1: deepspeed-gpt and megatron-gpt. Save fairscale/t5/custom for later. Use the reference training code as-is instead of integrating it into our own codebase.
  • Only consider training nodes with 4x V100-32GB, because there are too few 8-GPU nodes on JZ.
  • Current strategy: fully sharded data-parallel between nodes, arbitrary black magic within one node (a config sketch follows this list).
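
For reference, a minimal sketch of what the fully sharded part could look like as a DeepSpeed ZeRO stage-3 config; the batch size, learning rate and other values below are placeholders, not the actual run1 settings:

```python
# Hypothetical DeepSpeed config for fully sharded (ZeRO stage 3) training;
# all numeric values are placeholders for illustration.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32,
    "fp16": {"enabled": True, "loss_scale": 0},   # 0 = dynamic loss scaling
    "zero_optimization": {
        "stage": 3,                  # shard params, grads and optimizer state
        "overlap_comm": True,        # overlap communication with backward pass
        "contiguous_gradients": True,
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
# This dict would typically be passed to deepspeed.initialize
# (as a dict, or written out to a JSON config file).
```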

For each framework, we need to measure (hackathon participants + Sidd + Laurel):

  • maximum number of parameters that we can train on 1 node
  • minimum number of nodes needed to fit the 175B-parameter model (num_layers/d_model/d_head from GPT-3 Table 2.1)
  • training throughput, i.e. training tokens per second in both configurations (a timing sketch follows this list)
  • Once we get through the initial debugging, ask the Microsoft/NVIDIA teams to sanity-check our configs for DeepSpeed/Megatron respectively
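
A minimal sketch of the throughput measurement; `train_step` here is a hypothetical no-argument callable that runs one full synchronous optimizer step over the global batch:

```python
import time

def tokens_per_second(train_step, global_batch_size, seq_length,
                      n_steps=20, warmup=5):
    """Time repeated training steps and report throughput in tokens/s.
    Assumes train_step() blocks until the step is finished (otherwise call
    torch.cuda.synchronize() before reading the clock)."""
    for _ in range(warmup):                 # let kernels/allocators settle
        train_step()
    start = time.perf_counter()
    for _ in range(n_steps):
        train_step()
    elapsed = time.perf_counter() - start
    return n_steps * global_batch_size * seq_length / elapsed
```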

Investigate:

  • @Stas measured all-reduce bandwidth between two nodes as ~135 Gbps; he is now investigating whether we can get closer to the theoretical max of 400 Gbps (a rough probe sketch follows this list).
  • Estimate the training time for 175B parameters based on the training throughput in the Megatron/DeepSpeed tech reports, and the time to reach "reasonable performance" (TODO @Yozh and @Max Ryabinin). Sanity-check whether the 8-GPU servers are indeed too few to train in reasonable time (a back-of-envelope estimate follows this list).
  • Figure out how to introduce the semi-autoregressive attention mask (and loss mask) into the DeepSpeed and Megatron-LM training code and configs (looking for volunteers! a possible mask sketch follows this list).
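
A rough sketch of how the inter-node all-reduce bandwidth could be probed from PyTorch (this is not Stas's actual benchmark; it assumes a torchrun/SLURM launch that sets RANK, WORLD_SIZE, MASTER_ADDR and LOCAL_RANK, and it reports payload/time rather than NCCL's bus-bandwidth metric):

```python
import os
import time
import torch
import torch.distributed as dist

def allreduce_bandwidth_gbps(size_mb=512, iters=20):
    """Time repeated all-reduces over a fixed fp16 buffer and return an
    approximate bandwidth in Gbit/s (payload per second, per iteration)."""
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    numel = size_mb * 1024 * 1024 // 2           # fp16 = 2 bytes per element
    buf = torch.empty(numel, dtype=torch.float16, device="cuda")

    for _ in range(5):                           # warm up NCCL
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    gbits_per_iter = size_mb * 8 / 1000          # MB -> Gbit
    return gbits_per_iter * iters / elapsed
```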
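
A back-of-envelope version of the training-time estimate, using the common ~6 * params * tokens FLOPs approximation for decoder-only transformers; the token budget, utilization and GPU count below are assumptions for illustration, not our plan:

```python
# All inputs are illustrative assumptions.
params = 175e9
tokens = 300e9                    # GPT-3-scale token budget (assumption)
flops_needed = 6 * params * tokens

v100_peak_flops = 125e12          # V100 fp16 tensor-core peak, per GPU
utilization = 0.30                # assumed fraction of peak actually achieved
gpus = 128 * 4                    # e.g. 128 nodes with 4x V100 each (assumption)

seconds = flops_needed / (gpus * v100_peak_flops * utilization)
print(f"~{seconds / 86400:.0f} days under these assumptions")
```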
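
A possible starting point for the mask itself, assuming "semi-autoregressive" means a prefix-LM style mask (bidirectional attention over a prefix, causal over the rest); the actual definition we want may differ:

```python
import torch

def semi_autoregressive_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Prefix-LM style mask: the first prefix_len tokens attend to each other
    bidirectionally, the remaining tokens attend causally.
    Returns a [seq_len, seq_len] bool tensor where True = attention allowed."""
    mask = torch.ones(seq_len, seq_len).tril().bool()   # standard causal mask
    mask[:, :prefix_len] = True    # every position may attend to the full prefix
    return mask

mask = semi_autoregressive_mask(seq_len=8, prefix_len=3)
# The loss mask would typically cover only the causally predicted suffix:
loss_mask = torch.arange(8) >= 3
```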

Next steps:

  • After Stas figures out the bandwidth, we need to measure how performance scales with the number of nodes under data-parallel training (a scaling-efficiency sketch follows this list).
  • Ask/measure how many server nodes we can reliably provision on JZ. Find the optimal trade-off between the number of nodes and training time.
  • Do a quick preliminary run with the same configuration but fewer parameters ... and then we can unleash the beast!
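
One way to report the node-scaling measurement is efficiency relative to perfect linear scaling from the single-node throughput (the numbers below are placeholders):

```python
def scaling_efficiency(tokens_per_s_single_node, tokens_per_s_n_nodes, n_nodes):
    """Fraction of perfect linear scaling achieved at n_nodes."""
    return tokens_per_s_n_nodes / (n_nodes * tokens_per_s_single_node)

# Placeholder example: 5,000 tokens/s on one node, 64,000 tokens/s on 16 nodes
print(f"{scaling_efficiency(5_000, 64_000, 16):.0%}")   # -> 80%
```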