Partially-sharded-data-parallel #588

blahBlahhhJ · 2024-05-16T03:49:35Z

An explanation:

Here I will ignore the difference between DP and FSDP and assume there're only 2 axis (DP, TP) because I handled all DP, FSDP axis together DP_FSDP_index = DP_index * FSDP_size + FSDP_index

Suppose we have this mesh of DP=n, TP=4: (d -> device. p-> process)

         TP0    TP1     TP2      TP3
DP0    (d0 p0) (d1 p0) (d4 p1) (d5 p1)
DP1    (d8 p2) (d9 p2) (d12 p3) (d13 p3)
  ... a bunch of other DP groups in between so that devices are non-contiguous
DPk     (d2 p0) (d3 p0) (d6 p1) (d7 p1)

The 4 devices (e.g. d0d1d4d5) in every rows gets the same data and performs TP (shards the model). Each column receives different data.

We first take care of each process individually and ensure

devices in the same DP group (with different TP index) receives the same data
devices in different DP groups receive different data

This is handled by local_device_mapping().

From each device, we can extract its DP index by its position in the mesh and map it to a uid. In this example, we will have a mapping of

p0: {0: 0, k: 1}
p1: {0: 0, k: 1}
p2: {1: 0, ...}
p3: {1: 0, ...}

When we call make_array_from_callback() with the mesh, each device gets a slice of size per_device_batch_size and thus the slice's start will be global_begin = DP_index * per_device_batch_size. Thus we can extract DP_index from the slice

DP_index = global_begin // per_device_batch_size

See batch_callback() in loader.py, devices will get

local_batch[uid*per_device_batch_size : (uid+1)*per_device_batch_size]

For p0, this means devices with DP index 0 (d0&d1) will get the first half of local_batch, and devices with DP index k (d2&d3) will get the second half. This satisfied the two bullet points above.

Next, we take care of the process-level stuff and ensure

different processes' devices that are in the same DP group (different TP index) receive the same data.
different processes' devices that are in different DP groups receive different data.

This is handled by process_mesh_mapping().

Each process:

maps to its upper left device in the mesh
maps to the location of the device
maps to the location of the device with TP0 in the same DP group
maps to the device in that TP0 location
maps to the process of that device
maps to a uid

p0 -> d0 -> (DP0 TP0) -> (DP0 TP0) -> d0 -> p0 -> 0
p1 -> d4 -> (DP0 TP2) -> (DP0 TP0) -> d0 -> p0 -> 0
p2 -> d8 -> (DP1 TP0) -> (DP1 TP0) -> d8 -> p2 -> 1
p3 -> d12-> (DP1 TP2) -> (DP1 TP0) -> d8 -> p2 -> 1

Thus we get this mapping

{p0: 0, p1: 0, p2: 1, p3: 1}

The uid becomes the shard_idx of the dataloader. Thus, p0 and p1 as processes will get the same shard_idx because and will get the same local_batch.

When looking at device-level, we already ensure that d0&d1 gets the first half, d2&d3 gets the second half. This is also true for p1: d4&d5 gets the first half and d6&d7 gets the second half. Now because p0 and p1 have the same shard_idx, we further ensure that d0&d1&d4&d5 gets the same data, d2&d3&d6&d7 gets the same data.

Therefore, these 2 mapping function works for any DP/TP configuration :)

README.md

docs/Configuration-Guide.md

src/levanter/data/loader.py

dlwh

I'm confused. Can you explain (as a comment) why the data uid stuff is correct?

src/levanter/trainer.py

src/levanter/data/loader.py

dlwh · 2024-05-27T04:26:30Z

src/levanter/data/loader.py

        self.item_dataset = local_dataset.shard(process_data_pos, num_data_process_groups)
        super().__init__(max_capacity, axis_resources)

    def _produce_batches(self) -> Iterator[PyTree]:
        one_item_generator = non_caching_cycle(self.item_dataset)
        batched = _batched(one_item_generator, self.local_batch_size)

+        def batch_callback(global_begin, _):
+            # global_begin is uid for DP/FSDP
+            # DP_id * per_device_bs = global_begin


could you add a note explaining why this is correct (also so I can be sure I understand)

updated the PR description.

src/levanter/mesh.py

dlwh · 2024-05-27T04:29:30Z

src/levanter/mesh.py

-    If we envision each process as a subgrid of the mesh for its devices, then there is a process grid that
-    is a coarsened version of the mesh. This is the size of the process grid.
+    Handles the case when different processes share the same data in TP.
+    If we envision each process as a subgrid of the mesh for its devices, this is the position of the process


i don't actually understand how process_mesh is still a valid abstraction/idea in the world with "non-contiguous" devices meshes

updated the PR description.

dlwh · 2024-05-27T04:30:32Z

also can you merge main so that the TPU tests run

src/levanter/data/loader.py

dlwh

ok i think i'm convinced!

blahBlahhhJ added 19 commits May 11, 2024 20:32

refactor parallelism

79815c0

small fix

69ce2e7

add config

e1d1b51

small fix

0112e87

small fix

d119af4

small fix

c0e25fd

allow splitting logical axes

d71e9ca

flatten splash attention axes

54f407e

add print deebug

e0df652

remove debug

ea86cf7

simple mesh

61b70fd

revert auto mesh

2ecf8c6

patch for non-TP

71f7d5b

remove print

8ecd091

small fix

67cf17f

patch process groups size as well

e2f7794

small fix

ec7e41c

tries simple fix

07a8b17

revert

4d61503

dlwh reviewed May 16, 2024

View reviewed changes

README.md Outdated Show resolved Hide resolved

docs/Configuration-Guide.md Outdated Show resolved Hide resolved

src/levanter/data/loader.py Outdated Show resolved Hide resolved

blahBlahhhJ force-pushed the psdp branch from 1812f7c to 4d61503 Compare May 17, 2024 22:06

blahBlahhhJ added 6 commits May 17, 2024 15:14

use normal mesh for sharding data loader

89d18c8

bump haliax version

803df3d

fix shard loader logic

02a1ed5

small fix

479d318

small fix

019d516

small fix

c8d48f5

blahBlahhhJ force-pushed the psdp branch from 96fd096 to c8d48f5 Compare May 23, 2024 21:57

blahBlahhhJ added 2 commits May 23, 2024 15:14

small fix

642a377

add print

8f9c884

blahBlahhhJ added 14 commits May 23, 2024 16:11

small fix

27eb7fa

remove print

2a459e7

add process mapping

5db35c3

small fix

6536c54

small fix

abcd5d2

small fix

3313002

small fix

c34a573

small fix

2331ae8

small fix

8b7b804

rename model axis size

130c212

clean up

68ed94e

clean up

3bdf5ba

modify tests

f7dfc64

clean up

3787610

blahBlahhhJ marked this pull request as ready for review May 26, 2024 20:41

blahBlahhhJ requested a review from dlwh May 26, 2024 20:41

dlwh reviewed May 27, 2024

View reviewed changes

src/levanter/data/loader.py Outdated Show resolved Hide resolved

blahBlahhhJ added 3 commits May 26, 2024 22:18

clean up

24fbf37

Merge branch 'main' into psdp

4a63e09

clean up

843f5a9

dlwh approved these changes May 28, 2024

View reviewed changes

dlwh merged commit ed3c6f1 into main May 28, 2024
5 checks passed

dlwh deleted the psdp branch May 28, 2024 06:29

dlwh mentioned this pull request May 28, 2024

Ensure microbatching works with Partially Sharded Data Parallel #325

Open

rjpower pushed a commit to rjpower/levanter that referenced this pull request May 29, 2024

Partially-sharded-data-parallel (stanford-crfm#588)

c190339

Ivan-Zhou pushed a commit that referenced this pull request May 29, 2024

Partially-sharded-data-parallel (#588)

6029b29

dlwh mentioned this pull request Jun 1, 2024

Incompatible devices for jitted computation #609

Closed

dlwh mentioned this pull request Jun 12, 2024

FSDP with an odd/weird number of GPUs. #429

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partially-sharded-data-parallel #588

Partially-sharded-data-parallel #588

blahBlahhhJ commented May 16, 2024 •

edited

Loading

dlwh left a comment

dlwh May 27, 2024

blahBlahhhJ May 27, 2024

dlwh May 27, 2024

blahBlahhhJ May 27, 2024

dlwh commented May 27, 2024

dlwh left a comment

Partially-sharded-data-parallel #588

Partially-sharded-data-parallel #588

Conversation

blahBlahhhJ commented May 16, 2024 • edited Loading

dlwh left a comment

Choose a reason for hiding this comment

dlwh May 27, 2024

Choose a reason for hiding this comment

blahBlahhhJ May 27, 2024

Choose a reason for hiding this comment

dlwh May 27, 2024

Choose a reason for hiding this comment

blahBlahhhJ May 27, 2024

Choose a reason for hiding this comment

dlwh commented May 27, 2024

dlwh left a comment

Choose a reason for hiding this comment

blahBlahhhJ commented May 16, 2024 •

edited

Loading