
Fix "best effort" hf sharding now that we have fancy meshes #622

Merged 4 commits into main from fix_hf_loading on Jun 11, 2024

Conversation

@dlwh (Member) commented Jun 11, 2024

Fixes #609 I think

@dlwh (Member, Author) commented Jun 11, 2024

got a report it's working. @versae lemme know if it still gives you problems

@dlwh merged commit 7bdd375 into main on Jun 11, 2024 (5 checks passed)
@dlwh deleted the fix_hf_loading branch on June 11, 2024 at 06:59
@rjpower (Collaborator) commented Jun 11, 2024

Dumb question, but why does this fix things? The original error was happening here:

lev_model = load_from_state_dict(state_dict)

Is it this line: haliax.partitioning._get_mesh() -- we now pick up the default mesh from the parent context and use that instead of inferring a sharding?

(I know a lot about meshes but almost nothing about JAX meshes, so I was a bit confused about why it threw an error originally instead of just (maybe) emitting a warning and reshuffling the naively sharded data into the final form. I'm guessing either it wants an explicit copy between meshes, or there's some "lower-level" mesh where it no longer has the information to reshard.)
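
For concreteness, a minimal plain-JAX sketch (not Haliax code) of what "picking up the default mesh from the parent context" means; the device count and axis name here are illustrative:

import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a mesh over whatever devices are available (e.g. 8 CPU devices in tests).
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

with mesh:
    # Inside this block the mesh is the active "context" mesh. A helper like
    # haliax.partitioning._get_mesh() can return it here instead of building a
    # fresh one, so shardings created while loading a checkpoint refer to the
    # same mesh the later jitted computation sees.
    sharding = NamedSharding(mesh, P("data"))
    x = jax.device_put(np.arange(len(devices), dtype=np.float32), sharding)

print(x.sharding)  # a NamedSharding over the context mesh
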

@dlwh (Member, Author) commented Jun 11, 2024

So... it's the _get_mesh.

It's kind of working around a problem in Haliax (which I'm now pretty sure is working around a problem in JAX) more than anything. named_jit takes three optional axis mapping arguments (input, output, context/compute) and expects a context mesh (I should probably make it take a mesh arg). https://github.com/stanford-crfm/haliax/blob/main/src/haliax/partitioning.py#L312-L327 . This is partly historical, from the pre-jax.Array era, when arrays didn't know their shardings.
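
To make the three-mapping idea concrete, here is a rough plain-JAX analog (this is not named_jit's actual signature; the axis names and the axis_mapping dict are made up for illustration): input and output shardings are built by translating logical axis names through an axis mapping against the mesh.

from functools import partial

import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), ("data",))
axis_mapping = {"batch": "data"}  # logical axis name -> physical mesh axis

def sharding_for(*logical_axes):
    # Translate logical axis names via the mapping; unmapped axes are replicated.
    return NamedSharding(mesh, P(*(axis_mapping.get(a) for a in logical_axes)))

# The input/output mappings become in_shardings/out_shardings; the
# "context/compute" mapping would drive sharding constraints inside the body.
@partial(jax.jit,
         in_shardings=sharding_for("batch"),
         out_shardings=sharding_for("batch"))
def step(x):
    return 2 * x

y = step(np.ones(len(jax.devices()), dtype=np.float32))
print(y.sharding)
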

Now that they do, it ought to be the case that if the input mapping isn't specified, named_jit should just omit the input shardings. It should actually further be the case that we don't use input_axis_mapping at all and just always preserve shardings. However, whenever I try to make that change, CPU tests fail when I use XLA_FLAGS=--xla_force_host_platform_device_count=8, so I never pulled the trigger. I realized the other day this is probably a bug in JAX, since xla_force_host_platform_device_count is kind of an afterthought for debugging.
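
A small sketch of that point, assuming current jax.Array semantics (the names below are illustrative): when no in_shardings are given, jit respects the sharding a committed input already carries, which is what would let named_jit drop the explicit input mapping.

import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), ("data",))
x = jax.device_put(np.ones(len(jax.devices()), dtype=np.float32),
                   NamedSharding(mesh, P("data")))

@jax.jit  # note: no in_shardings / out_shardings
def double(a):
    return 2 * a

y = double(x)
print(y.sharding)  # typically matches x.sharding: the input's layout was preserved
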

So what was happening is that we were using a different mesh than the "real" one, and then down the road telling jit that the shardings used the "real" mesh. This works around that by ensuring it's the same mesh... Gross, but it fixes things for three users and doesn't cause too much damage.

I'll see if I can do the real fix and just file a bug on the xla_force_host_platform_device_count thing.
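
For reference, the CPU debugging setup mentioned above looks roughly like this (the flag must be set before JAX is first imported):

import os

# Pretend the single host CPU is several devices so sharding code paths can be
# exercised in CPU-only tests. Must run before the first `import jax`.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import jax
print(jax.devices())  # eight CPU devices
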

@rjpower (Collaborator) commented Jun 11, 2024

Ah interesting -- that makes sense -- thanks for the fix and explanation! I agree, the XLA CPU situation is always a bit of a gamble: it's great that it's there, but it definitely doesn't have the same functionality as the GPU/TPU side.

Successfully merging this pull request may close these issues: Incompatible devices for jitted computation (#609)