
Make sharding_tolerance configurable #1058

Merged: 1 commit merged into main on Dec 10, 2024
Conversation

@Doris26 (Collaborator) commented Nov 21, 2024

Using pure (512 DCN) FSDP triggers the MaxText error "Number of unsharded parameters exceeds tolerance 2% of total parameters."

Make the tolerance a configurable parameter to avoid this error on certain machine setups.
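
For context, here is a minimal sketch of the kind of check behind that error message. It is an illustration only, not MaxText's actual assert_params_sufficiently_sharded implementation, and the argument names are assumptions:

def check_sufficiently_sharded(num_unsharded_params: int,
                               num_total_params: int,
                               tolerance: float = 0.02) -> None:
  """Illustrative stand-in: fail if too large a fraction of parameters is unsharded."""
  unsharded_fraction = num_unsharded_params / num_total_params
  assert unsharded_fraction < tolerance, (
      f"Number of unsharded parameters exceeds tolerance {tolerance:.0%} of total parameters."
  )

# A hard-coded 2% cap can trip large pure-FSDP runs even when the layout is
# intentional; exposing the tolerance in config lets such setups raise the cap.
check_sufficiently_sharded(num_unsharded_params=10, num_total_params=1_000)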

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.


google-cla bot commented Nov 21, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@Doris26 changed the title from "Tolerance configurable" to "Make tolerance configurable" on Nov 21, 2024
@Doris26 changed the title from "Make tolerance configurable" to "Make sharding_tolerance configurable" on Dec 3, 2024

@gobbleturk (Collaborator) left a comment


Thanks for this idea! Some small refactoring:

Ideally we keep train.py as small as possible and have this tolerance configurable in the typical config way, via base.yml.

MaxText/train.py Outdated
@@ -71,6 +71,7 @@
Transformer = models.Transformer
EPS = 1e-8
_DEFAULT_OCDBT_TARGET_DATA_FILE_SIZE = 2 * 1024**3
_DEFAULT_TOLERANCE = 0.02

Can you move this to a field in base.yml named e.g. "sharded_tolerance", defaulting to 0.02?
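
A rough sketch of how that suggestion could surface in code; the field ended up named sharding_tolerance (per the final PR title), and the SimpleNamespace below is only a stand-in for the object pyconfig builds from base.yml:

# Suggested base.yml entry (written as a comment in this Python sketch):
#   sharding_tolerance: 0.02  # max fraction of parameters allowed to stay unsharded
from types import SimpleNamespace

config = SimpleNamespace(sharding_tolerance=0.02)  # stand-in for the pyconfig object

# train.py can then read the value directly instead of defining its own default:
tolerance = config.sharding_tolerance
print(f"sharding_tolerance = {tolerance}")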

MaxText/train.py Outdated
@@ -82,6 +83,8 @@ def validate_train_config(config):
if not config.base_output_directory.startswith("gs://"):
max_logging.log("WARNING: 'base_output_directory' might be pointing your local file system")
assert config.steps > 0, "You must set steps or learning_rate_schedule_steps to a positive integer."
if "tolerance" in config.__dict__ and (config.tolerance > 1.0 or config.tolerance < 0.0):

Can you move this check to pyconfig.py? Ideally we keep train.py as small as possible
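
A minimal sketch of what that validation could look like once it lives in pyconfig.py; the function name and the raw-keys dict are assumptions for illustration, and only the 0.0 to 1.0 range check comes from the diff above:

def validate_sharding_tolerance(raw_keys: dict) -> None:
  """Hypothetical pyconfig-style check: the tolerance is a fraction of total parameters."""
  tolerance = raw_keys.get("sharding_tolerance", 0.02)
  assert 0.0 <= tolerance <= 1.0, (
      "sharding_tolerance should be between 0.0 and 1.0 (a fraction of total parameters)"
  )

# Example usage, with a plain dict standing in for the parsed base.yml keys:
validate_sharding_tolerance({"sharding_tolerance": 0.05})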

MaxText/train.py Outdated
@@ -550,14 +553,15 @@ def setup_train_loop(config):
record_goodput(recorder, config, recorder.record_tpu_init_end_time if recorder else None)
record_goodput(recorder, config, recorder.record_training_preparation_start_time if recorder else None)
data_iterator, eval_data_iterator = create_data_iterator(config, mesh)
tolerance = config.tolerance if "tolerance" in config.__dict__ else _DEFAULT_TOLERANCE

this shouldn't be necessary when tolerance is part of config

MaxText/train.py Outdated

state, _, state_mesh_shardings, data_iterator = max_utils.setup_training_state(
model, data_iterator, tx, config, init_rng, mesh, checkpoint_manager
)

if not config.using_pipeline_parallelism:
# The vocab tensor(s) of shape [vocab, embed] (and transpose) are not sharded by stage
- maxtext_utils.assert_params_sufficiently_sharded(state.params, mesh, tolerance=0.02)
+ maxtext_utils.assert_params_sufficiently_sharded(state.params, mesh, tolerance)

You can use config.sharded_tolerance instead of tolerance here.

@Doris26 requested a review from gobbleturk on December 5, 2024 at 03:43

@gobbleturk (Collaborator) left a comment


Thanks for addressing feedback!

@Doris26 force-pushed the tolerance_configurable branch 3 times, most recently from 737a915 to 614fb0b on December 9, 2024 at 22:10
@Doris26 force-pushed the tolerance_configurable branch from 614fb0b to 2e0ac9d on December 9, 2024 at 22:58
@copybara-service bot merged commit 86d85e4 into main on Dec 10, 2024 (14 checks passed)
@copybara-service bot deleted the tolerance_configurable branch on December 10, 2024 at 00:01