feat: add a Colab notebook as TPU playground #24

Draft · tianjianjiang wants to merge 6 commits into master from feat-add_colab_tpu_playground
Conversation

@tianjianjiang (Collaborator) commented on Sep 3, 2021

Known Issues

I haven't really started debugging the issues below.

CPU/GPU

This is caused by a breaking change in torch 1.9.0, so downgrading to torch 1.8.1 (or perhaps 1.8.2) is necessary.
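A minimal sketch of the downgrade cell (the plain `torch==1.8.1` pin is what the note above implies; any companion pins such as torchvision are deliberately left out):

```python
# Colab cell: pin torch below 1.9.0 before installing the project.
# 1.8.2 may also work, per the note above.
!pip install "torch==1.8.1"
```

With torch 1.9.0, training fails like this: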

```
training:   0% 0/50 [00:00<?, ?it/s]wandb: W&B syncing is set to `offline` in this directory.  Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Traceback (most recent call last):
  File "bsmetadata/train.py", line 195, in main
    loss = loss_fn(batch, outputs, metadata_mask)
  File "bsmetadata/train.py", line 83, in loss_fn
    b = outputs.logits.size(0)
AttributeError: 'NoneType' object has no attribute 'size'
```
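The crash happens because `outputs.logits` arrives as `None`, so the `.size(0)` call raises. A hedged guard, purely as an illustration (this helper is my assumption, not code from the PR):

```python
def checked_logits(outputs):
    # Under torch 1.9.0 the model output's `logits` attribute came back as
    # None (see the traceback above); fail with a clearer message instead
    # of an AttributeError deep inside loss_fn.
    logits = getattr(outputs, "logits", None)
    if logits is None:
        raise RuntimeError(
            "Model returned no logits; torch 1.8.1 (or 1.8.2) is known to work."
        )
    return logits
```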

TPU

I haven't figured this out yet, but it doesn't seem to break the outcome (as synced to wandb).

```
eval: 100%|███████████████████████████████████████| 7/7 [00:05<00:00,  1.30it/s]
tcmalloc: large alloc 1099718656 bytes == 0x55f2c54e8000 @  0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17a8b635e 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17aa5b484 0x55f17a9c769c 0x55f17a9c620a 0x55f17a9c691e 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e 0x55f17a982c35 0x55f17a91573a
tcmalloc: large alloc 1374650368 bytes == 0x55f307626000 @  0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17aa5aa71 0x55f17aa5b5a2 0x55f17a9cd423 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17a9c691e 0x55f17a9c68d1 0x55f17a9c68d1 0x55f17a9c6968 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e
tcmalloc: large alloc 1718312960 bytes == 0x55f35951e000 @  0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17a8b635e 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17a9c6968 0x55f17a9c68d1 0x55f17a9c68d1 0x55f17a9c6968 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e 0x55f17a982c35 0x55f17a91573a
tcmalloc: large alloc 2147893248 bytes == 0x55f3bfbd4000 @  0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17a8b635e 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17a9c6968 0x55f17a9c68d1 0x55f17a9c68d1 0x55f17a9c6968 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e 0x55f17a982c35 0x55f17a91573a
tcmalloc: large alloc 2684870656 bytes == 0x55f43fc38000 @  0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17a8b635e 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17a9c6968 0x55f17a9c69b1 0x55f17a9c68d1 0x55f17a9c6968 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e 0x55f17a982c35 0x55f17a91573a
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 41, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 384, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 142, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'bsmetadata/train.py', 'max_train_steps=50', 'num_eval=1', 'data_config.experiment=without_metadata', 'data_config.per_device_eval_batch_size=4', 'data_config.train_file=/content/drive/MyDrive/colab_data/bigscience/cc_news.jsonl', 'data_config.validation_split_percentage=1']' died with <Signals.SIGKILL: 9>.
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
  len(cache))
```
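The steadily growing tcmalloc allocations followed by SIGKILL suggest the host ran out of RAM during eval. A first thing to try, purely as an assumption (lowering the eval batch size from the 4 used above, with the rest of the launch command kept verbatim from the traceback):

```python
# Colab cell: same launch as above, with per_device_eval_batch_size lowered
# from 4 to 1 (an untested guess at reducing peak host memory).
cmd = (
    "accelerate launch bsmetadata/train.py "
    "max_train_steps=50 num_eval=1 "
    "data_config.experiment=without_metadata "
    "data_config.per_device_eval_batch_size=1 "
    "data_config.train_file=/content/drive/MyDrive/colab_data/bigscience/cc_news.jsonl "
    "data_config.validation_split_percentage=1"
)
!{cmd}
```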

Two side notes:
1. Python 3.7.11 matches Colab's current runtime;
2. Poetry is optional for managing the venv and dependencies, but syncing
   with requirements(-dev).txt must be done manually for the time being
   (see the sketch below).
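A hedged sketch of that manual sync, assuming `poetry export` from a Poetry 1.1.x CLI (verify the flags against the installed version):

```python
# Colab cells: regenerate the requirements files from pyproject.toml,
# once without and once with the dev dependencies.
!poetry export -f requirements.txt -o requirements.txt --without-hashes
!poetry export -f requirements.txt -o requirements-dev.txt --without-hashes --dev
```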
@tianjianjiang force-pushed the feat-add_colab_tpu_playground branch 3 times, most recently from e72b724 to 2cb0de9 on September 8, 2021 at 22:49
Labels: enhancement (New feature or request)