
[Bug Report] RuntimeError when running instruction fine-tuning on mistral 7b, Sagemaker Jumpstart #4649

Open
louishourcade opened this issue May 3, 2024 · 2 comments


louishourcade commented May 3, 2024

Link to the notebook
https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/mistral-7b-instruction-domain-adaptation-finetuning.ipynb

Describe the bug
I get an error when I run the training step for instruction fine-tuning in this notebook. The training job starts properly, but after about 10 minutes it fails with:

ErrorMessage "raise RuntimeError( RuntimeError: Could not find response key [1, 32002] in token IDs tensor([ 1, 20811, 349, ..., 302, 15637, 266])
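For reference, this message matches the error raised by TRL's DataCollatorForCompletionOnlyLM when the tokenized response template cannot be found inside a tokenized training example, and the [1, 32002] key includes Mistral's BOS token (id 1), which never occurs mid-sequence. Below is a minimal sketch of the suspected mechanism and the commonly reported workaround, assuming the JumpStart training script uses that collator; the "### Response:" template string and model id are stand-ins, not taken from the container:

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

# Assumptions (not verified against the JumpStart container): the training
# script masks the prompt with TRL's DataCollatorForCompletionOnlyLM, and the
# real response template is whatever tokenizes to id 32002 in the extended
# vocabulary; "### Response:" is only a stand-in for illustration.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
response_template = "### Response:"  # hypothetical template string

# Encoding the template on its own prepends BOS (id 1), so the collator
# searches for a sequence like [1, 32002] that never appears inside a packed
# training example -- consistent with the RuntimeError in the logs.
template_ids_standalone = tokenizer.encode(response_template)

# Commonly reported workaround: pass the template as token IDs encoded the way
# they appear mid-sequence (no BOS, no special tokens).
template_ids_in_context = tokenizer.encode(response_template, add_special_tokens=False)
collator = DataCollatorForCompletionOnlyLM(template_ids_in_context, tokenizer=tokenizer)
```

In the managed JumpStart job the training script lives inside the container, so this is only a sketch of where the mismatch likely comes from, not a drop-in fix.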

To reproduce

  • Upload the notebook to a SageMaker notebook instance
  • Run every cell; the error appears when running the instruction fine-tuning training job (section 1.3, Starting Training)

Logs
Attaching screenshots of the training job logs:

[Screenshot 2024-05-03 at 16 45 43]

[Screenshot 2024-05-03 at 16 47 10]

Any idea how to fix this?

@louishourcade louishourcade changed the title [Bug Report] [Bug Report] RuntimeError when running instruction fine-tuning on mistral 7b, Sagemaker Jumpstart May 3, 2024
@prakash5801

@louishourcade: I'm facing the same issue while running the example notebook from AWS. Did you find a solution?

@louishourcade (Author)

Hi @prakash5801, no, I haven't had time to investigate further. But I checked yesterday and the error is still there.
