How to resume training after the machine loses power partway through a run #66

Open
GromZhang opened this issue Apr 2, 2024 · 2 comments

Comments

@GromZhang

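As the title says: during pre-training, the server I was using failed unexpectedly. How can I resume pre-training from where it stopped? Any pointers would be appreciated.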

@PshySimon

PshySimon commented Apr 6, 2024

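Save a checkpoint every fixed number of training steps, covering both the model parameters and the optimizer state. PyTorch provides torch.save(model.state_dict(), path) for writing these parameters to disk and model.load_state_dict(torch.load(path)) for restoring them; the optimizer exposes the same state_dict() / load_state_dict() pair.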
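A minimal sketch of that save-and-resume loop, assuming a toy nn.Linear model and an AdamW optimizer in place of the real pre-training setup. The helper names (save_checkpoint, load_checkpoint, train), the checkpoint keys, and the file name are hypothetical and not taken from this repository. It does not restore the data loader position, the learning-rate scheduler, or the random number generator state, which a real run would probably also want to save.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, step):
    """Write model weights, optimizer state, and the last finished step."""
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    # Write to a temporary file and rename it afterwards, so that a power
    # loss during the write cannot leave a truncated file at `path`.
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)


def load_checkpoint(path, model, optimizer):
    """Restore state in place and return the next step to run (0 if none)."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1


def train(path, total_steps, save_every):
    """Run a toy training loop; return the step it started from."""
    torch.manual_seed(0)
    model = nn.Linear(4, 1)  # stand-in for the real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
    start = load_checkpoint(path, model, optimizer)
    for step in range(start, total_steps):
        x = torch.randn(8, 4)  # stand-in for a real training batch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % save_every == 0:
            save_checkpoint(path, model, optimizer, step)
    return start


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
    print("first run started at step", train(ckpt, 5, 2))
    print("second run started at step", train(ckpt, 10, 2))
```

Calling train again with the same checkpoint path after an interruption picks up from the step after the last saved one, so at most save_every steps of work are repeated.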

@wdndev

wdndev commented May 1, 2024

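You can take a look at this project. It trains with the transformers library and supports resuming from checkpoints, as well as optimization techniques such as ZeRO.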
https://github.com/wdndev/tiny-llm-zh
