-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add recover model from checkpoint first if it exists, store is local #376
base: main
Are you sure you want to change the base?
add recover model from checkpoint first if it exists, store is local #376
Conversation
1d0091f
to
61159bf
Compare
🎉 Successfully Build Images. Overview: https://finops.coding.net/public-artifacts/gocrane/crane/packages
|
// 1. predictor checkpoints all metric namers together periodically by a independent routine. but this will not guarantee the checkpoint data is consistent with the latest updated model in memory | ||
// 2. predictor checkpoints the metric namer each time after model is updating, so the checkpoint is always latest. for example, the percentile to do checkpoint after add sample for each metric namer. | ||
// 3. application caller such as evpa triggers the metric namer to do checkpoint. delegate the trigger to application caller | ||
type Checkpointer interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move directory checkpoint into somewhere deeper, for example, in evpa's rootdirr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move directory checkpoint into somewhere deeper, for example, in evpa's rootdirr
I think check pointer is a common lib, not only for evpa. other model can be checkpointed too if it supports. so i put it in top level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice for a common checkpoint ability, So for a common lib, maybe we need to change interface function to a general way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the motivation of checkpoint should be clarified much clear; the current supported DSP algorithm actions in an off-line manner, i.e., no explicit training process; for those models having explicitly distinguished training and reasoning processes, checkpoint is much useful;
switch cpStoreType { | ||
case checkpoint.StoreTypeLocal: | ||
checkpointer, err = checkpoint.InitCheckpointer(checkpoint.StoreType(opts.CheckpointerStore), opts.CheckpointerLocalConfig) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the InitCheckpointer be called exactly once? e.g., by sync.Once.Do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the InitCheckpointer be called exactly once? e.g., by sync.Once.Do
it is ok to call InitCheckpointer more than once, becasue the factory pattern, there is a map to store each storage checkpoint. it will return if it already exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
var _ Checkpointer = &Local{} | ||
|
||
// Local use local filesystem as checkpoint storage, so if you use Local, the craned need some persistent volumes such as cbs as storage to keep the states | ||
type Local struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just take the storage type as a configurable parameter? by that way, we can reuse the code among different types of storage, including the Start method below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just take the storage type as a configurable parameter? by that way, we can reuse the code among different types of storage, including the Start method below.
Yes, a high level abstraction is abstract a checkpoint manager framework for different storages, user can implement different storage if they want.
But now the checkpoint structure is not stable and is still evolving. and some algorithms do not support the model checkpoint. So we just iterate a little to support some urgent needs of users.
So now just implement different storage backed checkpoint. If you want, we can discuss more in crane meeting or wechat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #167
Special notes for your reviewer: