For such cases, it is recommended that both iterators (nested and outer) be wrapped by flor.loop, since the nested flor.loop can act as a Skip Block on replay, skipping the GPU-intensive computations and loading the state from checkpoint.

However, as FlorDB grows to span more applications than model training (e.g. data ingestion, featurization, feedback integrations, etc.), we will encounter cases with a single main loop that may nevertheless require checkpointing across iterations. For example:
```python
class Aggregator:
    def __init__(self):
        self.state = []

    def update(self, data_chunk):
        # Perform some processing and update the state
        processed_data = complex_transformation(data_chunk)
        self.state.append(processed_data)

    def get_state(self):
        return self.state


with flor.checkpointing(aggregator=Aggregator()) as chk_set:
    for chunk_id, data_chunk in flor.loop("chunk", enumerate(data_chunks)):
        flor.log("chunk_id", chunk_id)
        # Update the object with the new chunk
        chk_set.aggregator.update(data_chunk)
        # Log the state of the object for checkpointing
        flor.log("status", "complete")

# Finalize the processing with the modified object
final_result = validate_ingestion(chk_set.aggregator)
```
In PyTorch, by contrast, the model training loop is doubly-nested: a loop traversing the data batch by batch is nested inside an epoch loop. That is the case where wrapping both iterators (nested and outer) in flor.loop is recommended, so that the nested flor.loop can act as a Skip Block on replay.
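A minimal sketch of that doubly-nested pattern follows. It uses an illustration-only stand-in for flor (so the sketch runs without the library installed); with the real FlorDB package, flor.loop additionally records iterator state so the nested loop can be skipped on replay, and the training body would contain the usual forward/backward/step calls.

```python
# Illustration-only stand-in for the flor package, not the real API's
# behavior: it preserves the loop structure but does no checkpointing.
class flor:
    @staticmethod
    def loop(name, iterable):
        # Real flor.loop wraps the iterator for checkpoint/replay;
        # the stand-in just yields the items unchanged.
        return iter(iterable)

    @staticmethod
    def log(name, value):
        # Real flor.log records `value` under `name`; the stand-in
        # simply returns it.
        return value


losses = []
for epoch in flor.loop("epoch", range(2)):                    # outer (epoch) loop
    for step, x in flor.loop("step", enumerate([1.0, 2.0])):  # nested (batch) loop
        # In real training, the forward/backward pass and optimizer
        # step would go here; we log a toy loss value instead.
        losses.append(flor.log("loss", x * (epoch + 1)))

print(losses)  # [1.0, 2.0, 2.0, 4.0]
```

On replay, the nested flor.loop is the unit that gets skipped when a checkpoint for that epoch already exists.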
Thanks to @xllgit for identifying this issue.