Access to current_training_job_name before .train() #5047

Open
discort opened this issue Feb 18, 2025 · 2 comments
Labels
component: model Relates to SageMaker Model

Comments


discort commented Feb 18, 2025

Describe the feature you'd like
I want to keep the training artifacts and TensorBoard logs for a training job in the same S3 folder.

How would this feature be used? Please describe.
This feature would let me keep my artifacts and TensorBoard logs organized. For instance, I could easily find the logs for a run by its job name.

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute, TensorBoardOutputConfig

image = "<image>"

source_code = SourceCode(
    source_dir="code",
    command="python train.py"
)
compute = Compute(
    instance_count=1,
    instance_type="ml.g5.8xlarge"
)

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path="s3://<default_bucket>/<base_job_name>/<base_job_name-timestamp>/tensorboard",
    local_path="/opt/ml/output/tensorboard",
)
model_trainer = model_trainer.with_tensorboard_output_config(tensorboard_output_config)
model_trainer.train()

Desired layout on S3:

- default_bucket
    - base_job_name
        - base_job_name-<timestamp>
            - artifacts
            - tensorboard

Describe alternatives you've considered
The only alternative that comes to mind is putting a timestamp in base_job_name myself. The drawback is an unpleasant training job name like base_job_name-<my-timestamp>-<generated-timestamp>.
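
To make the drawback concrete, here is a minimal sketch. `append_timestamp` is a hypothetical stand-in for whatever suffixing the SDK applies to base_job_name, and the timestamp format is an assumption, not the SDK's actual one.

```python
from datetime import datetime, timezone


def append_timestamp(base_job_name: str, now: datetime) -> str:
    """Hypothetical stand-in for the SDK's job-name suffixing (format is assumed)."""
    return f"{base_job_name}-{now.strftime('%Y%m%d%H%M%S')}"


# Workaround: bake my own timestamp into base_job_name so the S3 prefix is known up front.
my_time = datetime(2025, 2, 18, 10, 0, 0, tzinfo=timezone.utc)
base_job_name = append_timestamp("my-job", my_time)

# A second, generated suffix is then appended when the job starts.
generated_time = datetime(2025, 2, 18, 10, 0, 5, tzinfo=timezone.utc)
training_job_name = append_timestamp(base_job_name, generated_time)

print(training_job_name)  # my-job-20250218100000-20250218100005
```

The job name ends up carrying two timestamps, which is the part I would like to avoid.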

Additional context

Author

discort commented Feb 18, 2025

cc @benieric

@rsareddy0329 rsareddy0329 added the component: model Relates to SageMaker Model label Feb 28, 2025
Contributor

benieric commented Feb 28, 2025

Hi @discort, I wonder if a better solution would be to make s3_output_path and local_path optional on TensorBoardOutputConfig.

By default, ModelTrainer could set s3_output_path to follow the same contract as the rest of the artifacts, like:

- default_bucket
    - base_job_name
        - base_job_name-<timestamp>
            - artifacts
            - tensorboard

The user could then pass a TensorBoardOutputConfig() directly, without having to set the paths explicitly.

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)

model_trainer = model_trainer.with_tensorboard_output_config(TensorBoardOutputConfig())
model_trainer.train()

This way the user can still call .train() multiple times consecutively. If we instead resolved the full unique training job name once during initialization of ModelTrainer, .train() could only be called once, because every later call would reuse the same job name.
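
A rough sketch of the idea, under stated assumptions: `TrainerSketch`, its `clock` argument, and the returned dictionary are hypothetical and only illustrate resolving the job name and default TensorBoard path per .train() call. This is not the SDK's implementation.

```python
from typing import Callable, Dict


class TrainerSketch:
    """Hypothetical stand-in for ModelTrainer, used only to show the naming logic."""

    def __init__(self, bucket: str, base_job_name: str, clock: Callable[[], str]):
        self.bucket = bucket
        self.base_job_name = base_job_name
        self._clock = clock  # injected so the example is deterministic

    def train(self) -> Dict[str, str]:
        # Resolve the unique job name on every call, not once in __init__.
        job_name = f"{self.base_job_name}-{self._clock()}"
        prefix = f"s3://{self.bucket}/{self.base_job_name}/{job_name}"
        # Default TensorBoard path sits next to the artifacts of the same job.
        return {
            "job_name": job_name,
            "artifacts": f"{prefix}/artifacts",
            "tensorboard": f"{prefix}/tensorboard",
        }


# Fake clock that yields a new (assumed-format) timestamp per call.
ticks = iter(["20250228-100000", "20250228-100500"])
trainer = TrainerSketch("default-bucket", "my-job", clock=lambda: next(ticks))

first = trainer.train()
second = trainer.train()
print(first["tensorboard"])
print(second["tensorboard"])
```

Each call gets its own job name and its own tensorboard prefix under the same base_job_name folder, so repeated .train() calls do not collide.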
