Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate progress for fine-tuning and inferencing #477

Open
Nateowami opened this issue Sep 4, 2024 · 8 comments
Open

Separate progress for fine-tuning and inferencing #477

Nateowami opened this issue Sep 4, 2024 · 8 comments
Assignees
Labels
enhancement New feature or request sf_watching Scripture Forge should be updated when this is resolved or updated

Comments

@Nateowami
Copy link
Collaborator

Background

Currently a TranslationBuild has a PercentCompleted property that estimates the overall completion of finetuning + inferencing. I would like a separate progress for each step so that we can break them out separately for the user.

Proposal

Add these properties to the TranslationBuild:

  • public double? FineTuneProgress
  • public double? InferenceProgress

Both values should range from 0 to 1, and should be null when the associated process has not been started. You could pick other names, but I would ask that "percent" not be used in the name, because it is not actually a percentage (the misleading name of PercentCompleted resulted in progress being shown incorrectly for a short time).

@Nateowami Nateowami added sf_watching Scripture Forge should be updated when this is resolved or updated enhancement New feature or request labels Sep 4, 2024
@johnml1135 johnml1135 self-assigned this Nov 1, 2024
@johnml1135 johnml1135 added this to Serval Nov 1, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Serval Nov 1, 2024
@Enkidu93
Copy link
Collaborator

@johnml1135, this might be another good next issue for @mudiagaobrikisil (?).

@johnml1135
Copy link
Collaborator

Hmmm... there are a few ways to go about this:
Option 1: Simple

  • We know that on an A100, it takes around 1.6 sec/training step and 0.5 sec/pretranslation. We would know at the beginning of the job how many training steps and how many segments to pretranlate. We can do a simple calculation and split up the phases of the job appropriately:
  • Example: 5000 steps and 10,000 segments -> 8000 seconds for training and 5000 seconds for pretranslating ->
    • Training would be 61.5% and pretranslating would be 28.5%.
  • We can also give an estimate of when the job will complete (calculated in Serval) and put it in the new statistics/runData field.

Option 2: Give the real data

  • This would require substantially more data piping changes, feeding the actual numbers up to ClearML and then retrieving them from Serval and then feeding them up through the new statistics/runData field.

@johnml1135
Copy link
Collaborator

johnml1135 commented Dec 5, 2024

Option 3: Pass a Json string in the existing message in this format

{
  "startTime": <unix timestamp>,
  "estimatedEndTime": <unix timestamp>, 
  "currentPhaseIndex": 1,  // zero indexed
  "phases":
  [
    {
      "name": "Training",
      "totalSteps": 5000,
      "currentStep": 5000,
    },
    {
      "name": "Pretranslation",
      "totalSteps": 5000,
      "currentStep": 5000,
    }
  ]
}

@ddaspit - @Nateowami and I came up with this scheme to get detailed data to SF with minimal Serval changes.
@pmachapman - do you have any thoughts or proposed changes?

@pmachapman
Copy link
Collaborator

@johnml1135 Would you add a new property and have its type as this new object?

@johnml1135
Copy link
Collaborator

johnml1135 commented Dec 9, 2024

Here are the API level options that I can think of:
Option 1 - repurpose Build message

  • Replace the "message" field getBuild with a Json string with lots of data

Option 2 - Add new object

  • Add new fields to the Build object, detailing every desired statistic

Option 3 - extend "ExecutionData":

  • A new field "ExecutionData" is being added (under review) that is a dict<string,sting>. It could handle some or all of this data.

Option 4 - hybrid:

  • Add most fields to executionData, but also create a new field executionPhases as an array similar to the one above. This could hold just the static "this is the plan" data while executionData holds the current state in fields such as currentPhaseIndex and currentStepInCurrentPhase along with the other meta-data such as estimatedEndTime.

@pmachapman - you will likely be writing the code to interpret and utilize the data. What would be best for you?

@johnml1135
Copy link
Collaborator

@mudiagaobrikisil - you can start the calculation of the more accurate phases. Please talk to @mshannon-sil about how to get setup with ClearML sessions to utilize a GPU - or on how to start the task of from locally.

@pmachapman
Copy link
Collaborator

@johnml1135
Option 1 will break compatibility, as we already retrieve the message property.

I think options 2-4 are preferable - which ever is easier to implement or fits the existing architecture better. Perhaps the only problem with option 3 is that key names may change? (I'm a strongly typed kinda guy...)

Option 2 would be the easiest for me, as you would likely have a new version of the Serval.Client library anytime you add more fields to it, and these would be able to be documented in the code docs for the new properties, etc.

@ddaspit
Copy link
Contributor

ddaspit commented Dec 11, 2024

We should go with option 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request sf_watching Scripture Forge should be updated when this is resolved or updated
Projects
Status: 🆕 New
Development

No branches or pull requests

6 participants