Separate progress for fine-tuning and inferencing #477

Nateowami · 2024-09-04T15:30:52Z

Background

Currently a TranslationBuild has a PercentCompleted property that estimates the overall completion of finetuning + inferencing. I would like a separate progress for each step so that we can break them out separately for the user.

Proposal

Add these properties to the TranslationBuild:

public double? FineTuneProgress
public double? InferenceProgress

Both values should range from 0 to 1, and should be null when the associated process has not been started. You could pick other names, but I would ask that "percent" not be used in the name, because it is not actually a percentage (the misleading name of PercentCompleted resulted in progress being shown incorrectly for a short time).

The text was updated successfully, but these errors were encountered:

Enkidu93 · 2024-11-26T01:28:31Z

@johnml1135, this might be another good next issue for @mudiagaobrikisil (?).

johnml1135 · 2024-11-26T17:01:46Z

Hmmm... there are a few ways to go about this:
Option 1: Simple

We know that on an A100, it takes around 1.6 sec/training step and 0.5 sec/pretranslation. We would know at the beginning of the job how many training steps and how many segments to pretranlate. We can do a simple calculation and split up the phases of the job appropriately:
Example: 5000 steps and 10,000 segments -> 8000 seconds for training and 5000 seconds for pretranslating ->
- Training would be 61.5% and pretranslating would be 28.5%.
We can also give an estimate of when the job will complete (calculated in Serval) and put it in the new statistics/runData field.

Option 2: Give the real data

This would require substantially more data piping changes, feeding the actual numbers up to ClearML and then retrieving them from Serval and then feeding them up through the new statistics/runData field.

johnml1135 · 2024-12-05T18:44:26Z

Option 3: Pass a Json string in the existing message in this format

{
  "startTime": <unix timestamp>,
  "estimatedEndTime": <unix timestamp>, 
  "currentPhaseIndex": 1,  // zero indexed
  "phases":
  [
    {
      "name": "Training",
      "totalSteps": 5000,
      "currentStep": 5000,
    },
    {
      "name": "Pretranslation",
      "totalSteps": 5000,
      "currentStep": 5000,
    }
  ]
}

@ddaspit - @Nateowami and I came up with this scheme to get detailed data to SF with minimal Serval changes.
@pmachapman - do you have any thoughts or proposed changes?

pmachapman · 2024-12-08T22:34:22Z

@johnml1135 Would you add a new property and have its type as this new object?

johnml1135 · 2024-12-09T15:47:23Z

Here are the API level options that I can think of:
Option 1 - repurpose Build message

Replace the "message" field getBuild with a Json string with lots of data

Option 2 - Add new object

Add new fields to the Build object, detailing every desired statistic

Option 3 - extend "ExecutionData":

A new field "ExecutionData" is being added (under review) that is a dict<string,sting>. It could handle some or all of this data.

Option 4 - hybrid:

Add most fields to executionData, but also create a new field executionPhases as an array similar to the one above. This could hold just the static "this is the plan" data while executionData holds the current state in fields such as currentPhaseIndex and currentStepInCurrentPhase along with the other meta-data such as estimatedEndTime.

@pmachapman - you will likely be writing the code to interpret and utilize the data. What would be best for you?

johnml1135 · 2024-12-09T16:55:24Z

@mudiagaobrikisil - you can start the calculation of the more accurate phases. Please talk to @mshannon-sil about how to get setup with ClearML sessions to utilize a GPU - or on how to start the task of from locally.

pmachapman · 2024-12-09T18:45:59Z

@johnml1135
Option 1 will break compatibility, as we already retrieve the message property.

I think options 2-4 are preferable - which ever is easier to implement or fits the existing architecture better. Perhaps the only problem with option 3 is that key names may change? (I'm a strongly typed kinda guy...)

Option 2 would be the easiest for me, as you would likely have a new version of the Serval.Client library anytime you add more fields to it, and these would be able to be documented in the code docs for the new properties, etc.

ddaspit · 2024-12-11T14:37:16Z

We should go with option 2.

Nateowami added sf_watching Scripture Forge should be updated when this is resolved or updated enhancement New feature or request labels Sep 4, 2024

johnml1135 self-assigned this Nov 1, 2024

johnml1135 added this to Serval Nov 1, 2024

github-project-automation bot moved this to 🆕 New in Serval Nov 1, 2024

johnml1135 assigned mudiagaobrikisil and unassigned johnml1135 Nov 26, 2024

johnml1135 assigned johnml1135 and unassigned mudiagaobrikisil Dec 4, 2024

johnml1135 assigned mudiagaobrikisil Dec 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separate progress for fine-tuning and inferencing #477

Separate progress for fine-tuning and inferencing #477

Nateowami commented Sep 4, 2024

Enkidu93 commented Nov 26, 2024

johnml1135 commented Nov 26, 2024

johnml1135 commented Dec 5, 2024 •

edited

Loading

pmachapman commented Dec 8, 2024

johnml1135 commented Dec 9, 2024 •

edited

Loading

johnml1135 commented Dec 9, 2024

pmachapman commented Dec 9, 2024

ddaspit commented Dec 11, 2024

Separate progress for fine-tuning and inferencing #477

Separate progress for fine-tuning and inferencing #477

Comments

Nateowami commented Sep 4, 2024

Background

Proposal

Enkidu93 commented Nov 26, 2024

johnml1135 commented Nov 26, 2024

johnml1135 commented Dec 5, 2024 • edited Loading

pmachapman commented Dec 8, 2024

johnml1135 commented Dec 9, 2024 • edited Loading

johnml1135 commented Dec 9, 2024

pmachapman commented Dec 9, 2024

ddaspit commented Dec 11, 2024

johnml1135 commented Dec 5, 2024 •

edited

Loading

johnml1135 commented Dec 9, 2024 •

edited

Loading