Support mixed precision training #2455
Conversation
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2455. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.
cibot: @jihochu, A builder check could not be completed because one of the checkers did not finish. To find out the reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2455-202402021547410.65952706336975-af2e8829e8e0ac70333370e438e9b7b37bc604f2/.
cibot: @jihochu, A builder check could not be completed because one of the checkers did not finish. To find out the reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2455-202402021637130.47981810569763-5ce56ff64b70e29561125de65169bff8ee06a41d/.
It checks derivative validity after backwarding, and applies the gradient only if the derivative validation succeeds. Signed-off-by: Jiho Chu <[email protected]>
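To illustrate the idea in this commit, a minimal, self-contained sketch follows; the plain float buffers and function names are illustrative only, not nntrainer's actual Tensor or RunLayerContext API. The gradient is scanned for NaN/Inf after backwarding, and the update is applied only when every value is finite.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Returns true only if every element of the derivative is finite (no NaN/Inf).
bool derivative_is_valid(const std::vector<float> &grad) {
  for (float g : grad)
    if (!std::isfinite(g))
      return false;
  return true;
}

// Apply the gradient only when the derivative validation succeeds;
// otherwise skip this update step.
void apply_gradient_if_valid(std::vector<float> &weight,
                             const std::vector<float> &grad, float lr) {
  if (!derivative_is_valid(grad)) {
    std::cout << "invalid derivative, skipping update\n";
    return;
  }
  for (std::size_t i = 0; i < weight.size(); ++i)
    weight[i] -= lr * grad[i];
}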
A clone method taking a tensor type is added for creating a tensor with a different data type. Also, some convenience methods for the loss scale are added. Signed-off-by: Jiho Chu <[email protected]>
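As a rough illustration of what cloning into a different data type amounts to (templated on plain element types purely for exposition, not nntrainer's Tensor::clone signature): the data is copied while each value is cast to the target type.

#include <vector>

// Illustrative only: copy a tensor's data into a new buffer whose element
// type differs from the source, casting each value on the way.
template <typename To, typename From>
std::vector<To> clone_as(const std::vector<From> &src) {
  std::vector<To> dst;
  dst.reserve(src.size());
  for (const From &v : src)
    dst.push_back(static_cast<To>(v));
  return dst;
}

// Usage: std::vector<double> wide = clone_as<double>(std::vector<float>{1.0f});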
It adds tests for conv2d fp16. Signed-off-by: Jiho Chu <[email protected]>
It fixes doxygen comments reported by the clang-format checker. Signed-off-by: Jiho Chu <[email protected]>
It installs the loss_layer header file for custom loss layers. Signed-off-by: Jiho Chu <[email protected]>
It is assumed that activations and weights are fully compatible, so no conversion is necessary. The input and loss layers are exceptions, because input data and label data are currently assumed to always be float32. Signed-off-by: Jiho Chu <[email protected]>
An internal tensor or a gradient may contain an invalid value. This patch checks the validity of the data and fixes it. Also, the sscal API is replaced with scopy for setZero, because scaling produces an invalid value when the input value is already invalid. Signed-off-by: Jiho Chu <[email protected]>
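The reason scaling cannot clear a buffer that may already hold invalid values: 0 * NaN is still NaN, so an sscal-based setZero leaves the garbage in place, while an overwrite always yields zeros. A small self-contained sketch, with plain loops standing in for the BLAS sscal/scopy calls:

#include <algorithm>
#include <vector>

// Mimics a setZero() built on sscal: x[i] *= 0. If x[i] is already NaN or
// Inf, the product is NaN, so the buffer is NOT cleaned.
void set_zero_by_scaling(std::vector<float> &x) {
  for (float &v : x)
    v *= 0.0f;
}

// Mimics a setZero() built on a copy/fill: the previous contents are ignored,
// so the buffer is guaranteed to hold zeros afterwards.
void set_zero_by_copy(std::vector<float> &x) {
  std::fill(x.begin(), x.end(), 0.0f);
}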
@jihochu, 💯 All CI checkers are successfully verified. Thanks.
Recommendation: Keep a PR with every related commit as a test basis and mark it "Do Not Merge" or "Draft PR".
if (num_w_opt_m > 0)
  run_context->getWeightOptMasterVar(i, j).read(file);
else
  run_context->getWeightOptVar(i, j).read(file);
This needs to be reversed. The base model data needs to be saved in FP16, not FP32. We could read the FP16 data and copy it into the master weight.
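A hedged sketch of the direction this comment asks for, with illustrative names rather than the PR's read()/master-weight API: the serialized copy is FP16, and on load it is widened back into the FP32 master weights. The half-to-float helper below handles normals, zero, and inf/NaN only, to keep the sketch short.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Widen IEEE 754 half bits to float (subnormals are flushed to zero here).
float half_to_float(uint16_t h) {
  uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t mant = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0)
    bits = sign;                              // +/-0 (subnormals dropped)
  else if (exp == 31)
    bits = sign | 0x7F800000u | (mant << 13); // inf / NaN
  else
    bits = sign | ((exp + 112u) << 23) | (mant << 13);
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// The checkpoint holds the FP16 weights; the FP32 master copy is rebuilt by
// widening them, instead of saving and reloading the FP32 master directly.
void restore_master_from_fp16(const std::vector<uint16_t> &fp16_weights,
                              std::vector<float> &fp32_master) {
  fp32_master.resize(fp16_weights.size());
  for (std::size_t i = 0; i < fp16_weights.size(); ++i)
    fp32_master[i] = half_to_float(fp16_weights[i]);
}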
                           weight_tensor_type);
TensorDim hidden_state_dim(batch_size, 1, max_timestep, unit,
                           weight_tensor_type);
hidden_state_dim.setDataType(context.getActivationDataType());
This could be just TensorDim hidden_state_dim(batch_size, 1, max_timestep, unit, context.getActivationDataType()).
void Manager::deallocateWeights() { weight_pool.deallocate(); }
void Manager::deallocateWeights() {
  weight_pool.deallocate();
  weight_master_pool.deallocate();
I don't think we need a separate pool for the weight master.
dim_a.setDataType(act_type);
var = weight_pool.requestOrExtend(shared_name, dim_a, var_exec_order,
                                  var_ls, t_initializer);
var_m = weight_master_pool.requestOrExtend(
I think the tensor pool can manage this if we just request it from weight_pool.
dim_a.setDataType(act_type);
var = weight_pool.request(name, dim_a, var_exec_order, var_ls,
                          t_initializer);
var_m = weight_master_pool.request(name, dim, var_exec_order, var_ls,
The execution order of var_m should be applyGradient_order only.
@@ -353,10 +363,15 @@ sharedConstTensors NetworkGraph::forwarding(
  bool training,
  std::function<void(std::shared_ptr<LayerNode>, bool)> forwarding_op,
  std::function<bool(void *userdata)> stop_cb, void *userdata) {

for (auto w : clip_weights) {
I wonder if the gradient clip property also has to be enabled to use mixed precision training. I guess this PR doesn't consider the case where mixed precision and gradient clipping are enabled together.
This PR is to update the mixed precision layer.
- integrate nnstreamer#2568 & nnstreamer#2455
- will update more tests

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>
closed by #2663
It adds a loss scale factor for removing invalid data while training.
The factor is dynamically calculated during the gradient clipping step, and it is initially disabled until
the loss scale property is set.
The fc/pooling/conv2d/softmax layers are modified for loss scale and mixed tensor types.
Signed-off-by: Jiho Chu [email protected]
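The description above says the loss scale factor is adjusted dynamically during the gradient clipping step and only takes effect once the loss scale property is set. A minimal, self-contained sketch of that kind of dynamic loss scaling follows; the initial scale, growth interval, and back-off policy are illustrative assumptions, not the values used in this PR.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Illustrative dynamic loss scaler: the loss is multiplied by `scale` before
// backwarding; after backwarding, gradients are unscaled and checked. If any
// gradient is NaN/Inf, the step is skipped and the scale is halved; after
// enough consecutive good steps, the scale is doubled again.
struct LossScaler {
  float scale = 65536.0f;    // assumed initial scale, not the PR's default
  int good_steps = 0;
  int growth_interval = 2000; // assumed growth interval

  // Returns true if the (unscaled) gradients are usable for the update.
  bool unscale_and_check(std::vector<float> &grads) {
    bool finite = true;
    for (float &g : grads) {
      g /= scale;
      if (!std::isfinite(g))
        finite = false;
    }
    if (!finite) {
      scale = std::max(1.0f, scale * 0.5f); // back off on overflow
      good_steps = 0;
      return false;
    }
    if (++good_steps >= growth_interval) {
      scale *= 2.0f;                        // grow again when stable
      good_steps = 0;
    }
    return true;
  }
};

int main() {
  LossScaler scaler;
  std::vector<float> grads = {1.0f * scaler.scale, 2.0f * scaler.scale};
  if (scaler.unscale_and_check(grads))
    std::cout << "apply gradients, scale=" << scaler.scale << '\n';
  else
    std::cout << "skip step, scale lowered to " << scaler.scale << '\n';
}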