[Embedding] Add GPU fused embedding ops. #64

Open: wants to merge 49 commits into base: main

Commits (49):
a566c1c
Update fused embedding modelzoo perf benchmark
RandyInterfish Jan 10, 2022
4567918
Update perf benchmark under docs
RandyInterfish Jan 10, 2022
222e123
Merge branch 'main' of https://github.com/nvzhou/DeepRec-public into …
RandyInterfish Jan 10, 2022
fe129e8
Add nvtx to fused embedding ops
RandyInterfish Jan 14, 2022
13906e4
refactor a little big
RandyInterfish Jan 20, 2022
95a08ff
Update: add unique sub op and use cu event to sync
RandyInterfish Jan 27, 2022
f2ca5ea
Minor change
RandyInterfish Feb 9, 2022
1cf4c57
minor update
RandyInterfish Feb 10, 2022
22e936e
minor update
RandyInterfish Feb 11, 2022
3fab0e8
Update: minor change
RandyInterfish Feb 14, 2022
d9765a5
temp update
RandyInterfish Feb 14, 2022
3b00778
kernel impl compile pass
RandyInterfish Feb 15, 2022
7589dc1
Update: pre-lookup ready. Unit tests all passed
RandyInterfish Feb 22, 2022
d65cec1
postlookup grad done except for unit test
RandyInterfish Mar 1, 2022
cca0b38
Unit test passed
RandyInterfish Mar 10, 2022
279a6de
python api works right
RandyInterfish Mar 10, 2022
6569c30
Update: try to optimize
RandyInterfish Mar 15, 2022
6c29295
Split pre_embedding_lookup
RandyInterfish Mar 15, 2022
84209bd
Update: python a[i
RandyInterfish Mar 16, 2022
78dab1d
Merge branch 'main' into features/gpu_embedding_fusion
RandyInterfish Mar 16, 2022
8c5b929
Merge branch 'features/gpu_embedding_fusion' into features/gpu_embedd…
RandyInterfish Mar 16, 2022
572c7d8
Update: modifying partition_select op
RandyInterfish Mar 17, 2022
c449de3
Update: modify partition_select
RandyInterfish Mar 18, 2022
22506c5
3rd version. Code modifying complete. No compile yet
RandyInterfish Mar 18, 2022
85fc956
Update: compilee pass
RandyInterfish Mar 18, 2022
04d4e60
add more partition strategies
RandyInterfish Mar 19, 2022
241e086
Add one test
RandyInterfish Mar 19, 2022
75a3bd4
Add more unit tests
RandyInterfish Mar 19, 2022
b738015
post op ut passed
RandyInterfish Mar 19, 2022
9ce8da1
ut all passed
RandyInterfish Mar 19, 2022
cf96164
optimize prune and fill moore
RandyInterfish Mar 19, 2022
99cd77e
Minor fixed
RandyInterfish Mar 20, 2022
e9115c3
Update: fix bug and update perf number for modelzoo
RandyInterfish Mar 22, 2022
091748e
Merge branch 'main' of https://github.com/nvzhou/DeepRec-public into …
RandyInterfish Mar 30, 2022
d59c9ce
update api def and golden
RandyInterfish Mar 30, 2022
3ecff7f
Update doc
RandyInterfish Apr 7, 2022
6a13c5b
Ajust interface to V2
RandyInterfish Apr 21, 2022
896d87e
Merge branch 'main' into features/gpu_embedding_fusion
RandyInterfish Apr 26, 2022
901fa8a
Make embedding fusion v1 and v2 compatible
RandyInterfish Apr 26, 2022
132b2ae
Update: delete comments
RandyInterfish Jun 28, 2022
400e0bc
Merge branch 'main' of https://github.com/alibaba/DeepRec into featur…
RandyInterfish Jun 28, 2022
676a324
temp
RandyInterfish Jul 1, 2022
18c62ad
prune and fill with sparse weight seems fine
RandyInterfish Jul 4, 2022
6daa4f4
added sparse_weight to postlookup and grad
RandyInterfish Jul 5, 2022
adf5dae
sparse_weight seems okay
RandyInterfish Jul 13, 2022
d352bfc
Merge branch 'main' of https://github.com/alibaba/DeepRec into featur…
RandyInterfish Jul 13, 2022
14a680b
modify modelzoo
RandyInterfish Jul 15, 2022
aa9f524
Fix unique_with_counts pre-volta hang issue
RandyInterfish Jul 29, 2022
4dfc639
Merge branch 'main' of https://github.com/alibaba/DeepRec into featur…
RandyInterfish Sep 6, 2022
138 changes: 81 additions & 57 deletions docs/Fused-Embedding.md
Expand Up @@ -2,32 +2,39 @@

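## Introduction

The native embedding lookup APIs in DeepRec and TensorFlow, such as safe_embedding_lookup_sparse, create a fairly large number of ops, so execution on GPU easily becomes kernel-launch bound. The Embedding subgraph Fusion feature therefore provides a set of interfaces and a set of fusion ops. The fused ops reduce the number of kernels that must be launched and provide high-performance implementations, which accelerates execution on GPU.
The native embedding lookup APIs in DeepRec and TensorFlow, such as safe_embedding_lookup_sparse, create a fairly large number of ops, so execution on GPU easily becomes kernel-launch bound, and some of these ops only have CPU implementations, which are relatively slow. The Embedding subgraph Fusion feature therefore provides a set of interfaces and a set of fusion ops. The fused ops reduce the number of kernels that must be launched and provide high-performance implementations, which accelerates execution.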


## FeatureColumn interface

Users work through FeatureColumn as the interface. embedding_column returns an instance of an EmbeddingColumn class. The commonly used EmbeddingColumn classes are:

1. `EmbeddingColumn` in `tensorflow/python/feature_column/feature_column_v2.py`
1. `_EmbeddingColumn` in `tensorflow/contrib/layers/python/layers/feature_column.py`
1. `EmbeddingColumn` in `tensorflow/python/feature_column/feature_column_v2.py`
2. `_EmbeddingColumn` in `tensorflow/contrib/layers/python/layers/feature_column.py`

The instance is then usually passed to a high-level interface such as `tf.feature_column.input_layer` or `tf.feature_column_ops.input_from_feature_columns` to build the lookup-related computation graph.
The Embedding subgraph Fusion feature therefore adds a `do_fusion` attribute to the `EmbeddingColumn` classes above. It defaults to `False`; users can explicitly set it to `True` so that the embedding lookup uses the fused ops.
The Embedding subgraph Fusion feature therefore adds a `do_fusion` attribute to the `EmbeddingColumn` classes above. It defaults to None; users can explicitly set it to a fusion version such as `'v1'` or `'v2'` so that the embedding lookup uses the fused ops.
For example:


a. tf.feature_column.embedding_column

```python
import tensorflow as tf
from tensorflow.python.framework import ops


columns = tf.feature_column.categorical_column_with_embedding("col_emb", dtype=tf.dtypes.int64)
W = tf.feature_column.embedding_column(categorical_column=columns,
column = tf.feature_column.categorical_column_with_embedding("col_emb", dtype=tf.dtypes.int64)
W = tf.feature_column.embedding_column(
categorical_column=column,
dimension=3,
initializer=tf.ones_initializer(tf.dtypes.float32),
do_fusion=True)
do_fusion='v2')

ids={}
ids["col_emb"] = tf.SparseTensor(indices=[[0,0],[1,1],[2,2],[3,3],[4,4]], values=tf.cast([1,2,3,4,5], tf.dtypes.int64), dense_shape=[5, 4])

# Pass in the EmbeddingColumn instance that has use_fused_lookup set
# Pass in the EmbeddingColumn instance that has do_fusion set
emb = tf.feature_column.input_layer(ids, [W])
fun = tf.multiply(emb, 2.0, name='multiply')
loss = tf.reduce_sum(fun, name='reduce_sum')
Expand All @@ -43,6 +50,9 @@ with tf.Session() as sess:
print(sess.run([emb, train_op,loss]))
print(sess.run([emb, train_op,loss]))
```

b. tf.contrib.layers.python.layers.feature_column.embedding_column

```python
import tensorflow as tf
from tensorflow.python.framework import ops
Expand All @@ -54,7 +64,7 @@ columns = feature_column.sparse_column_with_embedding(column_name="col_emb", dty
W = feature_column.embedding_column(sparse_id_column=columns,
dimension=3,
initializer=tf.ones_initializer(tf.dtypes.float32),
do_fusion=True)
do_fusion='v2')


ids={}
Expand Down Expand Up @@ -87,12 +97,20 @@ def fused_safe_embedding_lookup_sparse(embedding_weights,
name=None,
partition_strategy="div",
max_norm=None,
prune=True):
prune=True,
blocknums=None,
fusion_version='v2'):
```
This interface is functionally equivalent to DeepRec's `safe_embedding_lookup_sparse`, so its parameters are not repeated here; see the relevant documentation.


## fused_embedding_lookup_sparse interface

### Using the v1 version

Via `nn.fused_embedding_lookup_sparse`:
```python
@tf_export(v1=["nn.fused_embedding_lookup_sparse"])
def fused_embedding_lookup_sparse(params,
sp_ids,
sparse_weights=None,
Expand All @@ -102,73 +120,79 @@ def fused_embedding_lookup_sparse(params,
max_norm=None,
default_id=None,
prune_invalid_ids=False,
fill_empty_row=True,
blocknums=None):
```

### Using the v2 version

Via `nn.fused_embedding_lookup_sparse_v2`:
```python
@tf_export(v1=["nn.fused_embedding_lookup_sparse_v2"])
def fused_embedding_lookup_sparse_v2(params,
sp_ids,
sparse_weights=None,
partition_strategy=None,
name=None,
combiner=None,
max_norm=None,
default_id=None,
prune=False,
fill_empty_row=True,
blocknums=None):
```

### Parameters

- `params`: A list containing either a single embedding tensor or a set of partitioned embedding tensors. Every embedding tensor must have rank 2.
- `sp_ids`: A SparseTensor whose values are the ids to look up. Its indices must have rank 2. Its dense_shape must have rank 1 and contain 2 elements.
- `sparse_weights`: Weights for the values of sparse_ids.
- `sparse_weights`: Weights for the values of sparse_ids. Not supported yet.
- `partition_strategy`: The partition strategy of the embedding tensor.
- `name`: The name of this operation.
- `combiner`: The strategy used to combine along the entry dimension.
- `max_norm`: If not None, the l2 norm of each embedding vector is computed, and vectors whose norm exceeds max_norm are normalized.
- `default_id`: Empty rows are filled with default_id. If default_id is None, they are filled with 0 by default.
- `prune_invalid_ids`: Whether to remove invalid values (id < 0) from sparse_ids.
- `default_id`: If `fill_empty_row=True`, empty rows are filled with default_id. If default_id is None, they are filled with 0 by default.
- `fill_empty_row`: Whether to fill empty rows of sparse_ids; used together with `default_id`.
- `prune_invalid_ids` or `prune`: Whether to remove invalid values.
- `blocknums`: A parameter used by DynamicEmbeddingVariable.
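
The effect of the pruning and empty-row parameters can be illustrated with a small pure-Python sketch. This is only a reference for the semantics described in the list above, not the fused kernel: the helper name `prune_and_fill` and its list-based inputs are hypothetical, and placing the filled id at column 0 of an empty row is an assumption.

```python
from typing import List, Optional, Tuple


def prune_and_fill(indices: List[Tuple[int, int]],
                   values: List[int],
                   num_rows: int,
                   prune_invalid_ids: bool = False,
                   fill_empty_row: bool = True,
                   default_id: Optional[int] = None):
    """Hypothetical reference for the documented semantics (not the GPU op)."""
    pairs = list(zip(indices, values))
    if prune_invalid_ids:
        # Drop negative ids, following the "id < 0" rule in the parameter list.
        pairs = [(idx, v) for idx, v in pairs if v >= 0]
    if fill_empty_row:
        # default_id=None means "fill with 0", as stated in the parameter list.
        fill = 0 if default_id is None else default_id
        occupied = {row for (row, _col), _v in pairs}
        for row in range(num_rows):
            if row not in occupied:
                # Assumption: the filled entry goes to column 0 of the row.
                pairs.append(((row, 0), fill))
    pairs.sort(key=lambda p: p[0])  # keep row-major order
    return [idx for idx, _ in pairs], [v for _, v in pairs]


indices, values = prune_and_fill(
    indices=[(0, 0), (1, 1), (3, 0)],
    values=[7, -1, 9],
    num_rows=4,
    prune_invalid_ids=True,
    fill_empty_row=True)
print(indices)
print(values)
```

Row 1 becomes empty once its only id is pruned, so in this sketch it is filled just like row 2, which was empty from the start.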


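## Notes
1. `v2` currently has only a GPU implementation.
2. `v2` currently supports `sparse_weights`; `v1` does not yet.
3. Dynamic elastic dimensions, Multi-Hash Variable, and AdaptiveEmbedding are not supported yet; support will be added gradually.
4. When using GPU fusion, consider `export TF_GPU_THREAD_MODE="gpu_private"` and `export TF_GPU_THREAD_COUNT=1`. Testing showed that with a large number of features, launching kernels from a single GPU thread has lower overhead, which helps speed things up further.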
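
For scripts that cannot rely on shell `export`, the same two variables can be set from Python. The sketch below is one way to do it under an assumption: the helper name `configure_gpu_private_thread` is hypothetical, and setting the variables before TensorFlow is imported is the conservative choice, since they are read when the GPU devices are initialized.

```python
import os


def configure_gpu_private_thread(thread_count: int = 1) -> None:
    """Set the two environment variables from the note for this process."""
    os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
    os.environ["TF_GPU_THREAD_COUNT"] = str(thread_count)


configure_gpu_private_thread()
# Import TensorFlow only after the variables are in place, for example:
# import tensorflow as tf
```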


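1. Embedding subgraph Fusion currently supports execution on Nvidia GPUs. The corresponding `tf.Variable`, `EmbeddingVariable`, and other operators may be placed on CPU. A CPU version of the Embedding Fusion subgraph feature is under development.
1. Setting weights through `sparse_weights` is not supported yet.
1. partition_strategy currently supports only div, with the embedding tensor split along axis = 0. If the embedding tensor is an EmbeddingVariable, it must currently be a single complete ev; looking up a partitioned ev is not supported yet.
1. Dynamic elastic dimensions, Multi-Hash Variable, and AdaptiveEmbedding are not supported yet; support will be added gradually.
## Op overview and computation graph
The following Fused Embedding operators were added:
## Op overview

### Fused Embedding V1 operators: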

1. FusedEmbeddingSparsePreLookUp
2. FusedEmbeddingSparsePostLookUp
3. FusedEmbeddingSparsePostLookUpGrad

FusedEmbeddingSparsePreLookUp is mainly responsible for filling empty rows, pruning invalid ids, and splitting the values and indices of sp_ids according to partition_strategy.
tf.Gather is placed on the same device as the EmbeddingVariable or tf.Variable. With partitioning there may be several copies on different devices (distributed). It receives the values and indices split by PreEmbedding and performs the actual embedding vector lookup.
FusedEmbeddingSparsePostLookUp collects the embedding vectors back from each partition and then applies the combiner, max_norm, and related operations.
FusedEmbeddingSparsePostLookUpGrad computes the backward gradient of FusedEmbeddingSparsePostLookUp.
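
The combiner and max_norm step of the post-lookup stage can be sketched in plain Python. This is an illustration under stated assumptions, not the fused kernel: `combine_rows` and `clip_by_norm` are hypothetical helpers, `sparse_weights` is ignored, the combiner set `sum`/`mean`/`sqrtn` follows the usual TensorFlow embedding lookup options, and vectors exceeding max_norm are assumed to be rescaled to exactly max_norm.

```python
import math
from typing import List, Optional


def clip_by_norm(vec: List[float], max_norm: float) -> List[float]:
    """Rescale vec to l2 norm max_norm if its norm exceeds max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm > max_norm:
        return [x * max_norm / norm for x in vec]
    return list(vec)


def combine_rows(row_ids: List[int],
                 vectors: List[List[float]],
                 num_rows: int,
                 combiner: str = "mean",
                 max_norm: Optional[float] = None) -> List[List[float]]:
    """Combine looked-up vectors per output row (unweighted sketch)."""
    if combiner not in ("sum", "mean", "sqrtn"):
        raise ValueError("unknown combiner: %s" % combiner)
    dim = len(vectors[0])
    sums = [[0.0] * dim for _ in range(num_rows)]
    counts = [0] * num_rows
    for row, vec in zip(row_ids, vectors):
        if max_norm is not None:
            vec = clip_by_norm(vec, max_norm)
        for d in range(dim):
            sums[row][d] += vec[d]
        counts[row] += 1
    out = []
    for row in range(num_rows):
        if combiner == "sum" or counts[row] == 0:
            scale = 1.0
        elif combiner == "mean":
            scale = 1.0 / counts[row]
        else:  # "sqrtn": divide by the square root of the entry count
            scale = 1.0 / math.sqrt(counts[row])
        out.append([s * scale for s in sums[row]])
    return out


result = combine_rows(
    row_ids=[0, 0, 1],
    vectors=[[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]],
    num_rows=3,
    combiner="mean",
    max_norm=1.0)
print(result)
```

In this sketch a row with no entries stays all zeros; in the real pipeline the pre-lookup stage can fill such rows with `default_id` first.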

### Fused Embedding V2 operators:

Taking the low-level interface `fused_embedding_lookup_sparse` as an example, calling it creates the following computation graph:
![img_1.png](Fused-Embedding/img_1.png)

1. **FusedEmbeddingSparsePreLookUp** is mainly responsible for filling empty rows, pruning invalid ids, and splitting the values and indices of sp_ids according to partition_strategy.
1. PruneInvalidAndFillEmptyRows
2. UniqueWithCountsV3
3. PartitionWithPermutation
4. FusedEmbeddingSparsePostLookUpV2
5. FusedEmbeddingSparsePostLookUpV3Grad

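2. **tf.Gather** is placed on the same device as the **EmbeddingVariable** or **tf.Variable**. With partitioning there may be several copies on different devices (distributed). It receives the values and indices split by PreEmbedding and performs the actual embedding vector lookup.
Calling `fused_embedding_lookup_sparse_v2` creates the computation graph in the following order:

3. **FusedEmbeddingSparsePostLookUp** collects the embedding vectors back from each partition and then applies the combiner, max_norm, and related operations.

4. **FusedEmbeddingSparsePostLookUpGrad** computes the backward gradient of FusedEmbeddingSparsePostLookUp.
1. PruneInvalidAndFillEmptyRows removes invalid values and fills empty rows.
2. UniqueWithCountsV2 performs a unique operation on sparse_ids, which can reduce communication volume in multi-node, multi-GPU setups.
3. PartitionWithPermutation is created when sparse_ids needs to be partitioned, and partitions the ids according to the selected strategy.
4. **tf.Gather** is placed on the same device as the **EmbeddingVariable** or **tf.Variable**. With partitioning there may be several copies on different devices (distributed). It performs the actual embedding vector lookup.
5. **FusedEmbeddingSparsePostLookUp** collects the embedding vectors back from each partition and then applies the combiner, max_norm, and related operations.
6. **FusedEmbeddingSparsePostLookUpGrad** computes the backward gradient of FusedEmbeddingSparsePostLookUp.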
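
Why the unique step reduces traffic can be seen in a short pure-Python sketch: only the distinct ids are looked up, and an inverse index restores the per-id order afterwards. The helper `unique_with_counts` below is a hypothetical CPU stand-in that returns ids in order of first occurrence; the ordering produced by the GPU op may differ, and the `fetched` dictionary merely simulates the lookup.

```python
from typing import Dict, List, Tuple


def unique_with_counts(ids: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Return (unique ids, inverse index per input id, count per unique id)."""
    unique: List[int] = []
    inverse: List[int] = []
    counts: List[int] = []
    position: Dict[int, int] = {}
    for v in ids:
        if v not in position:
            position[v] = len(unique)
            unique.append(v)
            counts.append(0)
        inverse.append(position[v])
        counts[position[v]] += 1
    return unique, inverse, counts


ids = [5, 2, 5, 5, 9, 2]
unique, inverse, counts = unique_with_counts(ids)

# Only len(unique) embedding rows need to be fetched instead of len(ids).
fetched = {u: [float(u)] for u in unique}  # stand-in for the real lookup
restored = [fetched[unique[i]] for i in inverse]
print(unique, inverse, counts)
```

The counts are also what a backward pass would need to accumulate gradients for ids that appear more than once.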

## Performance comparison
In modelzoo, the performance of several models was compared with unfused and fused embedding (averaged over 5000 iterations).

Machine:
8 cores AMD EPYC 7232P CPU @ 3.20GHz.

A100-80GB-PCIE GPU

DLRM Model:

| | Avg Time per Iteration |
| ------- | ---------------------- |
| Unfused | 20.78 ms |
| Fused | 17.41 ms |
| SpeedUp | 1.19x |

DeepFM Model:

| | Avg Time per Iteration |
| ------- | ---------------------- |
| Unfused | 37.24 ms |
| Fused | 30.98 ms |
| SpeedUp | 1.20x |

WDL Model:

| | Avg Time per Iteration |
| ------- | ---------------------- |
| Unfused | 36.38 ms |
| Fused | 34.52 ms |
| SpeedUp | 1.05x |
For GPU results of the v2 operators, see the test data under `modelzoo/features/GPUFusedEmbedding`.
Binary file removed docs/Fused-Embedding/img_1.png
Binary file not shown.
3 changes: 2 additions & 1 deletion modelzoo/features/gpu_fused_embedding/.gitignore
Expand Up @@ -2,4 +2,5 @@
*/result/model_*
record.py
*.sh
*.nsys-rep
*.nsys-rep
*.gz
18 changes: 12 additions & 6 deletions modelzoo/features/gpu_fused_embedding/deepfm/README.md
Expand Up @@ -7,15 +7,21 @@ The only difference is that this model use GPU Fused Embedding to acclerate the
```python
categorical_embedding_column = tf.feature_column.embedding_column(
categorical_column, dimension=16, combiner='mean',
do_fusion=True)
do_fusion='v2')
```

## Benchmark

On A100-80GB-PCIE GPU, with 8 cores AMD EPYC 7232P CPU @ 3.20GHz. Average of 5000 iterations. The perf boost:

| | Avg Time per Iteration |
| ------- | ---------------------- |
| Unfused | 37.24 ms |
| Fused | 30.98 ms |
| SpeedUp | 1.20x |
Let TensorFlow use a private single thread for GPU kernels:

```bash
export TF_GPU_THREAD_MODE="gpu_private"
export TF_GPU_THREAD_COUNT=1
```

| | Unfused | Fused | Speedup |
| ---------------------------- | ------- | ------- | ------- |
| Step Time, Batch Size = 512 | 31.2ms | 24.1ms | 1.29x |
| Step Time, Batch Size = 4096 | 57.1ms | 44.0ms | 1.29x |
2 changes: 1 addition & 1 deletion modelzoo/features/gpu_fused_embedding/deepfm/train.py
Expand Up @@ -95,7 +95,7 @@ def build_feature_cols():

categorical_embedding_column = tf.feature_column.embedding_column(
categorical_column, dimension=16, combiner='mean',
do_fusion=True)
do_fusion='v2')

wide_column.append(categorical_embedding_column)
deep_column.append(categorical_embedding_column)
21 changes: 14 additions & 7 deletions modelzoo/features/gpu_fused_embedding/dlrm/README.md
Expand Up @@ -7,15 +7,22 @@ The only difference is that this model use GPU Fused Embedding to acclerate the
```python
categorical_embedding_column = tf.feature_column.embedding_column(
categorical_column, dimension=16, combiner='mean',
do_fusion=True)
do_fusion='v2')
```

## Benchmark

On A100-80GB-PCIE GPU, with 8 cores AMD EPYC 7232P CPU @ 3.20GHz. Average of 5000 iterations. The perf boost:
On A100-80GB-PCIE GPU, with 8 cores AMD EPYC 7232P CPU @ 3.20GHz. Average of 5000 iterations.
Let TensorFlow use a private single thread for GPU kernels:

| | Avg Time per Iteration |
| ------- | ---------------------- |
| Unfused | 20.78 ms |
| Fused | 17.41 ms |
| SpeedUp | 1.19x |
```bash
export TF_GPU_THREAD_MODE="gpu_private"
export TF_GPU_THREAD_COUNT=1
```

The perf boost:

| | Unfused | Fused | Speedup |
| ---------------------------- | ------- | ------- | ------- |
| Step Time, Batch Size = 512 | 19.98ms | 14.81ms | 1.34x |
| Step Time, Batch Size = 4096 | 37.82ms | 28.82ms | 1.31x |
6 changes: 3 additions & 3 deletions modelzoo/features/gpu_fused_embedding/dlrm/train.py
Expand Up @@ -96,7 +96,7 @@ def build_feature_cols():
tf.feature_column.embedding_column(categorical_column,
dimension=16,
combiner='mean',
do_fusion=True))
do_fusion='v2'))
else:
column = tf.feature_column.numeric_column(column_name, shape=(1, ))
dense_column.append(column)
Expand Down Expand Up @@ -288,7 +288,7 @@ def optimizer(self):
tf.summary.scalar('loss', loss)

self.global_step = tf.train.get_or_create_global_step()
optimizer = tf.train.GradientDescentOptimizer(
optimizer = tf.train.AdamOptimizer(
learning_rate=self.learning_rate)

train_op = optimizer.minimize(loss, global_step=self.global_step)
Expand Down Expand Up @@ -619,4 +619,4 @@ def main(tf_config=None, server=None):
server=server)
else:
print("Task type or index error.")
sys.exit()
sys.exit()
18 changes: 12 additions & 6 deletions modelzoo/features/gpu_fused_embedding/wide_and_deep/README.md
Expand Up @@ -8,15 +8,21 @@ The only difference is that this model use GPU Fused Embedding to acclerate the
deep_columns.append(tf.feature_column.embedding_column(
categorical_column,
dimension=EMBEDDING_DIMENSIONS[column_name],
combiner='mean', do_fusion=True))
combiner='mean', do_fusion='v2'))
```

## Benchmark

On A100-80GB-PCIE GPU, with 8 cores AMD EPYC 7232P CPU @ 3.20GHz. Average of 5000 iterations. The perf boost:

| | Avg Time per Iteration |
| ------- | ---------------------- |
| Unfused | 36.38 ms |
| Fused | 34.52 ms |
| SpeedUp | 1.05x |
Let TensorFlow use a private single thread for GPU kernels:

```bash
export TF_GPU_THREAD_MODE="gpu_private"
export TF_GPU_THREAD_COUNT=1
```

| | Unfused | Fused | Speedup |
| ---------------------------- | ------- | ------- | ------- |
| Step Time, Batch Size = 512 | 41.3ms | 38.4ms | 1.07x |
| Step Time, Batch Size = 4096 | 75.1ms | 66.5ms | 1.12x |
Original file line number Diff line number Diff line change
Expand Up @@ -163,7 +163,7 @@ def minmaxscaler(col):
deep_columns.append(tf.feature_column.embedding_column(
categorical_column,
dimension=EMBEDDING_DIMENSIONS[column_name],
combiner='mean', do_fusion=True))
combiner='mean', do_fusion='v2'))
else:
normalizer_fn = None
i = CONTINUOUS_COLUMNS.index(column_name)