
Is there any update plan for Adaface? #91

Closed · whalefa1I opened this issue Jun 5, 2022 · 35 comments

@whalefa1I

FYI https://paperswithcode.com/paper/adaface-quality-adaptive-margin-for-face

@leondgarse (Owner)

I'll take a look, thanks for the reminder. I'm just currently occupied with something else. For this project I've been trying partialFC on Glint360K recently, and it's taking a long training time...

@whalefa1I (Author)

I've seen that some people used Gradient Accumulation when training CLIP multimodal pre-training models; take a look and see whether it can help you.
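
For reference, a minimal sketch of the idea in TF/Keras (a generic illustration, not this repo's API; the model / loss_fn / optimizer arguments and the accum_steps value are assumptions):

import tensorflow as tf

# Gradient accumulation: sum gradients over `accum_steps` small batches,
# then apply them once, emulating a larger effective batch size.
def make_accum_train_step(model, loss_fn, optimizer, accum_steps=4):
    accum = [tf.Variable(tf.zeros_like(vv), trainable=False) for vv in model.trainable_variables]

    def train_step(step, images, labels):  # `step` is a Python int batch counter
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(images, training=True)) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        for acc, grad in zip(accum, grads):
            acc.assign_add(grad)
        if (step + 1) % accum_steps == 0:  # apply once every accum_steps batches
            optimizer.apply_gradients(zip(accum, model.trainable_variables))
            for acc in accum:
                acc.assign(tf.zeros_like(acc))
        return loss

    return train_step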

@leondgarse (Owner) commented Jun 5, 2022

Working on it, working on it! I've already converted the r100 models for webface4m and webface12m; once I've run the validation I'll start writing the loss function.

@leondgarse (Owner)

AdaFaceLoss is updated:

  • Two converted models: r100 webface4m and r100 webface12m
  • AdaFaceLoss has only been run for a few batches to confirm the loss converges during training; there hasn't been a full training run yet

@whalefa1I (Author)

Both adaface and magface have a way of reflecting face image quality; just computing a norm or so is enough. Have you looked into that?

@leondgarse (Owner)

Right, Converted MagFace / AdaFace r50 / r100 model and face quality testing #57 is exactly that test on cfp_fp / agedb_30, using the norm value as the face quality score.
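
A minimal sketch of that usage, assuming one of the converted .h5 models that outputs embeddings (the file name, the (img - 127.5) / 128 preprocessing, and the random batch here are placeholders):

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Use the embedding L2 norm as a face quality score (the MagFace / AdaFace idea):
# a higher norm should roughly indicate a higher-quality face image.
model = keras.models.load_model("adaface_ir101_webface4m_rgb.h5", compile=False)
imgs = np.random.uniform(0, 255, size=[4, 112, 112, 3]).astype("float32")  # placeholder batch
embeddings = model((imgs - 127.5) * 0.0078125)  # assumed preprocessing
quality = tf.norm(embeddings, axis=-1).numpy()  # one norm-based quality score per face
print(quality)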

@whalefa1I (Author)

You work too fast! Is there a model available? The one I trained myself doesn't seem to work; why do blurry images score higher than good ones?

@whalefa1I (Author)

Oh, after refreshing I see the models now. I'll give them a try.

@leondgarse (Owner)

The face quality testing doesn't feel as good as magface; I still need to check how the paper actually uses it. The EffV2S, MagFace one in the Readme is self-trained, and its quality-testing results look okay.

@leondgarse (Owner)

There is also QMagFace: Simple and Accurate Quality-Aware Face Recognition, which takes the MagFace result and does further face quality training on top.

@leondgarse (Owner)

  • Judging from appendix B.1. Correlation between Norm and BRISQUE during Training in the paper, AdaFace's norm value doesn't seem usable for judging face quality
  • For the AdaFace head.py#L72 implementation safe_norms = safe_norms.clone().detach(), I feel the entire margin calculation should be put inside tf.stop_gradient, which also matches the paper's description Gradient doesn't flow to ∥zi∥ (see the sketch after this list)
  • The models uploaded earlier were ported from the official ones and use BGR input; I re-uploaded two adaface_ir101_webface*m_rgb.h5 models that take RGB input and corrected the accuracy on the validation datasets
  • Don't use the current AdaFace implementation for now; it still needs a training run for verification
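
For illustration, a minimal sketch of that stop_gradient idea for the margin scaler (variable names follow the snippets later in this thread; h=0.333 and eps=1e-3 are the paper's defaults; this is a simplified fragment, not the repo's exact implementation):

import tensorflow as tf

# Compute the AdaFace adaptive margin scaler with the whole calculation
# cut off from the gradient, so no gradient flows back to ||z_i||.
def margin_scaler_sketch(feature_norm, batch_mean, batch_std, hh=0.333, eps=1e-3):
    margin_scaler = (feature_norm - batch_mean) / (batch_std + eps)
    margin_scaler = tf.clip_by_value(margin_scaler * hh, -1.0, 1.0)
    return tf.stop_gradient(margin_scaler)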

@whalefa1I (Author)

1. How do you port a torch model to tf format? Do you have to re-implement the code in the other framework and retrain, or is it enough to just take the weights?
2. Is there any way to fix the random initialization values across the two frameworks, to check whether the reproduction results are consistent?

@leondgarse (Owner)

  1. Convert the weights directly; the specific procedure is in Atom_notebook/adaface-model, using my other project keras_cv_attention_models. At its core this is a weight-layout transpose, as shown in the sketch after this list.
  2. For the random initialization values, you can usually fix the random seed, or you can initialize the weights in pytorch and convert them to a keras model the way method 1 does. But there are many other issues, e.g. SGD's weight_decay works differently, and AdaFace training adds some random-crop / random-quality augmentation, so just fixing the initialization values can't guarantee exactly reproducing the training process.
  3. My earlier training runs all used AdamW, EfficientNetV2S + adamw / r100 + adamw. A few days ago I found some problems with adamw training: it makes batch_norm's moving_variance very large, which may be what caused loss=nan. I'm re-running with sgd / sgdw.
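
For illustration, a minimal sketch of the weight-layout transpose at the heart of step 1 (the shape is an example value; the real procedure in download_and_load.py also aligns layer names and handles dense / bn weights):

import numpy as np

# PyTorch Conv2d weights are (out_channels, in_channels, kh, kw);
# Keras Conv2D kernels are (kh, kw, in_channels, out_channels).
torch_kernel = np.random.uniform(size=[64, 3, 7, 7]).astype("float32")
keras_kernel = np.transpose(torch_kernel, (2, 3, 1, 0))
print(keras_kernel.shape)  # (7, 7, 3, 64)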

leondgarse reopened this Jun 23, 2022
@whalefa1I (Author)

You are my god.

@leondgarse (Owner)

That's really not necessary.

@whalefa1I (Author)

I put download_and_load and test_images from keras_cv_attention_models into the project, then also added net and head from adaface. The downloaded ckpt is "adaface_ir101_webface4m.ckpt", but I got an error during the conversion:

====================
stack1_block1_shortcut_conv
Traceback (most recent call last):
  File "/data/xixi/project/Github/Keras_insightface/torch_model_conversion.py", line 21, in <module>
    download_and_load.keras_reload_from_torch_model(
  File "/data/xixi/project/Github/Keras_insightface/download_and_load.py", line 311, in keras_reload_from_torch_model
    keras_reload_stacked_state_dict(keras_model, stacked_state_dict, aligned_names, additional_transfer, save_name=save_name)
  File "/data/xixi/project/Github/Keras_insightface/download_and_load.py", line 166, in keras_reload_stacked_state_dict
    torch_weight[0] = np.transpose(torch_weight[0], (2, 3, 1, 0))
  File "<__array_function__ internals>", line 180, in transpose
  File "/home/nlp/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 660, in transpose
    return _wrapfunc(a, 'transpose', axes)
  File "/home/nlp/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
ValueError: axes don't match array

Specifically, in the 'stack1_block1_shortcut_conv' layer the dimensions are not those of an ordinary convolution, so the transpose fails.

Is there anything else I need to change?

@whalefa1I (Author)

The code used is:

import numpy as np
import models, download_and_load

mm = models.buildin_models('r100', output_layer='E', activation="PReLU", bn_momentum=0.9, bn_epsilon=1e-5, use_bias=True, scale=False, use_max_pool=True)

tail_align_dict = {"shortcut_conv": -4, "shortcut_bn": -5}
full_name_align_dict = {"E_batchnorm": 3, "E_dense": 4, "pre_embedding": 5}
# [25088, 512] -> CHW + out [512, 7, 7, 512] -> HWC + out [7, 7, 512, 512] -> [25088, 512]
additional_transfer={
    "E_dense": lambda ww: [ww[0].reshape(512, 7, 7, 512).transpose([1, 2, 0, 3]).reshape([-1, 512]), ww[1]],
    "pre_embedding": lambda ww: [np.zeros(512), *ww],
}
download_and_load.keras_reload_from_torch_model(
    'adaface_ir101_webface4m.ckpt',
    keras_model=mm,
    tail_align_dict=tail_align_dict,
    full_name_align_dict=full_name_align_dict,
    additional_transfer=additional_transfer,
    input_shape=(112, 112),
    do_convert=True,
    save_name="adaface_ir101_webface4m.h5",
)

@leondgarse (Owner) commented Jun 23, 2022

Your Keras_insightface/backbones/resnet.py probably isn't updated. With use_max_pool=True specified, stack_1_block_1 has no shortcut_conv, which is exactly the resnet structure adaface uses. Update your Keras_insightface/backbones/resnet.py.
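
That also explains the traceback: with the layer alignment off by one, a non-4D weight (e.g. a 1-D batch-norm vector) lands on the conv branch, and the 4-axis transpose fails. A minimal reproduction (the shape here is just an example):

import numpy as np

ww = np.zeros([64])  # a 1-D bn weight where a 4-D conv kernel was expected
np.transpose(ww, (2, 3, 1, 0))  # ValueError: axes don't match array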

@whalefa1I (Author)

import numpy as np
import tensorflow as tf

y_true = tf.one_hot(tf.random.uniform([32], 1, 10, dtype='int32'), 10)
y_pred = tf.random.uniform([32, 10])
y_pred_norm = tf.concat([y_pred, tf.norm(y_pred, axis=-1, keepdims=True)], axis=-1)
import losses
aa = losses.AdaFaceLoss()
print(aa(y_true, y_pred_norm))


import torch
import head
from torch.nn import CrossEntropyLoss
bb = head.AdaFace(embedding_size=10, classnum=32)
cc = bb(torch.from_numpy(y_pred_norm[:, :-1].numpy()), torch.from_numpy(y_pred_norm[:, -1:].numpy()), torch.from_numpy(np.argmax(y_true, axis=-1)))
cross_entropy_loss = CrossEntropyLoss()
loss = cross_entropy_loss(cc, torch.from_numpy(np.argmax(y_true, axis=-1)))
print(loss)

Is it because in the original code the head first applies a randomly initialized fully-connected layer to compute the cosine for the margin loss, while in keras you already normalize before the input reaches the loss, so the two aren't numerically comparable? When checking whether backprop is reproduced, is "roughly the same, matching the paper's description, and converging" generally good enough, or should we find a way to strictly control the values?

@leondgarse (Owner) commented Jun 24, 2022

Ah, you mean that. This comparison test needs some code changes:

  • In PyTorch's head.py, use the input embbedings directly as the cosine value
    65     def forward(self, embbedings, norms, label):
    66
    67         # kernel_norm = l2_norm(self.kernel,axis=0)
    68         # cosine = torch.mm(embbedings,kernel_norm)
    69         # cosine = cosine.clamp(-1+self.eps, 1-self.eps) # for stability
    70         cosine = embbedings
  • In Keras's losses.py AdaFaceLoss, at line 408 uncomment return arcface_logits to return arcface_logits directly
    408        return arcface_logits
    409        # return tf.keras.losses.categorical_crossentropy(y_true, arcface_logits, from_logits=self.from_logits, label_smoothing=self.label_smoothing)
  • Test
    y_true = tf.one_hot(tf.random.uniform([32], 1, 10, dtype='int32'), 10)
    y_pred = tf.random.uniform([32, 10])
    y_pred_norm = tf.concat([y_pred, tf.norm(y_pred, axis=-1, keepdims=True)], axis=-1)
    import losses
    aa = losses.AdaFaceLoss()
    aa(y_true, y_pred_norm)
    
    sys.path.append('../AdaFace-master/')
    import torch
    import head
    bb = head.AdaFace(t_alpha=0.01)
    cc = bb(torch.from_numpy(y_pred_norm[:, :-1].numpy()), torch.from_numpy(y_pred_norm[:, -1:].numpy()), torch.from_numpy(np.argmax(y_true, axis=-1)))
    
    print(f"{aa(y_true, y_pred_norm).numpy() = }, {cc.mean() = }")
    # aa(y_true, y_pred_norm).numpy() = 30.912012, cc.mean() = tensor(30.9092)
    Removing the 64x scale factor, the two values are basically identical:
    print(f"{aa(y_true, y_pred_norm).numpy() / 64 = }, {cc.mean() / 64 = }")
    # aa(y_true, y_pred_norm).numpy() / 64 = 0.4830001890659332, cc.mean() / 64 = tensor(0.4830)

@leondgarse (Owner)

Did your model conversion above succeed?

@whalefa1I (Author)

It worked! It was indeed the shortcut issue.

@whalefa1I (Author)

  • For the AdaFace head.py#L72 implementation safe_norms = safe_norms.clone().detach(), I feel the entire margin calculation should be put inside tf.stop_gradient, which also matches the paper's description Gradient doesn't flow to ∥zi∥
norm_mean = tf.stop_gradient(tf.math.reduce_mean(feature_norm))
samples = tf.cast(tf.maximum(1, feature_norm.shape[0] - 1), feature_norm.dtype)
norm_std = tf.stop_gradient(tf.sqrt(tf.math.reduce_sum((feature_norm - norm_mean) ** 2) / samples))  # Torch std
self.batch_mean.assign(self.mean_std_alpha * norm_mean + (1.0 - self.mean_std_alpha) * self.batch_mean)
self.batch_std.assign(self.mean_std_alpha * norm_std + (1.0 - self.mean_std_alpha) * self.batch_std)

Is there anything specific that needs changing? I don't see much difference. Or is the stop-gradient logic different between the two frameworks?

@leondgarse (Owner)

Updated. Training hadn't finished before, so this part wasn't updated yet.

@whalefa1I (Author)

In that case, would the equivalent pytorch be

with torch.no_grad():
    mean = safe_norms.mean().detach()
    std = safe_norms.std().detach()
    self.batch_mean = mean * self.t_alpha + (1 - self.t_alpha) * self.batch_mean
    self.batch_std = std * self.t_alpha + (1 - self.t_alpha) * self.batch_std

    margin_scaler = (safe_norms - self.batch_mean) / (self.batch_std + self.eps)  # 66% between -1, 1
    margin_scaler = margin_scaler * self.h  # 68% between -0.333, 0.333 when h:0.333
    margin_scaler = torch.clip(margin_scaler, -1, 1)

or is it fine in torch to leave those last lines outside the no_grad block?

@leondgarse (Owner)

I'm not that familiar with pytorch, but based on some articles, e.g. Difference between detach().clone() and clone().detach(), I think safe_norms = safe_norms.clone().detach() should be equivalent to putting all the safe_norms-related computation inside torch.no_grad; using clone().detach() this way is perhaps just a surer guarantee that the gradient is cut off.
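
A tiny check of that equivalence (generic pytorch, unrelated to the AdaFace code):

import torch

xx = torch.ones(3, requires_grad=True)

aa = (xx * 2).clone().detach()  # clone().detach(): result is cut off from the graph
print(aa.requires_grad)  # False

with torch.no_grad():  # no_grad: ops inside the block aren't tracked at all
    bb = xx * 2
print(bb.requires_grad)  # False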

@leondgarse (Owner)

This one explains it better: Detach, no_grad and requires_grad

@whalefa1I (Author)

Feels right. It's probably just double insurance, or at most torch.no_grad adds a memory optimization and is a bit faster.

@leondgarse (Owner)

The current results look pretty good, r50 + SGD + AdaFace, 53 epochs:

import os
import losses, train, models
import tensorflow_addons as tfa
from tensorflow import keras

keras.mixed_precision.set_global_policy("mixed_float16")

data_basic_path = '/datasets/ms1m-retinaface-t1'
data_path = data_basic_path + '_112x112_folders'
eval_paths = [os.path.join(data_basic_path, ii) for ii in ['lfw.bin', 'cfp_fp.bin', 'agedb_30.bin']]

basic_model = models.buildin_models('r50', dropout=0.4, emb_shape=512, output_layer='E', bn_momentum=0.9, bn_epsilon=1e-5, scale=True, use_bias=False, activation='prelu', use_max_pool=True)
basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)

tt = train.Train(data_path, eval_paths=eval_paths,
    save_path='TT_r50_max_pool_E_prelu_dr04_lr_01_l2_5e4_adaface_emb512_sgd_m09_bs512_ms1m_64_only_margin_SG_scale_true_bias_false_random_100.h5',
    basic_model=basic_model, model=None, lr_base=0.1, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-6, lr_warmup_steps=3,
    batch_size=512, random_status=100, eval_freq=4000, output_weight_decay=1)

# optimizer = tfa.optimizers.AdamW(learning_rate=1e-2, weight_decay=5e-4, exclude_from_weight_decay=["/gamma", "/beta"])
# optimizer = tfa.optimizers.SGDW(learning_rate=1e-2, weight_decay=5e-6, momentum=0.9, exclude_from_weight_decay=["/gamma", "/beta"])
optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
sch = [
    {"loss": losses.AdaFaceLoss(scale=64), "epoch": 53, "optimizer": optimizer},
]
tt.train(sch, 0)

[training curves plot: r50_sgd_adaface]

| TAR@FAR | 1e-06 | 1e-05 | 0.0001 | 0.001 | 0.01 | 0.1 | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| r50 IJBB | 0.393379 | 0.91334 | 0.955501 | 0.970204 | 0.978773 | 0.986465 | 0.993366 |
| r50 IJBC | 0.888633 | 0.952702 | 0.969269 | 0.979496 | 0.985734 | 0.991052 | 0.995485 |

The PyTorch results from 26 epochs of training:

| Arch | Dataset | Method | IJBB TAR@FAR=0.01% | IJBC TAR@FAR=0.01% |
| --- | --- | --- | --- | --- |
| R50 | WebFace4M | AdaFace | 95.44 | 96.98 |
| R50 | MS1MV2 | AdaFace | 94.82 | 96.27 |

@whalefa1I (Author)

This one is pretty fun; feels like it can just be attached after the model.

@leondgarse (Owner)

Took a rough look, and it doesn't seem fast: for a single image, get_scaled_quality calls the forward pass 100 times, and get_gradients then calls the backward pass. It doesn't feel easy to integrate with the current implementation.

@leondgarse (Owner)

The Adaface + r100 training results should be uploaded in the next few days. The 53-epoch results are IJBB 0.961636, IJBC 0.972849, versus PyTorch's 26-epoch IJBB 95.84, IJBC 97.09.

@whalefa1I (Author)

Congratulations!!! Amazing!!! Then ghostnet alone will be enough for me! Take care of yourself!

@leondgarse (Owner)

The r100 training results are uploaded; they can serve as a reference for training ghostnet.
