
Using HuBERT features to train SyncNet, the loss does not converge #150

hnsywangxin opened this issue Jun 29, 2024 · 2 comments
@hnsywangxin

I replaced the mel spectrogram with HuBERT features to train Wav2Lip, and training runs, but when training SyncNet the loss keeps hovering around 0.69 and won't go down. With mel spectrograms the loss does decrease. I would like to ask for help in finding the problem.

1: The face-encoding output of Wav2Lip has shape (8, 1024, 1, 1), where 8 is the batch size. However, the HuBERT features I use have shape (8, 1024, 10). The mel input has shape (8, 1, 80, 16), which after convolution becomes (8, 1024, 1, 1) and trains normally. So I first use permute to rearrange the dimensions, then a Conv1d to reduce the last dimension, finally obtaining (8, 1024, 1, 1). The code is as follows:
(screenshot of the code, omitted)
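For reference, a minimal sketch of the conversion described above, under my reading that the HuBERT features arrive as (B, T, dim) = (8, 10, 1024) before the permute; the layer name `reduce_t` is illustrative, not from the repo:

```python
import torch
import torch.nn as nn

B, T, D = 8, 10, 1024
hubert_feats = torch.randn(B, T, D)         # (8, 10, 1024) from HuBERT
x = hubert_feats.permute(0, 2, 1)           # (8, 1024, 10): channels first for Conv1d
reduce_t = nn.Conv1d(D, D, kernel_size=T)   # collapses the 10 frames to 1
x = reduce_t(x)                             # (8, 1024, 1)
x = x.unsqueeze(-1)                         # (8, 1024, 1, 1), matching the face embedding
print(tuple(x.shape))                       # (8, 1024, 1, 1)
```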

2: audio_encoder code:
(screenshot of the code, omitted)

I also made the network deeper, but it still didn't work. The new network is as follows:
(screenshot of the code, omitted)

I also changed BCE loss to MSE loss, but the loss still does not converge. Can you help me? Thanks!

@primepake
Owner

The loss should be BCE instead of MSE loss. Also, can you provide the code?
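For context, the original Wav2Lip SyncNet trains with BCE on the cosine similarity of the two embeddings. A minimal sketch (variable names illustrative; I clamp the similarity into [0, 1] here because `binary_cross_entropy` requires probability inputs, whereas the original code relies on the similarity being non-negative in practice):

```python
import torch
import torch.nn.functional as F

def cosine_bce_loss(audio_emb, face_emb, y):
    # audio_emb, face_emb: (B, 1024); y: (B, 1), 1 = in-sync pair, 0 = off-sync pair
    d = F.cosine_similarity(audio_emb, face_emb)  # (B,), values in [-1, 1]
    p = d.clamp(min=0)                            # BCE expects inputs in [0, 1]
    return F.binary_cross_entropy(p.unsqueeze(1), y)

audio = torch.randn(8, 1024)
face = torch.randn(8, 1024)
y = torch.randint(0, 2, (8, 1)).float()
loss = cosine_bce_loss(audio, face, y)           # scalar, non-negative
```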

@hnsywangxin
Author

hnsywangxin commented Jul 2, 2024

> The loss should be BCE instead of MSE loss. Also, can you provide the code?

Thanks for your reply. I used BCE loss, but the result is the same. I only changed syncnet.py; the other files are the same as in your repo, and my HuBERT features come from Meta's official HuBERT repo. My SyncNet is as follows:

import torch.nn as nn
import torch.nn.functional as F
# Conv2d here is the repo's Conv2d wrapper (with `residual` and `act` args);
# SameBlock1d, ResBlock1d, and DownBlock1d come from DINet's models/Syncnet.py.

class SyncNet_color_hubert(nn.Module):
    def __init__(self):
        super(SyncNet_color_hubert, self).__init__()

        self.face_encoder = nn.Sequential(
            Conv2d(15, 16, kernel_size=(7, 7), stride=1, padding=3, act="leaky"),  # 192, 384

            Conv2d(16, 32, kernel_size=5, stride=(1, 2), padding=1, act="leaky"),  # 192, 192
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(32, 64, kernel_size=3, stride=2, padding=1, act="leaky"),  # 96, 96
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(64, 128, kernel_size=3, stride=2, padding=1, act="leaky"),  # 48, 48
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(128, 256, kernel_size=3, stride=2, padding=1, act="leaky"),  # 24, 24
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(256, 512, kernel_size=3, stride=2, padding=1, act="leaky"),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),  # 12, 12

            Conv2d(512, 1024, kernel_size=3, stride=2, padding=1, act="leaky"),
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),  # 6, 6

            Conv2d(1024, 1024, kernel_size=3, stride=2, padding=1, act="leaky"),  # 3, 3
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=0, act="leaky"),
            Conv2d(1024, 1024, kernel_size=1, stride=1, padding=0, act="relu"))  # 1, 1

        self.audio_encoder = nn.Sequential(
            SameBlock1d(1024, 1024, kernel_size=7, padding=3), # 10
            ResBlock1d(1024, 1024, 3, 1),
            # 9-5
            DownBlock1d(1024, 1024, 3, 1), # 5
            ResBlock1d(1024, 1024, 3, 1),
            # 5 -3
            DownBlock1d(1024, 1024, 3, 1),
            ResBlock1d(1024, 1024, 3, 1),
            # 3-2
            DownBlock1d(1024, 1024, 3, 1),
            SameBlock1d(1024, 1024, kernel_size=3, padding=1)
        )
        self.global_avg1d = nn.AdaptiveAvgPool1d(1)

    def forward(self, audio_sequences, face_sequences):  # audio_sequences := (B, T, dim)
        face_embedding = self.face_encoder(face_sequences)
        audio_sequences = audio_sequences.permute(0, 2, 1)  # -> (B, dim, T) for Conv1d
        audio_embedding = self.audio_encoder(audio_sequences)  # (B, 1024, T')
        audio_embedding = self.global_avg1d(audio_embedding).unsqueeze(2)  # (B, 1024, 1, 1)
        audio_embedding = audio_embedding.view(audio_embedding.size(0), -1)
        face_embedding = face_embedding.view(face_embedding.size(0), -1)

        # audio_embedding = F.normalize(audio_embedding, p=2, dim=1)
        face_embedding = F.normalize(face_embedding, p=2, dim=1)

        return audio_embedding, face_embedding

ResBlock1d and DownBlock1d refer to DINet: https://github.com/MRzzm/DINet/blob/3b57fb0a2482213327890fbb76baeafdaa412597/models/Syncnet.py#L3 and https://github.com/MRzzm/DINet/blob/3b57fb0a2482213327890fbb76baeafdaa412597/models/Syncnet.py#L55
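A quick shape check (my sketch, not repo code) for the tail of the audio branch above: the encoder output (B, 1024, T') is averaged down to length 1 by AdaptiveAvgPool1d, so the extra unsqueeze(2) before view() is redundant but harmless, since view() flattens everything after the batch dimension anyway:

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool1d(1)
enc_out = torch.randn(8, 1024, 2)                    # stand-in for the encoder output
pooled = pool(enc_out)                               # (8, 1024, 1)
flat = pooled.unsqueeze(2).view(pooled.size(0), -1)  # (8, 1024)
print(tuple(flat.shape))                             # (8, 1024)
```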
Thanks again!
