Mainstream speech super-resolution methods fall into three categories: frequency-domain, time-domain, and time-frequency-domain approaches. Over time, the field has shifted from early mapping-based methods to more recent generative-network-based ones.
A deep neural network approach to speech bandwidth expansion ICASSP 2015
Method: A DNN predicts the high-band log-magnitude spectrum from an input of 9 frames of normalized low-band log-magnitude spectra, trained with an MMSE loss; the high-band phase is obtained by flipping (mirroring) the low-band phase.
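A minimal sketch of the spectral-flip reconstruction step, assuming the DNN has already produced the high-band log-magnitude; the function name, tensor shapes, and the use of PyTorch are illustrative, not from the paper.

```python
import torch

def reconstruct_highband(lowband_stft: torch.Tensor, pred_high_logmag: torch.Tensor) -> torch.Tensor:
    """Rebuild the high-band complex spectrum from a predicted log-magnitude by
    mirroring ("flipping") the low-band phase. Shapes assumed: (freq_bins, frames)."""
    high_mag = torch.exp(pred_high_logmag)             # undo the log compression
    flipped_phase = torch.angle(lowband_stft).flip(0)  # mirror low-band phase into the high band
    return torch.polar(high_mag, flipped_phase)        # magnitude + phase -> complex spectrum
```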
Speech Super Resolution Generative Adversarial Network ICASSP 2019
Method: A convolutional recurrent network (CRN) is used as the generator and a CNN as the discriminator, and the adversarial training objective includes a gradient penalty term.
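The notes only state that a gradient penalty is applied; the WGAN-GP-style penalty on the discriminator is the most common form, so the sketch below assumes that variant. All names are placeholders, and the paper's exact formulation may differ.

```python
import torch

def gradient_penalty(discriminator, real, fake):
    """WGAN-GP-style gradient penalty: push the discriminator's gradient norm
    toward 1 on random interpolations between real and generated audio.
    Assumes waveforms shaped (batch, channels, samples)."""
    alpha = torch.rand(real.size(0), 1, 1, device=real.device)        # one mix ratio per example
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    score = discriminator(interp)
    grads = torch.autograd.grad(outputs=score.sum(), inputs=interp, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```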
Time-Frequency Loss for CNN Based Speech Super-Resolution ICASSP 2020
Method: The authors attribute TFNet's performance gain to its simultaneous use of time-domain and frequency-domain information inside the network, which, however, costs extra computation. Instead, this paper trains a single network with MAE losses in both the time and frequency domains; the frequency-domain term uses only the magnitude spectrum, on the grounds that the magnitude spectrogram already exposes time-frequency structure while the plain frequency spectrum does not. The two terms are summed with a weight chosen by grid search.
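A minimal sketch of such a combined time-frequency MAE loss; the STFT parameters and the weight value are placeholders (the paper selects the weight by grid search), and the function name is illustrative.

```python
import torch

def time_frequency_mae(pred_wave, target_wave, alpha=0.85, n_fft=512, hop=128):
    """MAE on waveforms plus MAE on STFT magnitudes, blended by `alpha`.
    `alpha`, `n_fft`, and `hop` are placeholder values, not the paper's."""
    window = torch.hann_window(n_fft, device=pred_wave.device)
    time_loss = (pred_wave - target_wave).abs().mean()
    pred_mag = torch.stft(pred_wave, n_fft, hop, window=window, return_complex=True).abs()
    target_mag = torch.stft(target_wave, n_fft, hop, window=window, return_complex=True).abs()
    freq_loss = (pred_mag - target_mag).abs().mean()
    return alpha * time_loss + (1 - alpha) * freq_loss
```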
Low-Complexity Streaming Speech Super-Resolution MLSP 2023
Method: The model is a simple MLP trained with the time-frequency (TF) loss; the network weights are initialized with Kaiming uniform initialization.
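A minimal sketch of an MLP with explicit Kaiming uniform initialization; the layer dimensions are placeholders and not taken from the paper.

```python
import torch.nn as nn

def make_mlp(in_dim=257, hidden=512, out_dim=257):
    """Small MLP whose Linear layers are re-initialized with Kaiming uniform.
    All dimensions are illustrative."""
    mlp = nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )
    for layer in mlp:
        if isinstance(layer, nn.Linear):
            nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)
    return mlp
```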
A Two-stage Approach to Speech Bandwidth Extension Interspeech 2021
Method: A two-stage approach: a frequency-domain stage followed by a time-domain stage. The first stage uses a TCN, which the authors argue is preferable because its receptive field is larger than that of a CRN; the second stage uses a simplified Wave-U-Net.
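To make the receptive-field argument concrete, the sketch below shows a stack of dilated 1-D convolutions of the kind used in TCNs; with kernel size k and dilations doubling per layer over L layers, the receptive field grows to 1 + (k - 1)(2^L - 1) frames. Channel count and depth are illustrative, not the paper's configuration.

```python
import torch.nn as nn

def tcn_stack(channels=64, kernel=3, num_layers=6):
    """Stack of dilated 1-D convolutions; doubling the dilation each layer gives a
    receptive field of 1 + (kernel - 1) * (2**num_layers - 1) frames."""
    layers = []
    for i in range(num_layers):
        dilation = 2 ** i
        layers += [
            nn.Conv1d(channels, channels, kernel, dilation=dilation,
                      padding=dilation * (kernel - 1) // 2),  # keep the frame count unchanged
            nn.PReLU(),
        ]
    return nn.Sequential(*layers)
```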
BAE-Net: a Low Complexity and High Fidelity Bandwidth-Adaptive Neural Network for Speech Super-Resolution ICASSP 2024
Method: A dual-stream (magnitude and phase) model in which the magnitude stream can also be used on its own. The overall architecture is a CRN with an attention-like module at the tail of the model, and the two streams exchange information with each other. The discriminator is a multi-scale STFT (MS-STFT) discriminator.
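A minimal sketch of a multi-scale STFT (MS-STFT) discriminator in the spirit described: one small CNN per STFT resolution, each judging the real/imaginary spectrogram of the waveform. The resolutions and layer configuration are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class STFTDiscriminator(nn.Module):
    """Small CNN scoring the real/imaginary STFT at one resolution (illustrative)."""
    def __init__(self, n_fft, hop):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, wave):  # wave: (batch, samples)
        spec = torch.stft(wave, self.n_fft, self.hop,
                          window=torch.hann_window(self.n_fft, device=wave.device),
                          return_complex=True)
        feat = torch.stack([spec.real, spec.imag], dim=1)  # (batch, 2, freq, frames)
        return self.net(feat)

class MSSTFTDiscriminator(nn.Module):
    """Multi-scale STFT discriminator: one sub-discriminator per STFT resolution."""
    def __init__(self, resolutions=((512, 128), (1024, 256), (2048, 512))):
        super().__init__()
        self.subs = nn.ModuleList([STFTDiscriminator(n, h) for n, h in resolutions])

    def forward(self, wave):
        return [d(wave) for d in self.subs]  # list of per-resolution score maps
```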