ESP-SR Release V1.2
Documentation for this ESP-SR release is available at https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/index.html
We're excited to announce the release of ESP-SR v1.2.0, an advanced version of Espressif's Speech Recognition Library. This update brings five significant enhancements over the previous version and is designed specifically for the ESP32-S3 microcontroller platform.
- Improved accuracy: The speech recognition accuracy has been substantially increased, ensuring a more reliable and efficient user experience.
- Support for custom keywords: You can now easily add your own keywords for customized voice command recognition.
- Enhanced noise reduction: Our improved noise reduction algorithms help filter out unwanted background noise, enabling the system to perform better in a variety of environments.
- Lower memory usage: Optimizations in the library have resulted in reduced memory usage, ensuring efficient performance even on resource-constrained devices.
- Expanded documentation: The updated library comes with comprehensive documentation, making it easier for developers to integrate the library and understand its functionalities.
With ESP-SR v1.2.0, you can develop voice-controlled applications on Espressif microcontroller platforms with increased accuracy and efficiency. Download the latest release and explore the full potential of our Speech Recognition Library.
Supported SoCs
- ESP32
- ESP32-S3
- ESP32-S2, ESP32-C3 (Chinese TTS only)
MultiNet6
Methods
Training Framework
- The RNN-Transducer (RNN-T) framework is used to train MultiNet6. RNN-T combines high accuracy with naturally streaming recognition, making it a good choice for embedded systems.
- To further accelerate RNN-T inference, MultiNet6 skips some decoding frames based on CTC guidance. Please refer to "Accelerating RNN-T Training and Inference Using CTC Guidance" for details of this method.
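The CTC-guided frame skipping idea can be sketched as follows. This is an illustrative toy, not the ESP-SR implementation: the helper name, the use of per-frame blank probabilities from a lightweight CTC head, and the threshold value are all assumptions.

```python
import math

def select_decode_frames(blank_log_probs, blank_threshold=math.log(0.98)):
    """Return indices of frames the RNN-T decoder should still process.

    blank_log_probs: per-frame log-probability of the CTC blank symbol.
    Frames where the CTC head is highly confident of blank are assumed
    to emit no token, so the (more expensive) RNN-T decoder skips them.
    """
    return [t for t, lp in enumerate(blank_log_probs) if lp < blank_threshold]

# Toy example: 6 frames; frames 0, 2 and 4 are confident blanks.
blank_probs = [0.99, 0.10, 0.99, 0.20, 0.99, 0.50]
frames = select_decode_frames([math.log(p) for p in blank_probs])
print(frames)  # -> [1, 3, 5]
```

Only the kept frames are fed through the RNN-T prediction/joint network, which is where most of the decoding cost lies.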
Model
The model of MultiNet6 is based on Emformer. MultiNet6 has about 3.5 M parameters. Its CPU and memory consumption is shown below (tested on ESP32-S3):
Model | Parameters | CPU (one core) | PSRAM | SRAM
---|---|---|---|---
MultiNet5 | 2.2 M | 44% | 2310 KB | 16 KB
MultiNet6 | 3.5 M | 40% | 4000 KB | 48 KB
Compared to MultiNet5, MultiNet6 has more parameters but requires less computation, mainly because its encoder has a 4x downsampling factor.
As with MultiNet5, the weights of MultiNet6 are quantized to 8 bits.
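Plain symmetric per-tensor 8-bit quantization can be illustrated with a short sketch; this is a generic example of the technique, not the exact scheme used inside ESP-SR.

```python
def quantize_int8(weights):
    """Symmetric per-tensor 8-bit quantization (illustrative only).

    Maps floats into int8 codes in [-128, 127] using a single scale
    derived from the largest-magnitude weight.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.031, 0.0]
q, scale = quantize_int8(w)
print(q)                      # int8 codes, e.g. [50, -127, 3, 0]
print(dequantize(q, scale))   # approximate reconstruction of w
```

Each weight then needs one byte instead of four, which is what makes a 3.5 M-parameter model fit alongside the rest of the application in PSRAM.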
Decoding
MultiNet6 uses a Finite-State Transducer (FST) to build the language model, instead of the Trie used in MultiNet5. Beam search over an FST is more efficient and robust. Different modeling units are used for each language:
- For English, MultiNet6 uses subword units instead of the phonemes used in MultiNet5. The subword units are trained with SentencePiece; please refer to mn6_en.vocab for the subword inventory.
- For Chinese, MultiNet6 uses toneless Pinyin syllables. Please refer to mn6_cn.vocab for the syllable inventory.
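The toneless-Pinyin units can be illustrated with a small sketch. The numbered-Pinyin input convention and the `strip_tone` helper are assumptions for illustration, not part of ESP-SR.

```python
def strip_tone(syllable):
    """Drop the trailing tone digit from numbered Pinyin, e.g. 'zhong1' -> 'zhong'."""
    return syllable.rstrip("012345")

# A command phrase as numbered Pinyin syllables ("turn on the light").
phrase = ["da3", "kai1", "dian4", "deng1"]
print([strip_tone(s) for s in phrase])  # -> ['da', 'kai', 'dian', 'deng']
```

Dropping tones shrinks the unit inventory considerably (roughly 400 toneless syllables versus about 1300 tonal ones), which keeps the output layer and the FST small.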
Performance
Word Error Rate (WER) Performance Test
The WER results on some popular open-source datasets are shown below.
Model | aishell test
---|---
MultiNet5_cn | 9.5%
MultiNet6_cn | 5.2%
Model | librispeech test-clean | librispeech test-other
---|---|---
MultiNet5_en | 16.5% | 41.4%
MultiNet6_en | 9.0% | 21.3%
Note: Toneless Pinyin syllables are used to calculate the Chinese WER.
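WER is the edit (Levenshtein) distance between the reference and hypothesis unit sequences, divided by the reference length; a minimal sketch, here over toneless Pinyin syllables:

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over units / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i units of r into the first j units of h
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all i units
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all j units
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# One wrong syllable out of four -> 25% WER.
print(wer("da kai dian deng", "da kai dian shan"))  # -> 0.25
```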
Speech Commands Performance Test
The Response Accuracy Rate (RAR) results on the Espressif speech commands dataset are shown below.
Model Type | Distance | Quiet | Stationary Noise (SNR = 5~10 dB) | Speech Noise (SNR = 5~10 dB) |
---|---|---|---|---|
MultiNet5_cn | 3m | 88.9% | 66.1% | 67.5% |
MultiNet6_cn | 3m | 98.8% | 88.3% | 88.0% |
MultiNet5_en | 3m | 95.4% | 85.9% | 82.7% |
MultiNet6_en | 3m | 96.8% | 87.9% | 85.5% |
Please refer to the benchmark documentation for the latest results.
How to use MultiNet6
Please refer to the speech_command_recognition documentation for details.
Build System
- Support installing ESP-SR from the IDF component manager
- Remove the bundled ESP-DSP static library; install it from the IDF component manager instead
- Remove all Make-based build scripts
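With component-manager support, ESP-SR can be declared in a project's `idf_component.yml` manifest; a minimal sketch (the version range shown is illustrative):

```yaml
# main/idf_component.yml -- pulls ESP-SR from the IDF Component Registry.
# The version spec "^1.2.0" is an example; pin whatever release you target.
dependencies:
  espressif/esp-sr: "^1.2.0"
```

Running the normal `idf.py build` then fetches the component automatically, so neither ESP-SR nor ESP-DSP needs to be vendored into the project tree.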