Releases: espressif/esp-sr

ESP-SR Release V1.6

12 Dec 08:03

We are delighted to announce the release of our latest models, MultiNet7 and nsnet1, as well as more wake word models trained on TTS samples.

1. MultiNet7: Speech commands recognition model

We are proud to introduce our new MultiNet7 model. This new model is optimized for efficiency, using less memory and reducing compute time while maintaining high accuracy. You can upgrade smoothly from MultiNet6 to MultiNet7.

2. nsnet1 - The first deep noise suppression model

We are also introducing nsnet1, our first deep noise suppression model. This model is designed to enhance speech quality in noisy environments, making it perfect for real-world applications like telephony systems.

nsnet1 uses a deep learning approach to suppress background noise while preserving the original speech signal. It is trained on a large dataset to learn the patterns of noise and effectively cancel them out without distorting the speech.

This model is available for the ESP32-S3 chip. You can enable it by setting `afe_config.afe_ns_mode = NS_MODE_NET;`. Please refer to esp-skainet/examples/voice_communication for more details.

Note: currently only AFE_VC supports nsnet1; AFE_SR does not.
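A minimal sketch of enabling nsnet1 in a voice-communication AFE pipeline is shown below. It follows the esp-sr AFE interface (`AFE_CONFIG_DEFAULT`, `ESP_AFE_VC_HANDLE`, `create_from_config`) as used in the esp-skainet examples; the surrounding function and its name are illustrative, so please check esp-skainet/examples/voice_communication for the exact, current API:

```c
#include "esp_afe_sr_iface.h"
#include "esp_afe_sr_models.h"

// Hypothetical setup function; the AFE calls follow the esp-sr interface.
void start_afe_vc(void)
{
    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;  // AFE_VC, not AFE_SR
    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    afe_config.afe_ns_mode = NS_MODE_NET;                 // select the nsnet1 model

    esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);
    // Feed input frames with afe_handle->feed(afe_data, ...) and read the
    // denoised audio with afe_handle->fetch(afe_data), as in the example project.
    (void)afe_data;
}
```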

3. Wake Word Models Trained by TTS

We have expanded our wake word models trained on TTS (Text-to-Speech) samples to include more wake words for our users. By combining TTS and LLM methods, the TTS model can be trained on a large amount of unlabeled audio data through self-supervised learning. The zero-shot performance of the TTS model is significantly improved, which allows us to clone voices from short audio clips (less than 10 seconds). Now we can train a wake word model using TTS samples alone.

These wake word models are designed to recognize specific keywords or phrases that trigger an action or response from your device or application.

ESP-SR Release V1.2

17 Mar 03:49
Documentation for this ESP-SR release is available at https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/index.html

We're excited to announce the release of ESP-SR v1.2.0, an advanced version of Espressif's Speech Recognition Library. This update brings five significant enhancements over the previous version and is designed specifically for the ESP32-S3 microcontroller platform.

  • Improved accuracy: The speech recognition accuracy has been substantially increased, ensuring a more reliable and efficient user experience.
  • Support for custom keywords: You can now easily add your own keywords for customized voice command recognition.
  • Enhanced noise reduction: Our improved noise reduction algorithms help filter out unwanted background noise, enabling the system to perform better in a variety of environments.
  • Lower memory usage: Optimizations in the library have resulted in reduced memory usage, ensuring efficient performance even on resource-constrained devices.
  • Expanded documentation: The updated library comes with comprehensive documentation, making it easier for developers to integrate the library and understand its functionalities.

With ESP-SR v1.2.0, you can develop voice-controlled applications on Espressif microcontroller platforms with increased accuracy and efficiency. Download the latest release and explore the full potential of our Speech Recognition Library.

SoCs supported

  • ESP32
  • ESP32-S3
  • ESP32-S2, ESP32-C3 (only supports Chinese TTS)

MultiNet6

Methods

Training Framework

Model

MultiNet6 is based on the Emformer architecture and has about 3.5 M parameters. Its CPU and memory consumption are shown below (tested on ESP32-S3):

| Model | Parameters | CPU (one core) | PSRAM | SRAM |
|---|---|---|---|---|
| MultiNet5 | 2.2 M | 44% | 2310 KB | 16 KB |
| MultiNet6 | 3.5 M | 40% | 4000 KB | 48 KB |

Compared to MultiNet5, MultiNet6 has more parameters but less computation, mainly because the encoder of MultiNet6 has a 4x downsampling factor.
As with MultiNet5, the weights of MultiNet6 are quantized to 8 bits.

Decoding

MultiNet6 uses a Finite-State Transducer (FST) to build the language model, instead of the Trie used in MultiNet5. FST-based beam search is more efficient and robust. Different units are used for each language:

  • For English, MultiNet6 uses subword units in place of the phonemes used in MultiNet5. The subword units are trained with SentencePiece; please refer to mn6_en.vocab for the subword units.
  • For Chinese, MultiNet6 uses Pinyin syllables without tone; please refer to mn6_cn.vocab for the Pinyin syllables.

Performance

Word Error Rate (WER) Performance Test

The WER results on some popular open-source datasets:

| Model | aishell test |
|---|---|
| MultiNet5_cn | 9.5% |
| MultiNet6_cn | 5.2% |

| Model | librispeech test-clean | librispeech test-other |
|---|---|---|
| MultiNet5_en | 16.5% | 41.4% |
| MultiNet6_en | 9.0% | 21.3% |

Note: Pinyin syllables without tone are used to calculate the Chinese WER.

Speech Commands Performance Test

The Response Accuracy Rate (RAR) results on the Espressif speech commands dataset:

| Model | Distance | Quiet | Stationary Noise (SNR = 5~10 dB) | Speech Noise (SNR = 5~10 dB) |
|---|---|---|---|---|
| MultiNet5_cn | 3 m | 88.9% | 66.1% | 67.5% |
| MultiNet6_cn | 3 m | 98.8% | 88.3% | 88.0% |
| MultiNet5_en | 3 m | 95.4% | 85.9% | 82.7% |
| MultiNet6_en | 3 m | 96.8% | 87.9% | 85.5% |

Please refer to benchmark documentation for the latest results.

How to use MultiNet6

Please refer to documentation of speech_command_recognition for details.

Build System

  • Support to install ESP-SR from IDF component manager
  • Remove ESP-DSP static library and install it from IDF component manager
  • Remove all make compilation scripts

ESP-SR Release V1.0

17 Dec 03:37

ESP-SR Release V1.0 is the first release of ESP-SR. It includes the following features and performance improvements:

SoCs supported

  • ESP32
  • ESP32-S2 (only supports TTS)
  • ESP32-S3

Wake Word Engine

Speech Command Recognition

  • Optimized CPU load and memory usage
  • Support Chinese speech command recognition model: MultiNet 4.5
  • Support English speech command recognition model: MultiNet 5
  • Support storing model data in a SPIFFS partition
  • Support threshold setting and acquisition for each phrase
  • Support resetting speech commands in application code and custom speech command words

Audio Front End

  • Support AEC / BSS (NS on ESP32) / VAD / WakeNet
  • Support both single-mic and dual-mic scenarios
  • Support personalized AFE configuration
  • Support bypassing all functions
  • Support more flexible memory configuration

Build System

  • Support CMake on ESP32/ESP32-S2/ESP32-S3
  • Support Make on ESP32