From 763eec4f25f1a6917292561e325768ac4ad3e7e2 Mon Sep 17 00:00:00 2001
From: KoljaB
Date: Sun, 3 Nov 2024 12:43:12 +0100
Subject: [PATCH] upgrades to interface and stability

---
 RealtimeSTT_server/README.md | 269 +++++++++++++++++++++++++++++++----
 1 file changed, 239 insertions(+), 30 deletions(-)

diff --git a/RealtimeSTT_server/README.md b/RealtimeSTT_server/README.md
index b52dce9..e246e58 100644
--- a/RealtimeSTT_server/README.md
+++ b/RealtimeSTT_server/README.md
@@ -54,50 +54,219 @@ The server will initialize and begin listening for WebSocket connections on the
 
 You can configure the server using the following command-line arguments:
 
-- `--model` (str, default: `'medium.en'`): Path to the STT model or model size. Options include `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large-v1`, `large-v2`, or any Hugging Face CTranslate2 STT model like `deepdml/faster-whisper-large-v3-turbo-ct2`.
+### Available Parameters:
 
-- `--realtime_model_type` (str, default: `'tiny.en'`): Model size for real-time transcription. Same options as `--model`.
+#### `-m`, `--model`
 
-- `--language` (str, default: `'en'`): Language code for the STT model. Leave empty for auto-detection.
+- **Type**: `str`
+- **Default**: `'large-v2'`
+- **Description**: Path to the Speech-to-Text (STT) model or specify a model size. Options include: `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large-v1`, `large-v2`, or any HuggingFace CTranslate2 STT model such as `deepdml/faster-whisper-large-v3-turbo-ct2`.
 
-- `--input_device_index` (int, default: `1`): Index of the audio input device to use.
+#### `-r`, `--rt-model`, `--realtime_model_type`
 
-- `--silero_sensitivity` (float, default: `0.05`): Sensitivity for Silero VAD. Lower values are less sensitive.
+- **Type**: `str`
+- **Default**: `'tiny.en'`
+- **Description**: Model size for real-time transcription. Options are the same as for `--model`. This is used only if real-time transcription is enabled (`--enable_realtime_transcription`).
 
-- `--webrtc_sensitivity` (float, default: `3`): Sensitivity for WebRTC VAD. Higher values are less sensitive.
+#### `-l`, `--lang`, `--language`
 
-- `--min_length_of_recording` (float, default: `1.1`): Minimum duration (in seconds) for a valid recording.
+- **Type**: `str`
+- **Default**: `'en'`
+- **Description**: Language code for the STT model to transcribe in a specific language. Leave this empty for auto-detection based on input audio. Default is `'en'`. [List of supported language codes](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L11-L110).
 
-- `--min_gap_between_recordings` (float, default: `0`): Minimum time (in seconds) between consecutive recordings.
+#### `-i`, `--input-device`, `--input_device_index`
 
-- `--enable_realtime_transcription` (flag, default: `True`): Enable real-time transcription of audio.
+- **Type**: `int`
+- **Default**: `1`
+- **Description**: Index of the audio input device to use. Use this option to specify a particular microphone or audio input device based on your system.
 
-- `--realtime_processing_pause` (float, default: `0.02`): Time interval (in seconds) between processing audio chunks for real-time transcription.
+#### `-c`, `--control`, `--control_port`
 
-- `--silero_deactivity_detection` (flag, default: `True`): Use Silero model for end-of-speech detection.
+- **Type**: `int`
+- **Default**: `8011`
+- **Description**: The port number used for the control WebSocket connection. Control connections are used to send and receive commands to the server.
 
-- `--early_transcription_on_silence` (float, default: `0.2`): Start transcription after specified seconds of silence.
+#### `-d`, `--data`, `--data_port`
 
-- `--beam_size` (int, default: `5`): Beam size for the main transcription model.
+- **Type**: `int`
+- **Default**: `8012`
+- **Description**: The port number used for the data WebSocket connection. Data connections are used to send audio data and receive transcription updates in real time.
 
-- `--beam_size_realtime` (int, default: `3`): Beam size for the real-time transcription model.
+#### `-w`, `--wake_words`
 
-- `--initial_prompt` (str): Initial prompt for the transcription model to guide its output format and style.
+- **Type**: `str`
+- **Default**: `""` (empty string)
+- **Description**: Specify the wake word(s) that will trigger the server to start listening. For example, setting this to `"Jarvis"` will make the system start transcribing when it detects the wake word `"Jarvis"`.
 
-- `--end_of_sentence_detection_pause` (float, default: `0.45`): Duration of pause (in seconds) to consider as the end of a sentence.
+#### `-D`, `--debug`
 
-- `--unknown_sentence_detection_pause` (float, default: `0.7`): Duration of pause (in seconds) to consider as an unknown or incomplete sentence.
+- **Action**: `store_true`
+- **Description**: Enable debug logging for detailed server operations.
 
-- `--mid_sentence_detection_pause` (float, default: `2.0`): Duration of pause (in seconds) to consider as a mid-sentence break.
+#### `-W`, `--write`
 
-- `--control_port` (int, default: `8011`): Port for the control WebSocket connection.
+- **Metavar**: `FILE`
+- **Description**: Save received audio to a WAV file.
 
-- `--data_port` (int, default: `8012`): Port for the data WebSocket connection.
+#### `--silero_sensitivity`
+
+- **Type**: `float`
+- **Default**: `0.05`
+- **Description**: Sensitivity level for Silero Voice Activity Detection (VAD), with a range from `0` to `1`. Lower values make the model less sensitive, useful for noisy environments.
+
+#### `--silero_use_onnx`
+
+- **Action**: `store_true`
+- **Default**: `False`
+- **Description**: Enable the ONNX version of the Silero model for faster performance with lower resource usage.
+
+#### `--webrtc_sensitivity`
+
+- **Type**: `int`
+- **Default**: `3`
+- **Description**: Sensitivity level for WebRTC Voice Activity Detection (VAD), with a range from `0` to `3`. Higher values make the model less sensitive, useful for cleaner environments.
+
+#### `--min_length_of_recording`
+
+- **Type**: `float`
+- **Default**: `1.1`
+- **Description**: Minimum duration of valid recordings in seconds. This prevents very short recordings from being processed, which could be caused by noise or accidental sounds.
+
+#### `--min_gap_between_recordings`
+
+- **Type**: `float`
+- **Default**: `0`
+- **Description**: Minimum time (in seconds) between consecutive recordings. Setting this helps avoid overlapping recordings when there's a brief silence between them.
+
+#### `--enable_realtime_transcription`
+
+- **Action**: `store_true`
+- **Default**: `True`
+- **Description**: Enable continuous real-time transcription of audio as it is received. When enabled, transcriptions are sent in near real-time.
+
+#### `--realtime_processing_pause`
+
+- **Type**: `float`
+- **Default**: `0.02`
+- **Description**: Time interval (in seconds) between processing audio chunks for real-time transcription. Lower values increase responsiveness but may put more load on the CPU.
+
+#### `--silero_deactivity_detection`
+
+- **Action**: `store_true`
+- **Default**: `True`
+- **Description**: Use the Silero model for end-of-speech detection. This option can provide more robust silence detection in noisy environments, though it consumes more GPU resources.
+
+#### `--early_transcription_on_silence`
+
+- **Type**: `float`
+- **Default**: `0.2`
+- **Description**: Start transcription after the specified seconds of silence. This is useful when you want to trigger transcription mid-speech when there is a brief pause. Should be lower than `post_speech_silence_duration`. Set to `0` to disable.
+
+#### `--beam_size`
+
+- **Type**: `int`
+- **Default**: `5`
+- **Description**: Beam size for the main transcription model. Larger values may improve transcription accuracy but increase the processing time.
+
+#### `--beam_size_realtime`
+
+- **Type**: `int`
+- **Default**: `3`
+- **Description**: Beam size for the real-time transcription model. A smaller beam size allows for faster real-time processing but may reduce accuracy.
+
+#### `--initial_prompt`
+
+- **Type**: `str`
+- **Default**:
+
+  ```
+  End incomplete sentences with ellipses. Examples:
+  Complete: The sky is blue.
+  Incomplete: When the sky...
+  Complete: She walked home.
+  Incomplete: Because he...
+  ```
+
+- **Description**: Initial prompt that guides the transcription model to produce transcriptions in a particular style or format. The default provides instructions for handling sentence completions and ellipsis usage.
+
+#### `--end_of_sentence_detection_pause`
+
+- **Type**: `float`
+- **Default**: `0.45`
+- **Description**: The duration of silence (in seconds) that the model should interpret as the end of a sentence. This helps the system detect when to finalize the transcription of a sentence.
+
+#### `--unknown_sentence_detection_pause`
+
+- **Type**: `float`
+- **Default**: `0.7`
+- **Description**: The duration of pause (in seconds) that the model should interpret as an incomplete or unknown sentence. This is useful for identifying when a sentence is trailing off or unfinished.
+
+#### `--mid_sentence_detection_pause`
+
+- **Type**: `float`
+- **Default**: `2.0`
+- **Description**: The duration of pause (in seconds) that the model should interpret as a mid-sentence break. Longer pauses can indicate a pause in speech but not necessarily the end of a sentence.
+
+#### `--wake_words_sensitivity`
+
+- **Type**: `float`
+- **Default**: `0.5`
+- **Description**: Sensitivity level for wake word detection, with a range from `0` (most sensitive) to `1` (least sensitive). Adjust this value based on your environment to ensure reliable wake word detection.
+
+#### `--wake_word_timeout`
+
+- **Type**: `float`
+- **Default**: `5.0`
+- **Description**: Maximum time in seconds that the system will wait for a wake word before timing out. After this timeout, the system stops listening for wake words until reactivated.
+
+#### `--wake_word_activation_delay`
+
+- **Type**: `float`
+- **Default**: `20`
+- **Description**: The delay in seconds before the wake word detection is activated after the system starts listening. This prevents false positives during the start of a session.
+
+#### `--wakeword_backend`
+
+- **Type**: `str`
+- **Default**: `'none'`
+- **Description**: The backend used for wake word detection. You can specify different backends such as `"default"` or any custom implementations depending on your setup.
+
+#### `--openwakeword_model_paths`
+
+- **Type**: `str` (accepts multiple values)
+- **Description**: A list of file paths to OpenWakeWord models. This is useful if you are using OpenWakeWord for wake word detection and need to specify custom models.
+
+#### `--openwakeword_inference_framework`
+
+- **Type**: `str`
+- **Default**: `'tensorflow'`
+- **Description**: The inference framework to use for OpenWakeWord models. Supported frameworks could include `"tensorflow"`, `"pytorch"`, etc.
+
+#### `--wake_word_buffer_duration`
+
+- **Type**: `float`
+- **Default**: `1.0`
+- **Description**: Duration of the buffer in seconds for wake word detection. This sets how long the system will store the audio before and after detecting the wake word.
+
+#### `--use_main_model_for_realtime`
+
+- **Action**: `store_true`
+- **Description**: Enable this option if you want to use the main model for real-time transcription, instead of the smaller, faster real-time model. Using the main model may provide better accuracy but at the cost of higher processing time.
+
+#### `--use_extended_logging`
+
+- **Action**: `store_true`
+- **Description**: Writes extensive log messages for the recording worker that processes the audio chunks.
+
+#### `--logchunks`
+
+- **Action**: `store_true`
+- **Description**: Enable logging of incoming audio chunks (periods).
 
 **Example:**
 
 ```bash
-stt-server --model small.en --language en --control_port 9001 --data_port 9002
+stt-server -m small.en -l en -c 9001 -d 9002
 ```
 
 ## Client Usage
@@ -112,26 +281,66 @@ stt [OPTIONS]
 
 The client connects to the STT server's control and data WebSocket URLs to facilitate real-time speech transcription and control.
 
-### Client Parameters
+### Available Parameters for STT Client:
+
+#### `-c`, `--control`, `--control_url`
+
+- **Type**: `str`
+- **Default**: `DEFAULT_CONTROL_URL`
+- **Description**: Specifies the STT control WebSocket URL used for sending and receiving commands to/from the STT server.
+
+#### `-d`, `--data`, `--data_url`
+
+- **Type**: `str`
+- **Default**: `DEFAULT_DATA_URL`
+- **Description**: Specifies the STT data WebSocket URL used for transmitting audio data and receiving transcription updates.
+
+#### `-D`, `--debug`
+
+- **Action**: `store_true`
+- **Description**: Enables debug mode, providing detailed output for server-client interactions.
+
+#### `-n`, `--norealtime`
+
+- **Action**: `store_true`
+- **Description**: Disables real-time output, preventing transcription updates from being shown live as they are processed.
+
+#### `-W`, `--write`
+
+- **Metavar**: `FILE`
+- **Description**: Saves recorded audio to a specified WAV file for later playback or analysis.
+
+#### `-s`, `--set`
 
-- `--control-url` (default: `ws://localhost:8011`): The WebSocket URL for server control commands.
+- **Type**: `list`
+- **Metavar**: `('PARAM', 'VALUE')`
+- **Action**: `append`
+- **Description**: Sets a parameter for the recorder. Can be used multiple times to set different parameters. Each occurrence must be followed by the parameter name and value.
 
-- `--data-url` (default: `ws://localhost:8012`): The WebSocket URL for sending audio data and receiving transcription updates.
+#### `-m`, `--method`
 
-- `--debug`: Enable debug mode, which prints detailed logs to `stderr`.
+- **Type**: `list`
+- **Metavar**: `METHOD`
+- **Action**: `append`
+- **Description**: Calls a specified method on the recorder with optional arguments. Multiple methods can be invoked by repeating this parameter.
 
-- `--nort` or `--norealtime`: Disable real-time output of transcription results.
+#### `-g`, `--get`
 
-- `--set-param PARAM VALUE`: Set a recorder parameter (e.g., `silero_sensitivity`, `beam_size`). This option can be used multiple times.
+- **Type**: `list`
+- **Metavar**: `PARAM`
+- **Action**: `append`
+- **Description**: Retrieves the value of a specified recorder parameter. Can be used multiple times to get multiple parameter values.
 
-- `--get-param PARAM`: Retrieve the value of a specific recorder parameter. Can be used multiple times.
+#### `-l`, `--loop`
 
-- `--call-method METHOD [ARGS]`: Call a method on the recorder with optional arguments. Can be used multiple times.
+- **Action**: `store_true`
+- **Description**: Runs the client in a loop, allowing it to continuously transcribe speech without exiting after each session.
 
 **Example:**
 
 ```bash
-stt --set-param silero_sensitivity 0.1 --get-param silero_sensitivity
+stt -s silero_sensitivity 0.1
+stt -g silero_sensitivity
 ```
 
 ## WebSocket Interface