No Transcription Recording only with VAD #49

shahzaib78631 · 2024-11-07T03:04:39Z

Hello Everyone 🤗.
I have been working on an app where I need to implement VAD (Voice Activity Detection): when the user starts speaking 🗣️, I want to record the audio and send it to the server. I found this package, which can transcribe when the user starts speaking, but instead of the transcript I want only the audio. I have also read about the recording property in the package, but it supports only Android 13+. Is there any way to get only the audio with VAD on lower Android versions too?

jamsch · 2024-11-07T04:34:21Z

Hey @shahzaib78631, unfortunately the Android 12 and lower SpeechRecognition APIs don't allow us to use a custom microphone source for the recognition. I may be able to use a workaround to at least get the underlying recognized buffers, however even if I do implement this, Android 12 and lower don't support continuous speech recognition which is probably a requirement here if you're building a communications app.

This library is probably not the best fit for VAD either if you're not interested in the voice transcripts (due to the tight integration with the underlying APIs). I'd consider looking into the following libraries instead:

Each of these libraries gives you a way to access real-time frame chunks, which you can use to check whether there's voice activity. Implementing a gain filter could be appropriate for your use case. Otherwise, you may want to use a model to process those frames.
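
As a rough illustration of the gain-filter idea, the sketch below treats it as a simple energy threshold: compute the RMS level of a frame and compare it to a cutoff. It assumes frames arrive as 16-bit mono PCM in an `Int16Array`; the function names `rmsLevel` and `isVoiceFrame` and the default threshold of `0.02` are hypothetical and would need tuning against real microphone input.

```typescript
// Minimal energy-threshold sketch. Assumption: each frame is 16-bit mono PCM.
// rmsLevel and isVoiceFrame are illustrative names, not part of any library.

/** Root-mean-square level of a frame, normalized to the range 0..1. */
function rmsLevel(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  frame.forEach((sample) => {
    const normalized = sample / 32768; // scale int16 to roughly -1..1
    sumSquares += normalized * normalized;
  });
  return Math.sqrt(sumSquares / frame.length);
}

/** Treats a frame as voiced when its RMS level reaches the threshold. */
function isVoiceFrame(frame: Int16Array, threshold: number = 0.02): boolean {
  return rmsLevel(frame) >= threshold;
}
```

A plain threshold like this reacts to any loud sound, not only speech, which is why a model-based detector may be the better choice in noisy environments.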

shahzaib78631 · 2024-11-07T05:22:23Z

Hey @jamsch,

Thank you for the detailed explanation and quick response. For now, if it's possible and not too time-consuming, could you share the workaround to get the underlying recognized buffers?

I appreciate your recommendations for alternative libraries and I'll definitely look into them.

Thank You.

jamsch · 2024-11-07T06:11:56Z

Hey @shahzaib78631, I'm not sure it's even worth implementing the audio capture workaround for Android 12 and lower, because those versions don't support continuous recognition, i.e. you'll have to manually start speech recognition again each time it stops (which will need to happen at least 10 times per minute).

I think for cases like VAD you shouldn't need a process as resource-intensive as speech recognition. Instead, you'd want something that can process each audio frame (using one of the libraries I mentioned above) and then either apply a gain filter to it (which is generally straightforward), send it to an API, or use an on-device model like Cobra: https://github.com/Picovoice/cobra
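
To make the frame-by-frame approach a bit more concrete, here is a minimal sketch of how voiced frames might be grouped into one recording that can then be sent to a server. Everything in it is an assumption for illustration: `VoiceSegmenter`, `concatFrames`, and the `hangoverFrames` parameter are hypothetical names, frames are assumed to be 16-bit mono PCM, and the per-frame classifier is passed in so it could be an energy threshold or a wrapper around an on-device model.

```typescript
// Sketch only: buffers frames from the first voiced frame until a run of
// silent frames, then returns the buffered audio as a single segment.
type FrameClassifier = (frame: Int16Array) => boolean;

/** Joins frames into one contiguous PCM buffer. */
function concatFrames(frames: Int16Array[]): Int16Array {
  const total = frames.reduce((n, f) => n + f.length, 0);
  const out = new Int16Array(total);
  let offset = 0;
  for (const f of frames) {
    out.set(f, offset);
    offset += f.length;
  }
  return out;
}

class VoiceSegmenter {
  private isVoice: FrameClassifier;
  private hangoverFrames: number;
  private frames: Int16Array[] = [];
  private silentRun = 0;
  private recording = false;

  constructor(isVoice: FrameClassifier, hangoverFrames: number = 10) {
    this.isVoice = isVoice;
    this.hangoverFrames = hangoverFrames;
  }

  /** Feed one frame; returns a finished segment when speech ends, else null. */
  push(frame: Int16Array): Int16Array | null {
    const voiced = this.isVoice(frame);
    if (!this.recording) {
      if (!voiced) return null; // still idle, nothing to record
      this.recording = true;
      this.frames = [];
      this.silentRun = 0;
    }
    this.frames.push(frame);
    this.silentRun = voiced ? 0 : this.silentRun + 1;
    if (this.silentRun < this.hangoverFrames) return null;

    // Enough consecutive silence: drop the trailing silent frames and emit.
    const kept = this.frames.slice(0, this.frames.length - this.silentRun);
    this.recording = false;
    this.frames = [];
    this.silentRun = 0;
    return concatFrames(kept);
  }
}
```

The hangover count keeps short pauses between words from splitting one utterance into several segments; a returned segment could then be encoded (for example as WAV) and uploaded.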

shahzaib78631 · 2024-11-07T06:39:41Z

Thank You bro
