[Feature Request] Real-Time Speech-to-Text with Whisper Model 🎙️ #14

Open
kadirnar opened this issue Nov 23, 2023 · 22 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed

Comments

@kadirnar
Owner

Implement real-time functionality for the Whisper model, enabling it to transcribe speech into text as the user speaks🎤

@kadirnar kadirnar added documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed labels Nov 23, 2023
@kadirnar kadirnar self-assigned this Nov 23, 2023
@trivikramak

Is there any progress on this?

@kadirnar
Owner Author

> Is there any progress on this?

I don't have enough time to develop this. That's why this feature is not currently being developed. I will add it later.

@trivikramak

trivikramak commented Jan 20, 2024

I'm trying to use tiny models for on-device (mobile) near-real-time speech-to-text.
Can you suggest some direction or pointers if we have to implement this?

@kadirnar
Owner Author

You can use the distil-whisper models.
Models: https://huggingface.co/distil-whisper


@trivikramak

trivikramak commented Jan 20, 2024

Thanks for the reply. From what I have read, I understood that the idea should be:

  1. chunking the audio at an arbitrarily chosen, small, fixed interval (say 3 s)
  2. padding each chunk with silence to make it a 30 s chunk
  3. sending it to the Whisper model for inference

Is there a better approach? It seems very inefficient to run inference on 30 s chunks for real-time streaming transcription. Am I missing something?
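
For reference, the three steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from this repository: the audio is assumed to be 16 kHz mono held as a plain list of float samples, and `transcribe` is a hypothetical placeholder for whatever Whisper inference call is actually used.

```python
from typing import Callable, Iterator, List

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono audio
WINDOW_SECONDS = 30    # Whisper's fixed input window
CHUNK_SECONDS = 3      # arbitrary streaming chunk size (step 1)


def split_into_chunks(samples: List[float],
                      chunk_seconds: int = CHUNK_SECONDS) -> Iterator[List[float]]:
    """Step 1: yield consecutive fixed-length chunks (the last may be shorter)."""
    step = chunk_seconds * SAMPLE_RATE
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


def pad_with_silence(chunk: List[float],
                     window_seconds: int = WINDOW_SECONDS) -> List[float]:
    """Step 2: append zeros (silence) until the chunk fills one full window."""
    target = window_seconds * SAMPLE_RATE
    return chunk[:target] + [0.0] * max(0, target - len(chunk))


def transcribe_stream(samples: List[float],
                      transcribe: Callable[[List[float]], str]) -> List[str]:
    """Step 3: pass each padded chunk to the (hypothetical) `transcribe` callable."""
    return [transcribe(pad_with_silence(chunk))
            for chunk in split_into_chunks(samples)]
```

In this sketch every 3 s chunk is expanded to a full 30 s window before inference, so the model does a window's worth of work per chunk, which is the inefficiency the question points at.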

@kadirnar
Owner Author

I don't understand. If you want to use the Whisper model in real time, you can look at this library:

https://github.com/davabase/whisper_real_time

@Nishant-Kumar-2002

Hi,
I have worked with Whisper models for real-time transcription, but the catch is that in an HLS stream the video or audio is generated with a buffer of 6 seconds. So we can use ffmpeg and threading to cut out each clip and then transcribe it.
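
A rough sketch of the ffmpeg-plus-threading idea described above, under several assumptions: the stream URL in the usage note is made up, `transcribe` is a hypothetical stand-in for the actual Whisper call, and the chunk paths are passed in directly rather than discovered as ffmpeg writes them. The ffmpeg options shown (`-vn`, `-ac`, `-ar`, `-f segment`, `-segment_time`) are standard ones.

```python
import queue
import threading
from typing import Callable, Iterable, List

SEGMENT_SECONDS = 6  # HLS buffer length mentioned above


def build_ffmpeg_command(stream_url: str,
                         out_pattern: str = "chunk_%05d.wav") -> List[str]:
    """Build an ffmpeg command that cuts the stream's audio into fixed-length WAV files."""
    return [
        "ffmpeg", "-i", stream_url,
        "-vn",                                   # drop video, keep audio only
        "-ac", "1",                              # mono
        "-ar", "16000",                          # 16 kHz, the rate Whisper expects
        "-f", "segment",                         # ffmpeg's segment muxer
        "-segment_time", str(SEGMENT_SECONDS),   # one file per 6 s of audio
        out_pattern,
    ]


def transcription_worker(chunks: queue.Queue,
                         transcribe: Callable[[str], str],
                         results: List[str]) -> None:
    """Consume chunk paths until a None sentinel arrives, transcribing each in order."""
    while True:
        path = chunks.get()
        if path is None:
            break
        results.append(transcribe(path))


def run_pipeline(chunk_paths: Iterable[str],
                 transcribe: Callable[[str], str]) -> List[str]:
    """Feed chunk paths to a background worker thread and collect the transcripts."""
    chunks: queue.Queue = queue.Queue()
    results: List[str] = []
    worker = threading.Thread(target=transcription_worker,
                              args=(chunks, transcribe, results))
    worker.start()
    for path in chunk_paths:   # in a real setup, paths arrive as ffmpeg writes files
        chunks.put(path)
    chunks.put(None)           # sentinel: no more chunks
    worker.join()
    return results
```

In practice the command from `build_ffmpeg_command("https://example.com/live.m3u8")` (a placeholder URL) would be started with `subprocess.Popen`, and a separate watcher would enqueue each finished chunk file for the worker.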

@kadirnar
Owner Author

> Hi, I have worked with Whisper models for real-time transcription, but the catch is that in an HLS stream the video or audio is generated with a buffer of 6 seconds. So we can use ffmpeg and threading to cut out each clip and then transcribe it.

Can you add the real-time feature?

@Nishant-Kumar-2002

Nishant-Kumar-2002 commented Jan 21, 2024

If we are working with a streaming service, it takes a buffer time of 6 seconds.
In real time, we can clip second by second.
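
One possible reading of "clip second by second" is a rolling buffer that receives one-second clips and exposes the most recent audio for each transcription pass. The sketch below is an assumption about how that could look, not the approach used in this project; the class name and the 30 s default are illustrative.

```python
from collections import deque
from typing import Deque, List

SAMPLE_RATE = 16_000  # assumed 16 kHz mono input


class RollingAudioBuffer:
    """Keep only the most recent `max_seconds` one-second clips."""

    def __init__(self, max_seconds: int = 30) -> None:
        # deque with maxlen silently drops the oldest clip once full
        self._clips: Deque[List[float]] = deque(maxlen=max_seconds)

    def push(self, one_second_clip: List[float]) -> None:
        """Add a clip that must contain exactly one second of audio."""
        if len(one_second_clip) != SAMPLE_RATE:
            raise ValueError("expected exactly one second of 16 kHz audio")
        self._clips.append(one_second_clip)

    def window(self) -> List[float]:
        """Concatenate buffered clips, oldest first, for the next transcription pass."""
        return [sample for clip in self._clips for sample in clip]
```

Each time a new second arrives, the caller would push it and then transcribe `window()`, trading repeated work on overlapping audio for lower latency.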

@kadirnar
Owner Author

> If we are working with a streaming service, it takes a buffer time of 6 seconds. In real time, we can clip second by second.

I will research this issue.

@kadirnar
Owner Author

kadirnar commented Jan 24, 2024

Hi @Nishant-Kumar-2002, can you review this code? This feature adds subtitles to the video.

@Nishant-Kumar-2002

Ok will check that.

@Nishant-Kumar-2002

Code looks good to me.

@MilanaShhanukova

May I ask if the main idea is to implement real-time whisper to transcribe speech through the microphone or transcribe audio files in real-time to a file, so that we do not have to wait until the end of the audio?

@kadirnar
Owner Author

kadirnar commented Feb 3, 2024

> May I ask if the main idea is to implement real-time whisper to transcribe speech through the microphone or transcribe audio files in real-time to a file, so that we do not have to wait until the end of the audio?

I want to do the first thing you said.

@fraschm1998

fraschm1998 commented Apr 12, 2024

Any update on this? Would love real-time transcription of speech through a mic

@Nishant-Kumar-2002

Nishant-Kumar-2002 commented Apr 12, 2024

> Any update on this? Would love real-time transcription of speech through a mic @kadirnar

I would like to add this new feature.

@kadirnar
Owner Author

@Nishant-Kumar-2002 Wonderful news 👍🏻 I'm waiting for the pull request.

@kadirnar
Owner Author

kadirnar commented May 2, 2024

I started coding. I will add this support over the weekend.

@fraschm1998

> I started coding. I will add this support over the weekend.

Awesome, looking forward to this! Thanks for your amazing work!

@SeeknnDestroy

thanks for the awesome work @kadirnar! any eta on this?

@kadirnar
Owner Author

> thanks for the awesome work @kadirnar! any eta on this?

There are a few problems with real-time. It may take a while to figure it out. I'm developing for Autopipeline.


6 participants