Yes, this is definitely something we have discussed and would help improve things like YouTube subtitles.
AWS Transcribe already supports generating WebVTT and SRT subtitles. Of course, we would prefer to use the Whisper segments rather than the Transcribe output for the subtitle text. The challenge is that the timing granularity of Whisper output is not as fine as Transcribe's, so each timestamped segment can be quite long. That may not be desirable for many subtitle uses.
It may be possible to improve the merging algorithm to match the Transcribe timings to the Whisper output, but that does not look trivial; see the sketch below.
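To make that concrete: the genuinely hard part is aligning Transcribe's word sequence to Whisper's (differently tokenized, sometimes differently worded) text. Once word-level timings have been aligned, regrouping them into subtitle-sized cues is the easy half. A rough sketch of that second half, using hypothetical field names (`text`/`start`/`end` in seconds) rather than the solution's actual schema:

```python
def split_into_cues(words, max_chars=42):
    """Group word-timed items into subtitle-sized cues.

    words: [{"text": str, "start": float, "end": float}, ...]
    (hypothetical shape; assumes word timings already aligned)
    Returns cues with merged text and cue-level start/end timings.
    """
    cues, current = [], []
    for word in words:
        candidate = " ".join(w["text"] for w in current + [word])
        if current and len(candidate) > max_chars:
            # Flush the current cue once adding another word would
            # exceed the target line length.
            cues.append({
                "text": " ".join(w["text"] for w in current),
                "start": current[0]["start"],
                "end": current[-1]["end"],
            })
            current = []
        current.append(word)
    if current:
        cues.append({
            "text": " ".join(w["text"] for w in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })
    return cues
```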
Perhaps the simplest thing initially is to just generate VTT or SRT from the Whisper output. This could be done in a separate Lambda function using the merged transcript output (after the Process Transcripts state).
It could use the JSON object located at `processedTranscriptKey` as its input.
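A minimal sketch of such a Lambda, assuming the merged transcript is a JSON array of segments with `start`, `end` (seconds), and `text` fields, and assuming a hypothetical event shape carrying the bucket and key; the real schema and wiring would need to match the solution:

```python
import json
import boto3

s3 = boto3.client("s3")

def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 00:01:02,345."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"

def handler(event, context):
    # Hypothetical event shape -- the real input would come from the
    # state machine after the Process Transcripts state.
    bucket = event["bucket"]
    key = event["processedTranscriptKey"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    segments = json.loads(body)

    # Build numbered SRT blocks, one per merged transcript segment.
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    srt = "\n".join(blocks)

    out_key = key.rsplit(".", 1)[0] + ".srt"
    s3.put_object(Bucket=bucket, Key=out_key, Body=srt.encode("utf-8"))
    return {"srtKey": out_key}
```

WebVTT output would differ only in the `WEBVTT` header line and the use of `.` instead of `,` in timestamps, so the same function could drive both formats.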
If you are interested in contributing this feature, that would be very welcome, @nodomain. We are happy to review and support, of course.
Are there any plans to support other output formats such as WebVTT, plain text, or SRT?
I am still digging through the solution and thinking about adding a converter, but I am not sure about the correct approach.