JSI function for transcribe audio buffer #52

Open · jhen0409 opened this issue Jun 7, 2023 · 3 comments

Labels: enhancement (New feature or request)
@jhen0409 (Member) commented Jun 7, 2023

Provide a JSI function for transcribing an audio buffer, so we can use a library like react-native-audio-pcm-stream or another source, and manage recorded audio samples in JS without writing platform-specific code.

Compared to the native bridge, JSI can convert buffers from JS with high performance.
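For illustration, consumption from JS might look like the sketch below. `LiveAudioStream` is a hypothetical stand-in for a PCM stream module such as react-native-audio-pcm-stream (its API here is assumed, not that library's documented interface), and the `transcribeData` call follows the method discussed later in this thread; its exact signature and return shape are assumptions.

```ts
// Sketch only: `LiveAudioStream` is a hypothetical stand-in for a PCM stream
// module (e.g. react-native-audio-pcm-stream); its API here is assumed.
import { Buffer } from 'buffer'
import { initWhisper } from 'whisper.rn'

declare const LiveAudioStream: {
  on(event: 'data', cb: (base64Chunk: string) => void): void
  start(): void
  stop(): void
}

async function transcribeRecording(modelPath: string) {
  const context = await initWhisper({ filePath: modelPath })
  const pcmChunks: Buffer[] = []

  LiveAudioStream.on('data', (base64Chunk) => {
    // Decode each chunk to raw bytes before concatenating; joining the
    // base64 strings directly would corrupt data if a chunk carries padding.
    pcmChunks.push(Buffer.from(base64Chunk, 'base64'))
  })
  LiveAudioStream.start()

  // ...record for a while, then:
  LiveAudioStream.stop()
  const pcm = Buffer.concat(pcmChunks) // raw 16-bit PCM samples

  // Assumed shape: transcribeData taking base64-encoded PCM, as discussed below.
  return context.transcribeData(pcm.toString('base64'))
}
```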

@simonwh commented Jul 3, 2024

Any progress on this one? :)

@deeeed commented Dec 10, 2024

Hi @jhen0409,

After implementing non-live transcription in my audio playground, I'd like to discuss approaches for live transcription integration between expo-audio-stream and whisper.rn. I see two main paths:

1. JavaScript-Level Integration (Currently Implemented)

   Approach:
   • Use expo-audio-stream's base64 PCM data stream
   • Interface with whisper.rn's transcribeData API
   • Manage buffering and transcription state in JavaScript

   Pros:
   • Simpler to implement initially
   • More flexible for different use cases
   • Platform-agnostic implementation

   Cons:
   • Multiple base64 conversions
   • Higher memory usage
   • JavaScript bridge overhead

2. Native-Level Integration (Proposed)

   Approach:
   • Handle PCM data directly at the native layer
   • Add new native methods in whisper.rn for streaming PCM
   • Implement efficient buffer management between expo-audio-stream and whisper.rn

   Implementation suggestion:
   • Add a startRealtimeTranscribeWithAudioInput method to handle streaming setup
   • Implement receiveAudioDataChunk for direct PCM data processing
   • Connect directly to whisper.cpp's audio processing pipeline
   • Use circular buffers for efficient memory management (see the ring-buffer sketch after this list)
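
To make the circular-buffer point concrete, here is a minimal ring-buffer sketch for 16-bit PCM samples in TypeScript; the class and its names are illustrative only and not part of either library.

```ts
// Minimal ring buffer for 16-bit PCM samples: appends overwrite the oldest
// data once full, so memory stays bounded during a long recording session.
class PcmRingBuffer {
  private buf: Int16Array
  private writePos = 0
  private length = 0

  constructor(capacitySamples: number) {
    this.buf = new Int16Array(capacitySamples)
  }

  /** Append samples, overwriting the oldest data when full. */
  push(samples: Int16Array): void {
    for (const s of samples) {
      this.buf[this.writePos] = s
      this.writePos = (this.writePos + 1) % this.buf.length
      if (this.length < this.buf.length) this.length++
    }
  }

  /** Copy out the buffered samples in chronological order. */
  snapshot(): Int16Array {
    const out = new Int16Array(this.length)
    const start = (this.writePos - this.length + this.buf.length) % this.buf.length
    for (let i = 0; i < this.length; i++) {
      out[i] = this.buf[(start + i) % this.buf.length]
    }
    return out
  }
}

// e.g. keep the last 30 s of 16 kHz mono audio for a sliding-window transcribe:
const ring = new PcmRingBuffer(16000 * 30)
```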

Questions:

  1. Would you be open to adding these new streaming-focused methods to whisper.rn? This would allow direct PCM handling without base64 conversion overhead.
  2. Should we extend the existing realtime API or create a new pathway specifically for external audio sources?

I believe the native integration path would provide better performance for real-time use cases, but I'd appreciate your thoughts on this approach. Happy to contribute PRs once we align on the best path forward.

Looking forward to your feedback!

@jhen0409 (Member, Author) commented

I'd like to extend the current transcribeRealtime implementation to support other audio sources; it might look like this:

```ts
transcribeRealtime({
  /** [NEW option] Choose audio source ('custom': push data yourself from the JS or native side) */
  source: 'built-in' | 'custom',
  // ...
}): Promise<{
  /** Stop the realtime transcribe */
  stop: () => Promise<void>
  /** Subscribe to realtime transcribe events */
  subscribe: (callback: (event: TranscribeRealtimeEvent) => void) => void
  /** [NEW method] Put audio buffer (Buffer or base64 encoded string) for `custom` source */
  pushAudioDataChunk: (data: Buffer | string) => void
}>
```
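
Wiring a custom source against this proposed shape could then look like the sketch below. None of it is implemented yet; `whisperContext` is assumed to be a context from initWhisper, and `LiveAudioStream` is the same hypothetical external PCM source as in the earlier sketch.

```ts
// Sketch against the proposed API above; none of this is implemented yet.
declare const whisperContext: any
declare const LiveAudioStream: {
  on(event: 'data', cb: (base64Chunk: string) => void): void
}

const { stop, subscribe, pushAudioDataChunk } = await whisperContext.transcribeRealtime({
  source: 'custom',
  language: 'en',
})

subscribe((event: any) => {
  // Same TranscribeRealtimeEvent shape as the existing realtime API
  console.log(event.data?.result)
})

LiveAudioStream.on('data', (base64Chunk) => {
  // Feed external PCM into the running realtime job (Buffer or base64, per the proposal)
  pushAudioDataChunk(base64Chunk)
})

// ...later:
await stop()
```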

The pushAudioDataChunk native method will be implemented directly in C++/JSI; it may use the same approach as react-native-blob-jsi-helper. It would be better if we could move the context pool & jobs into JSI, so we don't have to use the Blob module and can avoid JNI costs on Android, but that would probably be a big refactor.

Also, we can expose a static method for pushing audio data to a realtime-transcription job on the native side, so that it can be used by a custom audio-stream native module.

For the transcribeData method, we will also support ArrayBuffer using the same approach as blob-jsi-helper; this is the main purpose of this issue.
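
As a rough picture of that pattern: react-native-blob-jsi-helper exposes getArrayBufferForBlob for a synchronous JSI copy of a Blob's bytes, and the ArrayBuffer overload of transcribeData sketched here is the proposed addition, not current behavior.

```ts
import { getArrayBufferForBlob } from 'react-native-blob-jsi-helper'
import type { WhisperContext } from 'whisper.rn'

// Proposed usage sketch: transcribeData accepting an ArrayBuffer directly,
// skipping the base64 round-trip. `blob` could come from fetch() or a recorder.
async function transcribeBlob(context: WhisperContext, blob: Blob) {
  const bytes = getArrayBufferForBlob(blob) // Uint8Array via JSI, no bridge copy
  return context.transcribeData(bytes.buffer) // ArrayBuffer overload (proposed)
}
```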
