Two outputs from gemini-2.0-flash #379
Comments
Audio output is only available to a select few early-access customers. At the moment you can only use the Live API, which outputs audio only.
I second this feature request! I suspect the reason it currently outputs only one or the other is that there probably isn't an intermediate text output head on the multimodal LLM, and there isn't an intermediate text representation when generating audio output. If that is the case and we can't expect matched text and audio output in the future, please let us know, since nowadays one can easily wire up a Whisper-type speech-to-text model to get the text from the audio, but of course this adds overhead.
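For anyone wanting to try that workaround, here is a minimal, hypothetical sketch of the glue step. The Live API's audio output is raw 16-bit mono PCM (reportedly at 24 kHz; treat the sample rate as an assumption), so before handing it to a speech-to-text model such as Whisper you typically need to wrap it in a WAV container. The helper name `pcm_to_wav` is illustrative, not part of any SDK; only the Python standard library is used.

```python
import io
import wave

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM (as the Live API is said to return) in a
    WAV container so it can be fed to a speech-to-text model like Whisper.

    Note: the 24 kHz mono 16-bit format is an assumption here; check the
    current API docs for the actual output format.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # assumed Live API output rate
        wav.writeframes(pcm_bytes)
    return buf.getvalue()

# Example: wrap 0.1 s of silence (2400 frames at 24 kHz, 2 bytes each).
wav_bytes = pcm_to_wav(b"\x00\x00" * 2400)
```

The resulting `wav_bytes` could then be written to a file and passed to a local Whisper model (or any transcription API) to recover the text, at the cost of an extra transcription step and some added latency.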
I'll route the feature request to the product team. I also agree that it would be great to get both, but I'm not sure about the feasibility either, considering the model natively outputs audio.
Interesting... In December, I could only get audio or text from the Live API (Vertex AI mode), but not both, with no error when requesting both. From last week up until yesterday, I was getting both text and audio from the Live API by specifying both modalities. But today I ran the same code and got error 1007 (invalid frame payload data) and "generic::invalid_argument: Only one of text or audio output is allowed." It's fun to experiment with a rapidly changing tool! Here's one vote for letting us continue to get both audio and text. Otherwise, why would response_modalities be a list? 😃 And it was working fine yesterday, after all... 😸
We're also looking to use the Live API to return both audio and text in the response, and have not had luck specifying both in response_modalities=['AUDIO', 'TEXT'].
Description of the feature request:
Hello everyone!
I want gemini-2.0-flash to output both text and audio, but when I try to add TEXT to response_modalities I get this error:
[ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language; then sent 1007 (invalid frame payload data) Request trace id: 4ad28f357e6c292e, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language
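As a workaround until both modalities are supported together, requesting a single modality avoids the error. Below is a hedged sketch: the `validate_config` helper is purely illustrative (not part of the SDK) and just encodes the server's current one-modality rule; the commented-out connection code assumes the google-genai Python SDK and a valid API key, so it is not executed here.

```python
# Illustrative helper: catch the "only one modality" problem client-side
# before opening a Live API session. This mirrors the server error
# "Only one of text or audio output is allowed." but is NOT an official
# SDK function.
ALLOWED_MODALITIES = {"AUDIO", "TEXT"}

def validate_config(config: dict) -> dict:
    modalities = config.get("response_modalities", [])
    unknown = set(modalities) - ALLOWED_MODALITIES
    if unknown:
        raise ValueError(f"Unknown modalities: {sorted(unknown)}")
    if len(modalities) != 1:
        raise ValueError(
            "The Live API currently allows exactly one response modality; "
            "pass ['AUDIO'] or ['TEXT'], not both."
        )
    return config

config = validate_config({"response_modalities": ["AUDIO"]})

# The actual session would then look roughly like this (requires the
# google-genai SDK, an API key, and an async context, so it is only
# sketched in comments):
#
#   from google import genai
#   client = genai.Client(api_key="...")
#   async with client.aio.live.connect(
#       model="gemini-2.0-flash-exp", config=config
#   ) as session:
#       ...  # send input, receive audio frames
```

If you need the text as well, the Whisper-style transcription approach mentioned in the comments above is currently the only option, at the cost of an extra transcription pass.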
What problem are you trying to solve with this feature?
Two outputs for one response
Any other information you'd like to share?
I could not find any information related to this.