This repository provides a collection of examples demonstrating the capabilities of multimodal AI using the Gemini model. We explore various modalities including image, video, audio, and text, showcasing how to effectively combine these inputs for diverse applications.
The repository is organized into the following directories:
- gemini: Examples specifically utilizing the Gemini model for multimodal tasks.
- image_and_video: Demonstrations of multimodal AI with image and video data.
- audio: Examples focused on audio-based multimodal AI.
- embeddings: Code for generating multimodal embeddings using Gemini.
- Image and Video:
  - Image captioning: Generate descriptive captions for images.
  - Video object detection: Identify and locate objects within videos.
  - Video summarization: Create concise summaries of video content.
  - Image-to-text generation: Generate descriptive text for a given image or video.
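As a taste of the image examples, here is a minimal captioning sketch using the google-generativeai SDK. The model name (`gemini-1.5-flash`), the prompt wording, and the helper names are illustrative choices, not fixed by this repository.

```python
# Minimal image-captioning sketch. Model name and prompt are illustrative.

def build_caption_request(image, style: str = "concise") -> list:
    """Assemble the multimodal parts list passed to generate_content:
    a text prompt followed by the image."""
    prompt = f"Write a {style} descriptive caption for this image."
    return [prompt, image]

def caption_image(path: str, api_key: str) -> str:
    """Send an image to Gemini and return the generated caption.
    Requires the google-generativeai and Pillow packages."""
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(build_caption_request(Image.open(path)))
    return response.text
```

Keeping the request-building step separate from the API call makes the prompt easy to test and reuse across the image and video examples.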
- Audio:
  - Audio transcription: Convert spoken language into text.
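Audio files are sent through the Gemini File API before being referenced in a request. The sketch below assumes that flow; the supported-extension set and the transcription prompt are illustrative assumptions, while `genai.upload_file` is the SDK's upload call.

```python
# Audio-transcription sketch. The extension set and prompt are illustrative.
import os

SUPPORTED_AUDIO = {".mp3", ".wav", ".flac", ".aac", ".ogg"}  # assumed subset

def is_supported_audio(path: str) -> bool:
    """Cheap client-side extension check before uploading an audio file."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_AUDIO

def transcribe_audio(path: str, api_key: str) -> str:
    """Upload an audio file via the File API and ask Gemini to transcribe it.
    Requires the google-generativeai package."""
    import google.generativeai as genai

    if not is_supported_audio(path):
        raise ValueError(f"Unsupported audio format: {path}")
    genai.configure(api_key=api_key)
    audio_file = genai.upload_file(path)
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(
        ["Transcribe the spoken language in this audio file.", audio_file]
    )
    return response.text
```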
- Embeddings:
  - Image embeddings: Generate numerical representations of images.
  - Video embeddings: Create numerical representations of videos.
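A common way to obtain image and video embeddings in the Gemini ecosystem is Vertex AI's `multimodalembedding@001` model; the sketch below assumes that model and SDK, which may differ from what this repo's examples use. The cosine-similarity helper shows the typical way the resulting vectors are compared.

```python
# Embedding sketch. The Vertex AI model name is an assumption; the
# similarity helper is plain vector math.
import math

def cosine_similarity(a, b):
    """Compare two embedding vectors; returns a value in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def image_embedding(path: str):
    """Return the numerical representation of an image (a 1408-dim vector
    for multimodalembedding@001). Requires google-cloud-aiplatform."""
    from vertexai.vision_models import Image, MultiModalEmbeddingModel

    model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
    result = model.get_embeddings(image=Image.load_from_file(path))
    return result.image_embedding
```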
- PDF:
  - PDF understanding: Attach PDFs to Gemini requests to perform tasks that require understanding their contents.
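Attaching a PDF works much like other media: upload it, then include the returned file handle in the request parts. This sketch assumes that File API flow; the model name, prompt, and helper names are illustrative.

```python
# PDF-understanding sketch. Prompt and model name are illustrative.

def build_pdf_request(pdf_part, question: str) -> list:
    """Pair an uploaded PDF part with a question about its contents."""
    return [pdf_part, question]

def ask_about_pdf(path: str, question: str, api_key: str) -> str:
    """Upload a PDF via the File API and query its contents.
    Requires the google-generativeai package."""
    import google.generativeai as genai

    genai.configure(api_key=api_key)
    pdf_file = genai.upload_file(path)  # PDFs are uploaded like other media
    model = genai.GenerativeModel("gemini-1.5-flash")
    return model.generate_content(build_pdf_request(pdf_file, question)).text
```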
We welcome contributions to this repository! If you have any improvements, new examples, or bug fixes, please feel free to open a pull request.
Made with ❤ by jggomez.
Copyright 2024 Juan Guillermo Gómez
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.