Visual Translator 🖼️📣

Welcome to the Visual Translator! This project was developed during the Entrepreneur First Hackathon 2025 in Zurich 🇨🇭. The goal was to create a translation companion that enhances the way people interact with their environment, enabling them to translate objects they can see and capture in real time.

By simply loading an image and clicking on the object you want to translate, you’ll receive not only the translation but also three sample sentences in the target language, consistent with the context of the image.

Demo

The models can be accessed through an intuitive Gradio interface. Launch the application and start translating effortlessly!

Visual Translator Demo

🚀 Features

  • Upload or Select Images: Choose an image from your computer or camera, or use one of the default examples to get started.
  • Click to Annotate: Click on the object you want to translate, and the system will annotate the image with an arrow (a minimal sketch of this interaction follows the list).
  • Model Selection: Choose from two vision-language models, SmolVLM2 and PaliGemma 2 mix.
  • Translation and Descriptions: Receive the translation of the object and three sample sentences consistent with the context, powered by Llama-3.2.
  • Text to Audio: Hear how to pronounce the word and the sentences in the target language, powered by ElevenLabs.
  • Gradio Interface: A user-friendly interface that makes it easy to interact with the tool.
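As a rough illustration of the click-to-annotate flow, the sketch below wires a Gradio image component to a select handler that draws an arrow-style marker at the clicked point. It is a minimal, self-contained example rather than the actual gradio_demo.py code; the component layout and marker drawing are assumptions.

import gradio as gr
from PIL import ImageDraw

def annotate(image, evt: gr.SelectData):
    # evt.index holds the (x, y) pixel coordinates of the click on the image
    x, y = evt.index
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    # draw a simple diagonal shaft ending in a dot on the clicked object
    draw.line([(x - 60, y - 60), (x, y)], fill="red", width=5)
    draw.ellipse([x - 8, y - 8, x + 8, y + 8], fill="red")
    return annotated

with gr.Blocks() as demo:
    source = gr.Image(type="pil", label="Input image")
    marked = gr.Image(type="pil", label="Annotated image")
    source.select(annotate, inputs=source, outputs=marked)

demo.launch()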

🛠 Installation

  1. Clone the repository:
git clone https://github.com/massimilianoviola/visual-translator.git
  2. Install the dependencies (a Python virtual environment is recommended, e.g. conda):
pip install -r requirements.txt
  3. Run the demo (an ElevenLabs API key is required):
ELEVENLABS_API_KEY=YOUR_API_KEY python gradio_demo.py

🌍 Usage

  1. Upload an image or choose one from the example options.
  2. Click on the image to place an arrow on the object you wish to translate.
  3. Select the vision-language model you prefer.
  4. Click the "Generate Description" button to get the translated term.
  5. Select the target language from those supported by Llama-3.2.
  6. Click the "Translate" button to get the translation of the object and three sample sentences (a minimal sketch of this step follows the list).
  7. Click the "Vocalize All" button to get the audio of the translation and sample sentences.
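To give an idea of what the translation step (step 6) could look like under the hood, the snippet below prompts a Llama-3.2 instruct model through the transformers chat-style text-generation pipeline to translate the detected object and produce three context-aware sentences. The model id, prompt wording, and generation settings are illustrative assumptions (and assume a recent transformers version), not the project's actual code.

import torch
from transformers import pipeline

# Llama-3.2 is gated: requires Hugging Face access and a logged-in token
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.bfloat16,
)
messages = [
    {"role": "system", "content": "You are a helpful translation assistant."},
    {"role": "user", "content": (
        "Translate 'coffee cup' into German and write three short example "
        "sentences in German that fit a kitchen scene."
    )},
]
result = generator(messages, max_new_tokens=200)
# the chat pipeline returns the conversation with the assistant reply appended
print(result[0]["generated_text"][-1]["content"])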

🤝 Team Members

This project is brought to you by the talented team members participating in the Entrepreneur First Hackathon 2025, Zurich:

  • Massimiliano Viola
  • You Wu
  • Timur Taepov

🔑 Notes

  • This demo uses two pre-trained vision-language models, SmolVLM2 and PaliGemma 2 mix, plus Llama-3.2 for translation and sentence generation. Make sure you have access to these models on Hugging Face and log in with an HF access token via the CLI (e.g. huggingface-cli login).
  • Ensure that your ElevenLabs API key is passed to the demo as the ELEVENLABS_API_KEY environment variable, either inline when launching or exported in your shell (see the sketch below).
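For reference, a minimal sketch of the text-to-audio step with the ElevenLabs Python SDK (v1+ client API) might look like the following; the voice id, model id, and sample text are placeholders, not values taken from this repository.

import os
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
# synthesize one translated sentence; the voice and model ids below are placeholders
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="Die Kaffeetasse steht auf dem Tisch.",
)
# the client streams the audio back as byte chunks
with open("pronunciation.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)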
