
Added multimodal visual QnA using two advanced models - BLIP & CLIP #1188

Conversation


@Panchadip-128 Panchadip-128 commented Nov 9, 2024

This repository contains an implementation of a Visual Question Answering (VQA) model built with the BLIP (Bootstrapping Language-Image Pre-training) framework. The model understands image content and answers questions about a provided image. The tool reads an image with the BLIP model, answers user-prompted questions about it, and is deployed through a Gradio web application.
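Below is a minimal sketch of how such a BLIP-plus-Gradio VQA app can be wired together; the specific checkpoint (`Salesforce/blip-vqa-base`) and interface layout are assumptions for illustration, not necessarily what this PR uses.

```python
# Minimal BLIP VQA + Gradio sketch (assumed checkpoint and layout).
import gradio as gr
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def answer_question(image: Image.Image, question: str) -> str:
    # Encode the image-question pair and generate a short free-form answer.
    inputs = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="BLIP Visual Question Answering",
)

if __name__ == "__main__":
    demo.launch()
```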

Overview
This project develops Visual Question Answering (VQA) systems using two models: CLIP (Contrastive Language-Image Pre-training) and BLIP (Bootstrapping Language-Image Pre-training). The goal of VQA is to answer questions about an image by jointly learning from textual and visual information. The project demonstrates how to use CLIP and BLIP for VQA tasks and includes the training, validation, and testing procedures, as well as evaluation metrics.

Data Overview:
The dataset used for this project is sourced from the VizWiz 2023 VQA challenge. It contains three main components:

- train.json: the training set with image URLs, questions, and answers.
- val.json: the validation set.
- test.json: the test set, without answer labels, used for evaluation.

Each entry in the dataset includes:

- Image: the visual input for the model.
- Question: the textual question the model needs to answer.
- Answers: a list of candidate answers (for training/validation) and their confidences.

The dataset is preprocessed into a balanced set by stratifying on the answerable and answer_type labels, as sketched below.
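A small sketch of that stratified preprocessing step, assuming VizWiz-style annotation fields `answerable` and `answer_type`; the file path and split ratio are placeholders.

```python
# Stratified split on answerable + answer_type (assumed field names and paths).
import json
import pandas as pd
from sklearn.model_selection import train_test_split

with open("train.json", "r") as f:
    records = json.load(f)

df = pd.DataFrame(records)

# Combine the two labels into one key so both are balanced across the split.
strata = df["answerable"].astype(str) + "_" + df["answer_type"].astype(str)

train_df, holdout_df = train_test_split(
    df, test_size=0.2, stratify=strata, random_state=42
)
print(len(train_df), "training rows,", len(holdout_df), "held-out rows")
```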

CLIP-Based VQA Model
Overview:
The CLIP-based VQA model uses the pre-trained CLIP model to extract both visual and textual embeddings. The two embeddings are concatenated to form a single vector, which is then passed through a fully connected network to predict the answer.
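One possible sketch of that architecture in PyTorch with Hugging Face's CLIPModel; the checkpoint name, hidden size, and number of answer classes are illustrative assumptions rather than the values used in this PR.

```python
# CLIP embeddings concatenated and fed to a small MLP head (assumed sizes).
import torch
import torch.nn as nn
from transformers import CLIPModel

class ClipVqaClassifier(nn.Module):
    def __init__(self, num_answers: int, clip_name: str = "openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        embed_dim = self.clip.config.projection_dim  # 512 for the base model
        # Concatenated image + text embeddings go through a fully connected head.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_answers),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        image_emb = self.clip.get_image_features(pixel_values=pixel_values)
        text_emb = self.clip.get_text_features(
            input_ids=input_ids, attention_mask=attention_mask
        )
        fused = torch.cat([image_emb, text_emb], dim=-1)
        return self.head(fused)  # logits over candidate answers
```

At inference time, the matching CLIPProcessor can be used to turn an image-question pair into the pixel_values, input_ids, and attention_mask tensors expected by the forward pass.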
(Demo screenshots attached: qna-1, qna2, qna3)


github-actions bot commented Nov 9, 2024

Thank you for submitting your pull request! 🙌 We'll review it as soon as possible. If there are any specific instructions or feedback regarding your PR, we'll provide them here. Thanks again for your contribution! 😊

@Panchadip-128
Contributor Author

@Niketkumardheeryan Please review, as time is short.

@Niketkumardheeryan
Owner

@Panchadip-128 expecting more modular code with proper comments

@Panchadip-128
Contributor Author

@Niketkumardheeryan Done sir, I have also made that file's code modular.
