Ly-Duyen Tran (Allie)
- Introduction to Lifelogging
- Lifelog Retrieval
- Concept-based Retrieval
- CLIP model
- Miscellaneous
- Lifelogging is the process of tracking personal data generated by our own daily activities.
- It is a way to record and store our memories, experiences, and emotions.
- Personal memory enhancement
- Health monitoring
- Personal data analysis
- Lifelog retrieval is the process of searching and retrieving lifelog data.
- It is a challenging task due to the large amount of data and the need for efficient retrieval methods.
I needed to buy a blood pressure monitor. So I was looking in a pharmacy that sold Omron and Braun devices.
- For demonstration purposes, I will use the test dataset from the Lifelog Search Challenge 2022.
- Image: ![[lifelog_example.png]]
- Metadata: csv file
ImageID,new_lat,new_lng,semantic_name,city,country,timezone
20190105_164008_000.jpg,53.386,-6.261,DCU School of Computing,"Dublin, Ireland, Leinster",Ireland,Europe/Dublin
20190105_164040_000.jpg,53.386,-6.261,DCU School of Computing,"Dublin, Ireland, Leinster",Ireland,Europe/Dublin
20190105_164112_000.jpg,53.386,-6.261,DCU School of Computing,"Dublin, Ireland, Leinster",Ireland,Europe/Dublin
20190105_164144_000.jpg,53.386,-6.261,DCU School of Computing,"Dublin, Ireland, Leinster",Ireland,Europe/Dublin
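The metadata CSV can be loaded into a pandas DataFrame so that it can be filtered later. A minimal sketch (the file name metadata.csv is only an assumption for illustration):
import pandas as pd
# Load the metadata into a dataframe (file name is illustrative)
df = pd.read_csv("metadata.csv")
print(df[["ImageID", "semantic_name", "city"]].head())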
I needed to buy a blood pressure monitor. So I was looking in a pharmacy that sold Omron and Braun devices.
MyEachtra - an example lifelog retrieval system
- I will use standard Computer Vision and NLP models to demonstrate the retrieval of lifelog data.
- All the models can be found on Hugging Face and OpenAI.
- Practice code can be found on my GitHub and can be opened with Google Colab.
- Hugging Face is an AI company whose platform provides state-of-the-art models for NLP and Computer Vision.
- The models are available for free and can be used for various tasks such as classification, translation, summarization, etc.
- The models can be used in Python through the transformers library.
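As a small illustration of the transformers library, here is a minimal sketch that loads a summarization model from the Hugging Face Hub (the checkpoint facebook/bart-large-cnn is just one example):
from transformers import pipeline
# Load a summarization model from the Hugging Face Hub
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = ("Lifelogging is the process of tracking personal data generated by our own daily activities. "
        "It is a way to record and store our memories, experiences, and emotions.")
print(summarizer(text, max_length=25, min_length=5)[0]["summary_text"])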
- Google Colab is a free cloud-based Jupyter notebook environment provided by Google.
- It allows users to run Python code in the cloud without the need to install anything on their local machines.
- Some of the metadata can be used for retrieval purposes:
- Location: latitude, longitude, city, country
- Time: timestamp, timezone
- We can use filters on these fields, for example:
# Filter based on location
# assuming the metadata is stored in a pandas dataframe
location = "Dublin"
filtered_images = df[df["city"].str.contains(location, na=False, case=False)]
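Time can be used as a filter in the same way. A minimal sketch that assumes the capture time can be parsed from the ImageID filename (as in the example rows above):
# Filter based on time of day
# the ImageID filename encodes the capture time, e.g. 20190105_164008_000.jpg
df["timestamp"] = pd.to_datetime(df["ImageID"].str[:15], format="%Y%m%d_%H%M%S")
# keep only images taken in the afternoon (between 12:00 and 18:00)
filtered_images = df[(df["timestamp"].dt.hour >= 12) & (df["timestamp"].dt.hour < 18)]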
Computer Vision models can be used to extract concepts from images.
Imagine you want to find a book about Famous Scientists. You would go to the Science section and have a look at the books there. In other words, the books are organized based on genres or subjects.
So, in the same way, we can organize images based on concepts.
- Concepts are the objects, actions, or scenes that are present in the images.
What concepts can we find in this image?
Model: DETR
ImageId = "20160808_111247_000.jpeg"
concepts = ["cup", "sandwich", "banana", "person", "bottle"]
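A minimal sketch of extracting such concepts with DETR through the Hugging Face transformers library (the checkpoint facebook/detr-resnet-50 and the 0.8 score threshold are assumptions; the ResNet backbone may also require the timm package):
from transformers import pipeline
from PIL import Image
# Load a DETR object-detection model from Hugging Face
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
image = Image.open("20160808_111247_000.jpeg")
detections = detector(image)
# Keep confidently detected labels as the image's concepts
concepts = sorted({d["label"] for d in detections if d["score"] > 0.8})
print(concepts)  # e.g. ['banana', 'bottle', 'cup', 'person', 'sandwich']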
- What other computer vision models can be used for concept extraction?
- The user can search for images based on concepts and text.
- Example: "Find images of people walking in Dublin"
- Keywords: "people", "walking", "Dublin"
results = df[df["tags"].str.contains("people") & df["tags"].str.contains("walking") & df["city"] == "Dublin"]
- Captioning models can be used to extract concepts from images.
ImageID,Description
20160808_111247_000.jpeg,"A till operator is serving a customer at the cafe"
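A caption like the one above could be produced with an image-captioning model from Hugging Face. A minimal sketch (the BLIP checkpoint Salesforce/blip-image-captioning-base is just one example):
from transformers import pipeline
from PIL import Image
# Load an image-captioning model from Hugging Face
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
image = Image.open("20160808_111247_000.jpeg")
caption = captioner(image)[0]["generated_text"]
print(caption)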
We can use a more advanced text-retrieval method, such as TF-IDF, to search for images based on the captions.
- Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document.
- TF: the frequency of a word in a document
- IDF: the inverse of the frequency of the word in the entire corpus - the higher the frequency, the lower the IDF (words like the, and, is have low IDF)
# Using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer()
# Fit and transform the captions
X = vectorizer.fit_transform(df["Description"])
# Search for images based on the captions
query = "I am ordering a coffee at the cafe"
query_vector = vectorizer.transform([query])  # transform the query
scores = cosine_similarity(X, query_vector).flatten()  # similarity between the query and each caption
results = scores.argsort()[-10:][::-1]  # indices of the top 10 results
- You need to know the concepts in advance.
- The concepts are not always available in the metadata.
- Lots of models are needed for different concepts.
- The captions are not necessarily on the same level of abstraction as the user's query.
Ref: Kaggle
You want to find a book about Famous Scientists.
This time, the library has a librarian who can find books based on their content. She doesn't care about the genre or subject, but she knows which books are similar to each other.
She can guide you to the shelf where similar books are located.
- Simply put, embeddings are numerical representations of data.
- They are used to represent:
- Images
- Text
- Audio
- etc.
- There is a distance metric that can be used to measure the similarity between embeddings.
- Euclidean distance
- Cosine similarity
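A minimal sketch of both metrics with two toy vectors (in practice the embeddings come from a model such as CLIP):
import numpy as np
# Two toy embeddings
a = np.array([0.2, 0.8, 0.1])
b = np.array([0.3, 0.7, 0.0])
# Euclidean distance: smaller means more similar
euclidean = np.linalg.norm(a - b)
# Cosine similarity: closer to 1 means more similar
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean, cosine)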
- CLIP (Contrastive Language-Image Pretraining) is a model developed by OpenAI.
- It can be used to retrieve images based on text and vice versa.
Ref: OpenAI
- Inputs: pairs of image and text
- Outputs: similarity matrix
- Objective: maximize the similarity between the correct pairs and minimize the similarity between the incorrect pairs. (Contrastive Loss)
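A toy sketch of this objective (not the actual training code; the batch here is random and 100.0 stands in for the learned temperature):
import torch
import torch.nn.functional as F
# Toy batch of N matching image-text pairs
N, dim = 4, 512
image_features = F.normalize(torch.randn(N, dim), dim=-1)
text_features = F.normalize(torch.randn(N, dim), dim=-1)
# Similarity matrix: entry (i, j) compares image i with text j
logits = 100.0 * image_features @ text_features.T
# The correct pair for image i is text i, so the targets are the diagonal
targets = torch.arange(N)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2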
- Lifelog images can be encoded into image embeddings.
- The search query can be encoded into a text embedding.
- The similarity between the encoded images and the encoded query can then be calculated, giving a score for each image.
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
where @ is the matrix multiplication and softmax is the normalization function.
Description A: "I am ordering a coffee at the cafe"
Description B: "I am hiking in the mountains"
## Setting up the model
import torch
import clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Open the image
image = preprocess(Image.open("2019-08-01_09-00-00_0.jpg")).unsqueeze(0).to(device)
# Tokenize the descriptions
texts = clip.tokenize(["I am ordering a coffee at the cafe", "I am hiking in the mountains"]).to(device)
# Encode the image and texts
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
# Calculate similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(similarity)
The similarity scores can be used to retrieve images based on text and vice versa.
# Assuming the images are already encoded (and L2-normalised)
# image_features: tensor of shape (num_images, embedding_dim)
# Retrieve images based on text
text = "I am ordering a coffee at the cafe"
text_tokens = clip.tokenize([text]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# Cosine similarity between the query and every image
similarities = (image_features @ text_features.T).squeeze(1)
# Indices of the 10 most similar images
result = similarities.argsort(descending=True)[:10]
If the user has an example image, the similarity between the example image and the rest of the images can be calculated.
# Assuming the images are already encoded (and L2-normalised)
# image_features: tensor of shape (num_images, embedding_dim)
# Open and encode the example image
example_image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    example_image_features = model.encode_image(example_image)
example_image_features = example_image_features / example_image_features.norm(dim=-1, keepdim=True)
# Cosine similarity between the example image and every lifelog image
similarities = (image_features @ example_image_features.T).squeeze(1)
# Indices of the 10 most similar images
result = similarities.argsort(descending=True)[:10]
- What are the advantages and disadvantages of the CLIP model?
- What other scenarios can the CLIP model be used for?
- Due to the privacy of the dataset, I cannot provide the full code here.
- However, the code for retrieving images based on text using the CLIP model can be found on my GitHub.
- Lifelogging is a useful way to record and store personal data.
- Retrieval of lifelog data can be done using concept-based and embedding-based methods.
- The CLIP model can be used for image-text retrieval.