
Lifelogging and Image-Text Modelling

Ly-Duyen Tran (Allie)
[email protected]

Outline

  1. Introduction to Lifelogging
  2. Lifelog Retrieval
  3. Concept-based Retrieval
  4. CLIP model
  5. Miscellaneous

Lifelogging

  • Lifelogging is the process of tracking personal data generated by our own daily activities.
  • It is a way to record and store our memories, experiences, and emotions.

lifelog_example.png

Applications of Lifelogging

  • Personal memory enhancement
  • Health monitoring
  • Personal data analysis

Lifelog Retrieval

  • Lifelog retrieval is the process of searching and retrieving lifelog data.
  • It is a challenging task due to the large amount of data and the need for efficient retrieval methods.

I needed to buy a blood pressure monitor. So I was looking in a pharmacy that sold Omron and Braun devices.

pharmacy.gif

Test Dataset

  • For demonstration purposes, I will use the test dataset from the Lifelog Search Challenge 2022.

Example of the dataset:

  • Image: lifelog_example.png
  • Metadata: csv file
ImageID,new_lat,new_lng,semantic_name,city,country,timezone
20190105_164008_000.jpg,53.386,-6.261,DCU School of Computing,"Dublin, Ireland, Leinster",Ireland,Europe/Dublin
20190105_164040_000.jpg,53.386,-6.261,DCU School of Computing,"Dublin, Ireland, Leinster",Ireland,Europe/Dublin
20190105_164112_000.jpg,53.386,-6.261,DCU School of Computing,"Dublin, Ireland, Leinster",Ireland,Europe/Dublin
20190105_164144_000.jpg,53.386,-6.261,DCU School of Computing,"Dublin, Ireland, Leinster",Ireland,Europe/Dublin
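
The later snippets assume this metadata has been loaded into a pandas dataframe called df. A minimal sketch (the file name metadata.csv is an assumption):

import pandas as pd

# Load the metadata CSV into a dataframe (file name is an assumption)
df = pd.read_csv("metadata.csv")
print(df.columns.tolist())
# ['ImageID', 'new_lat', 'new_lng', 'semantic_name', 'city', 'country', 'timezone']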

Example query

I needed to buy a blood pressure monitor. So I was looking in a pharmacy that sold Omron and Braun devices.

MyEachtra - an example lifelog retrieval system

Demos

  • I will use standard Computer Vision and NLP models to demonstrate the retrieval of lifelog data.
  • All the models can be found on Hugging Face and OpenAI.
  • Practice code can be found on my GitHub and can be opened in Google Colab.

What is Hugging Face?

  • Hugging Face is an AI company that hosts state-of-the-art models for NLP and Computer Vision.
  • The models are available for free and can be used for various tasks such as classification, translation, summarization, etc.
  • The models can be used in Python through the transformers library.
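
As a small illustration of the transformers library (the checkpoint below is just an example, not one used in this lecture), an image can be classified in a few lines:

from transformers import pipeline

# Image classification with a pre-trained checkpoint (example model, not from the lecture)
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
predictions = classifier("20160808_111247_000.jpeg")
print(predictions[:3])  # top predicted labels with their scores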

What is Google Colab?

  • Google Colab is a free cloud-based Jupyter notebook environment provided by Google.
  • It allows users to run Python code in the cloud without the need to install anything on their local machines.

Metadata

  • Some of the metadata can be used for retrieval purposes:
    • Location: latitude, longitude, city, country
    • Time: timestamp, timezone
  • We can use filters.
# Filter based on location
# assuming the metadata is stored in a pandas dataframe
location = "Dublin"
filtered_images = df[df["city"].str.contains(location, na=False, case=False)]
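
Time filters work the same way. A minimal sketch, assuming the metadata also includes a timestamp column (not shown in the sample above):

# Filter based on time of day (assumes a "timestamp" column in the metadata)
df["timestamp"] = pd.to_datetime(df["timestamp"])
morning_images = df[(df["timestamp"].dt.hour >= 8) & (df["timestamp"].dt.hour < 12)]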

Concept-based approach

Computer Vision models can be used to extract concepts from images.

Analogy

Imagine you were in a library...

And you want to find a book about Famous Scientists. You would go to the Science section and have a look at the books there. In other words, the books are organized based on genres or subjects.

So, in the same way, we can organize images based on concepts.

What are concepts?

  • Concepts are the objects, actions, or scenes that are present in the images.

20160808_111247_000.jpeg

What can be the concepts in this image?

Object Detection example

Model : DETR

object_detection.png

ImageId = "20160808_111247_000.jpeg"
concepts = ["cup", "sandwich", "banana", "person", "bottle"]
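
A rough sketch of how such concepts could be extracted with the facebook/detr-resnet-50 checkpoint on Hugging Face (the checkpoint and the confidence threshold are my own choices, not necessarily the ones used in the demo):

import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detector = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("20160808_111247_000.jpeg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)

# Keep detections above a confidence threshold and map label ids to concept names
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]
concepts = sorted({detector.config.id2label[label.item()] for label in detections["labels"]})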

Discussion

  • What other computer vision models can be used for concept extraction?

Concept-based Retrieval

  • The user can search for images based on concepts and text.
  • Example: "Find images of people walking in Dublin"
  • Keywords: "people", "walking", "Dublin"
# assuming the extracted concepts are stored in a "tags" column
results = df[df["tags"].str.contains("people") & df["tags"].str.contains("walking") & (df["city"] == "Dublin")]

Using captions

  • Captioning models can be used to extract concepts from images.
ImageID,Description
20160808_111247_000.jpeg,"A till operator is serving a customer at the cafe"

We can then use a text-retrieval method such as TF-IDF to search for images based on their captions.

TF-IDF

  • Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document.
  • TF: the frequency of a word in a document
  • IDF: the inverse of the frequency of the word in the entire corpus - the higher the frequency, the lower the IDF (words like the, and, is have low IDF)

Example

# Using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
# Fit and transform the captions (the "Description" column of the caption file)
X = vectorizer.fit_transform(df["Description"])

# Search for images based on the captions
query = "I am ordering a coffee at the cafe"
query_vector = vectorizer.transform([query])         # transform the query
scores = cosine_similarity(X, query_vector).ravel()  # similarity to each caption
top_indices = scores.argsort()[-10:][::-1]           # indices of the top 10 results

Problems

  • You need to know the concepts in advance.
  • The concepts are not always available in the metadata.
  • Lots of models are needed for different concepts.
  • The captions are not necessarily on the same level of abstraction as the user's query.

Embedding-based Retrieval

embedding.png Ref: Kaggle

Analogy

Back to the library...

You want to find a book about Famous Scientists.

The library has a librarian who can find books based on their content. She doesn't care about genres or subjects, but she knows which books are similar to each other.

She can guide you to the shelf where similar books are located.

What are embeddings?

  • Simply put, embeddings are numerical representations of data.
  • They are used to represent:
    • Images
    • Text
    • Audio
    • etc.
  • There is a distance metric that can be used to measure the similarity between embeddings.
    • Euclidean distance
    • Cosine similarity
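
A quick illustration of the two metrics, using toy vectors in place of real embeddings:

import numpy as np

# Toy vectors standing in for real embeddings
a = np.array([0.2, 0.9, 0.1])
b = np.array([0.3, 0.8, 0.0])

euclidean = np.linalg.norm(a - b)                         # smaller = more similar
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # closer to 1 = more similar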

CLIP model

  • CLIP (Contrastive Language-Image Pretraining) is a model developed by OpenAI.
  • It can be used to retrieve images based on text and vice versa.

CLIP Ref: OpenAI

How does it work?

  • Inputs: pairs of image and text
  • Outputs: similarity matrix
  • Objective: maximize the similarity between the correct pairs and minimize the similarity between the incorrect pairs. (Contrastive Loss)
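
A rough sketch of the contrastive objective, using random tensors in place of real encoded pairs and a fixed scale of 100 (matching the snippet later in these slides; the actual model learns this temperature):

import torch
import torch.nn.functional as F

N, dim = 8, 512
# Stand-ins for a batch of N matching image-text pairs, already encoded and normalised
image_features = F.normalize(torch.randn(N, dim), dim=-1)
text_features = F.normalize(torch.randn(N, dim), dim=-1)

logits = 100.0 * image_features @ text_features.T  # N x N similarity matrix
labels = torch.arange(N)                           # the correct pair sits on the diagonal
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2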

How to use it?

  • Lifelog images can be encoded into image embeddings.
  • The search query can be encoded into a text embedding.
  • The similarity between the encoded images and the encoded query can then be calculated:
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

where @ is the matrix multiplication and softmax is the normalization function.

Example

20160808_111247_000.jpeg
Description A: "I am ordering a coffee at the cafe"
Description B: "I am hiking in the mountains"

# Setting up the model
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Open the image
image = preprocess(Image.open("20160808_111247_000.jpeg")).unsqueeze(0).to(device)
# Tokenize the descriptions
texts = clip.tokenize(["I am ordering a coffee at the cafe", "I am hiking in the mountains"]).to(device)

# Encode the image and texts
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

    # Calculate similarity
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(similarity)

Retrieval

The similarity scores can be used to retrieve images based on text and vice versa.

# Assuming the images are already encoded into a tensor image_features of shape
# (num_images, dim), with image_ids listing the corresponding filenames
# Retrieve images based on text
text = "I am ordering a coffee at the cafe"
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize([text]).to(device))
# Cosine similarity between the query and every image (both sides normalised)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarities = (image_features @ text_features.T).squeeze(1)
# Keep the top 10 most similar images
top_indices = similarities.argsort(descending=True)[:10]
result = [image_ids[i] for i in top_indices.tolist()]

Query-by-Example

If the user has an example image, the similarity between the example image and the rest of the images can be calculated.

# Assuming the images are already encoded into a tensor image_features of shape
# (num_images, dim), with image_ids listing the corresponding filenames

# Open and encode the example image
example_image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    example_image_features = model.encode_image(example_image)

# Cosine similarity between the example image and every lifelog image
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
example_image_features = example_image_features / example_image_features.norm(dim=-1, keepdim=True)
similarities = (image_features @ example_image_features.T).squeeze(1)

# Keep the top 10 most similar images
top_indices = similarities.argsort(descending=True)[:10]
result = [image_ids[i] for i in top_indices.tolist()]

Discussion

  • What are the advantages and disadvantages of the CLIP model?
  • What other scenarios can the CLIP model be used for?

Code

  • Due to privacy restrictions on the dataset, I cannot provide the full code here.
  • However, the code for retrieving images based on text using the CLIP model can be found on my GitHub.

Conclusion

  • Lifelogging is a useful way to record and store personal data.
  • Retrieval of lifelog data can be done using concept-based and embedding-based methods.
  • The CLIP model can be used for image-text retrieval.

Thank you!
