From 3613cde47a257f9b0826ddc7ec8cdffb8286a360 Mon Sep 17 00:00:00 2001
From: Shivank Garg <128126577+shivank21@users.noreply.github.com>
Date: Sat, 9 Mar 2024 11:53:09 +0530
Subject: [PATCH] :zap: Add Summary for Custom Diffusion

---
 ...ustomization_of_Text_to_Image_Diffusion.md | 54 +++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 summaries/Multi Concept_Customization_of_Text_to_Image_Diffusion.md

diff --git a/summaries/Multi Concept_Customization_of_Text_to_Image_Diffusion.md b/summaries/Multi Concept_Customization_of_Text_to_Image_Diffusion.md
new file mode 100644
index 0000000..24242c6
--- /dev/null
+++ b/summaries/Multi Concept_Customization_of_Text_to_Image_Diffusion.md
@@ -0,0 +1,54 @@

# Multi-Concept Customization of Text-to-Image Diffusion
Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman & Jun-Yan Zhu, **CVPR 2023**

## Summary

While text-to-image diffusion models generally perform well, they struggle with specific or nuanced concepts that are underrepresented in their training data. The paper introduces Custom Diffusion, a method for fine-tuning pre-trained text-to-image models that integrates new concepts with minimal compute and data.

![teaser](https://github.com/shivank21/vlg-recruitment-1y/assets/128126577/54447d1c-fd57-4625-88f3-9b109ca9cc34)


## Contributions

- Demonstrates a quick and computationally efficient fine-tuning process that enables the model to generate detailed images of new concepts (takes ~6 minutes on 2 A100 GPUs).
- Introduces a way for the model to learn several new concepts at once and compose them in generated images, making the model more useful for intricate scenes.
- Shows better performance than other methods in empirical evaluations.
- Introduces a new dataset of 101 concepts for evaluating model customization methods, along with text prompts for single-concept and multi-concept compositions.

## Method

Given a set of target images, the method first retrieves (or generates) regularization images whose captions are similar to those of the target images. The final training dataset is the union of the target and regularization images. During fine-tuning, the method updates the key and value projection matrices of the cross-attention blocks in the diffusion model with the standard diffusion training loss.
![methodology](https://github.com/shivank21/vlg-recruitment-1y/assets/128126577/4145d0a5-a5af-4ce5-9b04-2c617188488b)

The cross-attention block modifies the latent features of the network according to the condition features, i.e., text features in the case of text-to-image diffusion models. Given text features $c \in \mathbb{R}^{s \times d}$ and latent image features $f \in \mathbb{R}^{(h \times w) \times l}$, a single-head cross-attention operation computes $Q = W^q \textbf{f}$, $K = W^{k} \textbf{c}$, $V = W^{v} \textbf{c}$, and a weighted sum over value features:

$$\text{Attention}(Q, K, V) = \text{Softmax}\Big(\frac{QK^T}{\sqrt{d'}}\Big)V,$$

where $W^q$, $W^k$, and $W^v$ map the inputs to query, key, and value features, respectively, and $d'$ is the output dimension of the key and query features. The latent feature is then updated with the attention block output.
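As an illustration, here is a minimal PyTorch sketch of the single-head cross-attention operation described above; it is not the authors' implementation, and the class, argument, and dimension names are placeholders chosen to mirror the notation ($l$, $d$, $d'$).

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d')) V."""
    def __init__(self, latent_dim: int, text_dim: int, attn_dim: int):
        super().__init__()
        self.Wq = nn.Linear(latent_dim, attn_dim, bias=False)  # W^q: maps latent features f
        self.Wk = nn.Linear(text_dim, attn_dim, bias=False)    # W^k: maps text features c
        self.Wv = nn.Linear(text_dim, attn_dim, bias=False)    # W^v: maps text features c

    def forward(self, f: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # f: (batch, h*w, latent_dim) latent image features
        # c: (batch, s, text_dim)     text (condition) features
        Q, K, V = self.Wq(f), self.Wk(c), self.Wv(c)
        d_prime = K.shape[-1]
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_prime), dim=-1)
        return attn @ V  # weighted sum over value features: (batch, h*w, attn_dim)
```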
The task of fine-tuning aims to update the mapping from the given text to the image distribution, and the text features are input only to the $W^k$ and $W^v$ projection matrices in the cross-attention block. Therefore, the authors propose to update only the $W^{k}$ and $W^{v}$ parameters of the diffusion model during fine-tuning (a sketch of selecting these parameters is given at the end of this summary).

## Results

### Single-Concept

![Tortoise](https://github.com/shivank21/vlg-recruitment-1y/assets/128126577/33a8f690-ab26-48c7-9e7b-ae6962a50cae)

Here, $V^*$ is initialized with different rarely-occurring tokens and optimized along with the cross-attention key and value matrices of each layer during fine-tuning.

### Multi-Concept

![Chair_Table](https://github.com/shivank21/vlg-recruitment-1y/assets/128126577/8a93e8c6-d05f-431c-af8f-2a54ff258f27)

## Two-Cents

The research takes a big step forward in customizing text-to-image models, and it points to various possibilities ahead: one might extend the approach to a broader range of creative tasks, such as generating videos or audio from text descriptions. However, there are a couple of limitations. Tricky compositions, like having both a pet dog and a pet cat in one image, still pose a challenge, and composing three or more concepts remains difficult with this method.

## Resources
- Webpage : https://www.cs.cmu.edu/~custom-diffusion/
- Paper : https://arxiv.org/pdf/2212.04488.pdf
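As referenced in the Method section, below is a minimal sketch of restricting fine-tuning to the cross-attention key/value projections. It assumes a diffusers-style Stable Diffusion UNet in which cross-attention modules are named `attn2` with `to_k`/`to_v` projections; the checkpoint name and hyperparameters are illustrative, not taken from the paper's released code.

```python
import torch
from diffusers import UNet2DConditionModel

# Assumed checkpoint and module layout; the authors' implementation may differ.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

trainable_params = []
for name, param in unet.named_parameters():
    # Cross-attention key/value projections correspond to W^k and W^v above.
    if "attn2.to_k" in name or "attn2.to_v" in name:
        param.requires_grad_(True)
        trainable_params.append(param)
    else:
        param.requires_grad_(False)

# Only the selected projection matrices are passed to the optimizer.
optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)
```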