Synthetic data-based multilingual LLM retrieval models
25th March 2024
Supervised by: Nandan Thakur, Crystina (Xinyu) Zhang
Working Style: Weekly sync-up meetings (Slack for urgent matters/code debugging)
OVERVIEW
SWIM-X models have recently been shown to perform strongly in cross-lingual and multilingual retrieval settings (source). However, they are built on encoder-only models (mT5), which are limited to a context length of ~512 tokens and require large amounts of synthetic training data for retrieval pre-training and fine-tuning, making it difficult to extend them across ~101 languages.
The first line of work is to benchmark SWIM-X in Pyserini and establish reproducible baselines, as a warm-up to get familiar with the existing models and datasets (see the sketch below).
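As a warm-up, here is a minimal sketch of running a sparse baseline over a prebuilt MIRACL index with Pyserini. The prebuilt index name and language code are illustrative assumptions; check Pyserini's prebuilt-index listing for the exact identifiers used in the 2CR setup.

```python
# Warm-up sketch: BM25 over a prebuilt MIRACL index with Pyserini.
# NOTE: 'miracl-v1.0-sw' and the language code are assumed names; verify them
# against Pyserini's prebuilt-index documentation.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('miracl-v1.0-sw')   # assumed Swahili split name
searcher.set_language('sw')                                       # use the Swahili analyzer at query time

hits = searcher.search('historia ya Kilimanjaro', k=10)
for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2d} {hit.docid:25s} {hit.score:.4f}')
```

The dense SWIM-X checkpoints would slot into the same workflow through pyserini.search.faiss once the query encoders and prebuilt FAISS indexes are packaged for the 2CR.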
RELATED WORK
mE5-mistral-7B (source) is a recently introduced multilingual, Mistral-7B-based decoder retrieval model. However, its training data is not publicly available, and the model relies on a large amount of high-quality synthetic training data generated with GPT-4. Our work will instead focus on efficient fine-tuning with a smaller subset of multilingual training data.
Research Questions
Baseline: Reproduce SWIM-X (source) and push for 2CR within Pyserini/Anserini.
Compare the reproduced SWIM-X models against other multilingual LLM-based retrievers such as mE5-mistral-7B and Cohere Command-R (a scoring sketch follows below).
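For the comparison, all models would be scored against the same relevance judgments. A sketch of computing nDCG@10 for multiple runs with pytrec_eval follows; the qrels/run entries are toy placeholders, and in practice they would come from MIRACL and each model's TREC-format run file.

```python
# Score multiple retrieval runs against shared qrels with nDCG@10.
# The qrels and run dictionaries below are toy placeholders.
import pytrec_eval

qrels = {'q1': {'doc1': 1, 'doc7': 1}}                          # qid -> {docid: relevance}
runs = {
    'swim-x-repro':   {'q1': {'doc1': 12.3, 'doc2': 10.1, 'doc7': 9.8}},
    'me5-mistral-7b': {'q1': {'doc7': 0.91, 'doc1': 0.88, 'doc3': 0.70}},
}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'ndcg_cut.10'})
for name, run in runs.items():
    per_query = evaluator.evaluate(run)
    ndcg10 = sum(q['ndcg_cut_10'] for q in per_query.values()) / len(per_query)
    print(f'{name:16s} nDCG@10 = {ndcg10:.4f}')
```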
Future Scope
Further, we would like to examine multilingual LLMs (in contrast to the mT5 backbone used in SWIM-X) trained on only a small, few-shot synthetic dataset. Would we still need a large training dataset such as SWIM-IR, or would a handful of examples be enough for a multilingual LLM-based retriever? How do we extend the model across the 101 languages in mC4?
Explore the best approach to fine-tune LLM-based retrieval models such as Gemma-2b/Mistral-7b-v0.2 on the SWIM-IR dataset.
Research the minimum number of synthetic training pairs needed to remain competitive (see the data-budget sketch below).
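One way to frame this question is as a data-budget sweep over SWIM-IR. A rough sketch follows; the dataset id and config name are assumptions, so check the SWIM-IR release on Hugging Face for the actual identifiers.

```python
# Rough sketch of a data-budget sweep over SWIM-IR to probe the minimum number
# of synthetic training pairs. Dataset id and config name are assumptions.
from datasets import load_dataset

swim_ir = load_dataset('nthakur/swim-ir-monolingual', 'bn', split='train')   # assumed id/config
print(f'{len(swim_ir)} synthetic (query, passage) pairs for this language')

for budget in (100_000, 10_000, 1_000, 100):
    subset = swim_ir.shuffle(seed=42).select(range(min(budget, len(swim_ir))))
    # ... fine-tune on `subset`, then evaluate on MIRACL / XTREME-UP ...
    print(f'budget={budget:>7}: {len(subset)} training pairs')
```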
Resources
Tevatron/LoRA for fine-tuning LLMs for retrieval: GitHub
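For reference, a minimal sketch of the LoRA side of parameter-efficient retrieval fine-tuning with peft; the model id and target module names are assumptions to be adapted to the chosen backbone (Gemma-2b / Mistral-7b-v0.2).

```python
# Attach LoRA adapters to a decoder LM so only a small fraction of parameters
# is trained during retrieval fine-tuning. Model id and target modules are
# illustrative assumptions.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained('google/gemma-2b', torch_dtype=torch.bfloat16)
lora_cfg = LoraConfig(
    r=16,                        # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],   # attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the adapters (a small fraction) are trainable
```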
MILESTONES
Reproduce SWIM-X Baselines (M1)
Reproduce the SWIM-X models, create 2CR checkpoints for MIRACL, and include them in Pyserini. Reproduce the evaluation on XOR-RETRIEVE and XTREME-UP.
Familiarize with LLM Retrieval Fine-tuning (M2)
Run experiments to reproduce the rank llama example (GitHub) and use it as a template to extend Gemma-2b/Mistral-7b to multilingual retrieval datasets, whether synthetic (SWIM-IR), human-labeled (MIRACL), or translated (mMARCO). A sketch of the underlying scoring recipe is given below.
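As a rough guide, this is the standard recipe for turning a decoder-only LM into a dense retriever: last-token pooling plus an in-batch contrastive (InfoNCE) loss. The backbone id and the "query:"/"passage:" prefixes are assumptions; the actual training run would go through Tevatron, with LoRA on top.

```python
# Sketch: last-token pooling over a decoder-only LM and an in-batch contrastive
# loss. Backbone id and text prefixes are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = 'google/gemma-2b'                       # assumed backbone
tok = AutoTokenizer.from_pretrained(model_id)
tok.padding_side = 'right'                         # so the last non-pad token ends the text
lm = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def encode(texts):
    batch = tok(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    hidden = lm(**batch).last_hidden_state                      # (B, T, H)
    last = batch['attention_mask'].sum(dim=1) - 1               # index of each sequence's last token
    emb = hidden[torch.arange(hidden.size(0)), last]            # last-token pooling
    return F.normalize(emb.float(), dim=-1)

q = encode(['query: where is mount kilimanjaro located?'])
p = encode(['passage: Mount Kilimanjaro is a dormant volcano in Tanzania ...',
            'passage: The Eiffel Tower is a landmark in Paris ...'])
scores = q @ p.T                                                # cosine similarities
loss = F.cross_entropy(scores / 0.05, torch.tensor([0]))        # positive passage is index 0
```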
FUTURE MILESTONES
Few-shot LLM Retrieval Fine-tuning (M3)
Depending on the results of M2, extend the models further by fine-tuning on only a few examples per language (an idea similar to SetFit, GitHub). Find the optimal number of training examples required in each language (see the sampling sketch below).
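A sketch of that data regime, SetFit-style but for retrieval: a fixed number K of synthetic examples per language, pooled into one small training set. The dataset id, config names, and language list are assumptions.

```python
# Build a K-shot multilingual training set by sampling a fixed number of
# synthetic examples per language. Dataset id and language configs are assumed.
from datasets import load_dataset, concatenate_datasets

LANGS = ['bn', 'sw', 'te', 'th', 'yo']   # illustrative subset of SWIM-IR languages
K = 32                                   # examples per language; the quantity to sweep

few_shot = []
for lang in LANGS:
    ds = load_dataset('nthakur/swim-ir-monolingual', lang, split='train')   # assumed id
    few_shot.append(ds.shuffle(seed=0).select(range(min(K, len(ds)))))

train_set = concatenate_datasets(few_shot)
print(f'{len(train_set)} examples across {len(LANGS)} languages at K={K}')
```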
Extending Multilingual LLMs to 101 languages (M4)
If M3 works out successfully, we can generate synthetic datasets for all 101 languages covered by mC4 and fine-tune a multilingual LLM across them (a generation sketch is given below).
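A speculative sketch of what SWIM-IR-style generation for a new language could look like: sample mC4 passages and prompt an instruction-tuned LM to write a query in that language. The mC4 dataset id, generator model, and prompt wording are all assumptions, not the pipeline used in the original SWIM-IR work.

```python
# Speculative sketch of synthetic (query, passage) pair generation for one
# mC4 language. Dataset id, model id, and prompt are illustrative assumptions.
from datasets import load_dataset
from transformers import pipeline

generator = pipeline('text-generation', model='google/gemma-2b-it')   # assumed generator

def synth_pairs(lang: str, n: int = 3):
    passages = load_dataset('mc4', lang, split='train', streaming=True)   # assumed dataset id
    pairs = []
    for i, row in enumerate(passages):
        if i >= n:
            break
        prompt = (f'Passage: {row["text"][:1000]}\n\n'
                  f'Write one search question in this passage\'s language ({lang}) '
                  f'that the passage answers:\n')
        query = generator(prompt, max_new_tokens=64,
                          return_full_text=False)[0]['generated_text'].strip()
        pairs.append({'lang': lang, 'query': query, 'positive': row['text']})
    return pairs

print(synth_pairs('sw'))
```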
We have a new project on multilingual retrieval and reproduction, and we are looking for 2 URA students to work with us.
Feel free to reach out on Slack or email us at [email protected], [email protected].