TREC 2024 - Workflow Record #48

Open
Yuv-sue1005 opened this issue Aug 13, 2024 · 6 comments

@Yuv-sue1005

This issue isn't a problem to be fixed; rather, it is a record to keep track of undergraduate students' work on TREC 2024. In a series of comments, contributors can write about their work (at a high level) on specific tracks. This issue will be regularly updated upon finishing certain tasks/tracks.

@Yuv-sue1005

Here are my contributions to TREC's NeuCLIR track.

  • Task: Cross-Language Technical Documents
    • Assisted in the creation of a first-stage (f-stage) GTE-Qwen2 dense retrieval baseline. Helped write scripts to encode the corpus and queries, then ran retrieval with the resulting embeddings.
    • Created BM25 document-translation and query-translation baselines. Indexed the corpus, ran the baselines, and evaluated all created runs.
    • Created a SPLADE document-translation baseline. Encoded the corpus and queries with the SPLADE model, indexed the corpus embeddings, ran the baseline, and evaluated the results. This baseline was not used in submissions, given its surprisingly low eval scores.
    • Created and ran several data-munging scripts to reformat the queries and corpus.
  • Task: Multilingual Retrieval (MLIR)
    • Created and used multiple reformatting scripts, most notably one that converts a TREC run into retrieval-results format.
    • Created SPLADE, BM25-dt, BM25-qt, and PLAID baselines for task (basically all f-stage baselines). Ran all baselines, evaluated runs, RRFed runs into fusion runs, and then evaluated fusion runs. Fusion runs were sent off to mono stage.
    • Fused (with RRF) and evaluated post-mono and post-listo runs.
  • Task: Cross-language Retrieval (CLIR)
    • Taking the post-mono fused runs from MLIR, I combined the zho, rus, and fas runs to create a top-300 retrieval-results run for CLIR. This new retrieval-results file was then sent off to list-wise reranking.
  • Task: Cross-Language Report Generation
    • Created SPLADE, BM25-dt, and PLAID baselines for task (all f-stage baselines). Ran all baselines, evaluated runs, RRFed runs into fusion runs, and then evaluated fusion runs. Fusion runs were sent off to mono stage.
    • Created and used multiple reformatting scripts.
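
The run-reformatting work mentioned above (converting a TREC run into retrieval-results format) can be sketched as follows. The exact downstream schema is not specified in this thread, so the per-query dict layout and the function name here are assumptions for illustration only.

```python
from collections import defaultdict

def trec_run_to_results(lines):
    """Parse TREC run lines ('qid Q0 docid rank score tag') into
    {qid: [(docid, score), ...]} sorted by descending score."""
    results = defaultdict(list)
    for line in lines:
        qid, _q0, docid, _rank, score, _tag = line.split()
        results[qid].append((docid, float(score)))
    for qid in results:
        results[qid].sort(key=lambda pair: pair[1], reverse=True)
    return dict(results)
```

A structure like this is convenient to serialize into whatever JSON shape the next pipeline stage expects.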

@Stefan824

Cross-Language Technical Document Tasks

1. Data Preprocessing

  • Preprocessed all relevant data to a Pyserini-compatible format (corpus, topics/queries, and qrels).
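
Pyserini's JsonCollection indexer expects one JSON document per line with "id" and "contents" fields. A minimal sketch of the corpus-side conversion might look like this; the input field names (docid, title, text) are assumptions about the raw data:

```python
import json

def to_pyserini_doc(doc):
    """Map a raw record onto the {'id', 'contents'} schema that
    Pyserini's JsonCollection indexer expects (one JSON object per line)."""
    return json.dumps(
        {"id": doc["docid"], "contents": doc["title"] + "\n" + doc["text"]},
        ensure_ascii=False,
    )
```

Writing one such line per document into a .jsonl file makes the corpus directly indexable with Pyserini.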

2. Dense Embedding Experiments Reproduction

  • Reproduced existing experiments using Pyserini for pre-encoded corpus and queries.
  • Gained insights into how Pyserini handles external models.

3. Dense Retrieval Baseline (GTE-Qwen2)

  • Developed and implemented the dense retrieval baseline using the GTE-Qwen2 model (a dense embedding model not natively supported by Pyserini).
  • Scripted the encoding process for both corpus and queries, converting them into a Pyserini-compatible format.
  • Executed indexing, searching, and evaluation using Pyserini.
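
Conceptually, searching over pre-encoded embeddings reduces to an inner-product ranking. This is not the actual Faiss-backed search Pyserini performs, just a brute-force sketch of the same operation over L2-normalized vectors:

```python
import numpy as np

def dense_search(query_emb, doc_embs, doc_ids, k=10):
    """Brute-force inner-product search over pre-encoded, L2-normalized
    document embeddings (inner product equals cosine similarity here)."""
    scores = doc_embs @ query_emb          # one score per document
    top = np.argsort(-scores)[:k]          # indices of the k highest scores
    return [(doc_ids[i], float(scores[i])) for i in top]
```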

4. Pyserini Integration for the GTE-Qwen2 Model

  • Added connector code to integrate the GTE-Qwen2 dense embedding model with Pyserini.
  • Currently testing model performance; results are pending.

5. BM25 and SPLADE Baselines (Document and Query Translation)

  • Assisted with setting up baselines using supported models from Pyserini/Anserini.

6. Runs Fusion

  • Scripted processes for fusing outputs from different baselines.
  • Finalized outstanding runs with run fusion strategies.
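
The fusion strategy referenced throughout this thread is Reciprocal Rank Fusion (RRF), which scores each document as the sum of 1/(k + rank) over all runs that retrieved it. A minimal sketch (k=60 is the commonly used default, assumed here):

```python
from collections import defaultdict

def rrf_fuse(runs, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over runs of 1/(k + rank_d).
    Each run is an ordered list of docids, rank 1 first."""
    scores = defaultdict(float)
    for run in runs:
        for rank, docid in enumerate(run, start=1):
            scores[docid] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because RRF only uses ranks, it fuses runs with incomparable score scales (e.g. BM25 and dense retrieval) without any normalization.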

Multi-Language Information Retrieval (MLIR) and Cross-Language Information Retrieval (CLIR) Tasks

1. Baseline Setup and Reproduction

  • Located and reproduced baselines from previous years.
  • Contributed to the setup of all first-stage retrieval baselines.

Report Generation Tasks

1. Report Request Handling

  • Scripted tools to extract and format key information from report requests.

2. Prompt Engineering for GPT-4

  • Engineered prompts for GPT-4 to break down report requests into sub-questions.
  • Implemented scripts to generate results from these sub-questions.

3. First-Stage Retrieval for Reports

  • Performed initial retrieval on report requests paired with corresponding sub-questions.

4. Reranking with Cohere Reranker

  • Applied reranking using the Cohere reranker on the initial retrieval results.

@Yuv-sue1005

Here are my contributions to the TREC RAG track.

@Yuv-sue1005

My contributions to TREC ToT.

  • Collaboratively built a Llama-3.1 baseline that uses PromptReps for f-stage retrieval. Building this baseline involved reformatting TREC's queries/corpus/qrels to fit PromptReps' requirements, becoming familiar with PromptReps, encoding dense and sparse representations of the corpus, generating a sparse index, and searching via PromptReps.
  • Following in the footsteps of last year's top team, I created a script from scratch that adds TOMT-KIS, a dataset of ~1.2 million ToT questions from Reddit, to our given corpus (~3 million docs) and queries (150 queries). This was to create a larger corpus/query set to train DistilBERT on. In total, roughly 90k query-document pairs were added. To implement this corpus expansion, I learned how to use Hugging Face, vLLM, and difflib, and improved my prompt-engineering, logic, and problem-solving skills.
  • Recreated TREC ToT's given BM25, DistilBERT, and GPT-4o baselines. This involved modifying and adding to ToT's scripts, fixing errors on the fly, and learning the basic concepts of each baseline.
  • Discussed numerous baseline ideas, researched successful teams' baselines/papers and implemented ideas, and provided coding support wherever possible.
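
The difflib usage mentioned above (aligning TOMT-KIS entries against the given corpus) can be sketched with get_close_matches; the function name and the 0.8 cutoff are assumptions for illustration:

```python
import difflib

def match_title(answer_title, corpus_titles, cutoff=0.8):
    """Fuzzy-match an answer title from TOMT-KIS against corpus titles.
    Returns the best match whose similarity ratio exceeds `cutoff`, else None."""
    matches = difflib.get_close_matches(answer_title, corpus_titles,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Fuzzy matching tolerates the typos and formatting differences common in Reddit-sourced titles, at the cost of the occasional false positive if the cutoff is set too low.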

@natek-1

natek-1 commented Nov 1, 2024

TREC ToT contributions

  • Integrated the PromptReps repo for use on the ToT dataset. In collaboration, adjusted the dataset to the expected format, encoded dense and sparse representations, and built a sparse index.
  • Built on last year's top team's approach, using the ideas presented in their paper and their existing code for reranking.

@Stefan824

TREC ToT contributions:

  • Conducted a literature review of successful teams from past years and reproduced their results
  • Identified the corpus-expansion strategy and helped with script-writing
  • Worked on various scripts, such as add_tomt_kis_vllm.py
