From 6d6951f17890b11deab6a45fca0b7b34de489909 Mon Sep 17 00:00:00 2001
From: Astra Clelia Bertelli <133636879+AstraBert@users.noreply.github.com>
Date: Sun, 14 Jul 2024 20:39:05 +0200
Subject: [PATCH] Update README.md

---
 double-layered-rag-benchmarks/README.md | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/double-layered-rag-benchmarks/README.md b/double-layered-rag-benchmarks/README.md
index 6247370..6908bf2 100644
--- a/double-layered-rag-benchmarks/README.md
+++ b/double-layered-rag-benchmarks/README.md
@@ -12,7 +12,7 @@ The RAG workflow goes like this:
 
 - The 10 best hits get re-encoded by `sentence-t5-base` and uploaded to a non-permanent Qdrant collection (deleted after each round of use)
 - `sentence-t5-base` performs a second retrieval call, extracting the best match to the original query, which is then returned
 
-## Small test
+## First test
 
 The benchmark is based on the content of 4 web pages:
@@ -23,7 +23,7 @@
 
 The content of these URLs was chunked and uploaded to Qdrant collections; smaller portions of each chunk (encompassing 10-25% of the text) were then used as queries, and the retrieved results were compared with the original full text.
 
-## Results
+## First results
 
 The correct/total retrievals ratio for `All-MiniLM-L6-v2` alone is 81.54%, whereas the ratio for the double-layered `All-MiniLM-L6-v2` + `sentence-t5-base` pipeline described above goes up to 93.85%, equalling that of `sentence-t5-base` alone. A double-layered approach with the roles of the two encoders switched yields a correct/total retrievals ratio of 84.62%.
 
@@ -31,10 +31,25 @@ The advantage of this technique is that it does not require that all the chunks
 
 The disadvantage is the execution time: on an 8GB RAM, 12-core Windows 10 laptop, double-layered RAG takes an average of 8.39 s, against the 0.23 s of `sentence-t5-base` alone.
 
+## Second test
+
+The second benchmark is based on the []() dataset, available on HuggingFace. It is a Q&A dataset built on a set of 358 answers (used as the content to retrieve) and matching questions (used as retrieval queries).
+
+## Second results
+
+- Avg time for `All-MiniLM-L6-v2`: 0.139 +/- 0.018 s
+- Avg time for `sentence-t5-base`: 0.355 +/- 0.138 s
+- Avg time for `All-MiniLM-L6-v2` + `sentence-t5-base`: 10.72 +/- 1.26 s
+- Avg time for `sentence-t5-base` + `All-MiniLM-L6-v2`: 2.72 +/- 0.31 s
+- Correct/Total retrievals for `All-MiniLM-L6-v2`: 41.85%
+- Correct/Total retrievals for `sentence-t5-base`: 50.84%
+- Correct/Total retrievals for `All-MiniLM-L6-v2` + `sentence-t5-base`: 50.28%
+- Correct/Total retrievals for `sentence-t5-base` + `All-MiniLM-L6-v2`: 42.13%
+
 ## Code availability
 
-The benchmark test code is available [here](./scripts/benchmark_test.py)
+The benchmark test code is available [here](./scripts/benchmark_test.py) for the first test and [here](./rageval.ipynb) for the second one.
 
 ## Contributions
 
-If you happen to have time and a powerful hardware, you can carry on vaster tests using the script referenced before: it would be great!
\ No newline at end of file
+If you happen to have time and powerful hardware, it would be great if you could carry out more extensive tests using the scripts referenced above!
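+
+## Workflow sketch
+
+For illustration only, here is a minimal, untested Python sketch of the double-layered retrieval loop described at the top of this README, assuming `sentence-transformers` and `qdrant-client`. It assumes a permanent 384-dimensional collection (here called `chunks`, with a `text` payload field) has already been populated with `All-MiniLM-L6-v2` embeddings; all names and parameters are illustrative, not taken from the benchmark scripts.
+
+```python
+from qdrant_client import QdrantClient
+from qdrant_client.models import Distance, PointStruct, VectorParams
+from sentence_transformers import SentenceTransformer
+
+# Layer 1 encoder (fast, 384-dim) and layer 2 encoder (slower, 768-dim)
+coarse = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+fine = SentenceTransformer("sentence-transformers/sentence-t5-base")
+
+client = QdrantClient(":memory:")  # in-memory instance, for the sketch only
+
+
+def double_layered_search(query: str, collection: str = "chunks") -> str:
+    """Retrieve 10 hits with All-MiniLM-L6-v2, then re-rank with sentence-t5-base."""
+    # Layer 1: coarse retrieval of the 10 best hits from the permanent collection
+    hits = client.search(
+        collection_name=collection,
+        query_vector=coarse.encode(query).tolist(),
+        limit=10,
+    )
+    texts = [hit.payload["text"] for hit in hits]
+
+    # Layer 2: re-encode the hits into a non-permanent collection
+    client.recreate_collection(
+        collection_name="temporary",
+        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
+    )
+    client.upsert(
+        collection_name="temporary",
+        points=[
+            PointStruct(id=i, vector=fine.encode(text).tolist(), payload={"text": text})
+            for i, text in enumerate(texts)
+        ],
+    )
+
+    # Second retrieval call: the single best match to the original query
+    best = client.search(
+        collection_name="temporary",
+        query_vector=fine.encode(query).tolist(),
+        limit=1,
+    )[0]
+    client.delete_collection(collection_name="temporary")  # deleted at every round
+    return best.payload["text"]
+```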
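+
+## Benchmark metric sketch
+
+Likewise, a possible reading of how the first test's correct/total ratio is computed, where a random 10-25% slice of each chunk serves as the query. The `retrieve` callable and the exact-match criterion are assumptions for the sketch, not the benchmark script's actual logic.
+
+```python
+import random
+from typing import Callable
+
+
+def correct_total_ratio(chunks: list[str], retrieve: Callable[[str], str]) -> float:
+    """Query with a random 10-25% slice of each chunk; count exact-match retrievals."""
+    correct = 0
+    for chunk in chunks:
+        # Take a contiguous slice covering 10-25% of the chunk as the query
+        span = max(1, int(len(chunk) * random.uniform(0.10, 0.25)))
+        start = random.randint(0, len(chunk) - span)
+        query = chunk[start : start + span]
+        # A retrieval counts as correct if the original full chunk comes back
+        if retrieve(query) == chunk:
+            correct += 1
+    return correct / len(chunks)
+```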