
Updated How To Make Your Own Search Engine
[email protected] authored and Siteleaf committed Aug 16, 2023
1 parent 827d52e commit f7d2927
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions _drafts/how-to-make-your-own-search-engine.markdown
@@ -1,5 +1,5 @@
---
title: How to Make Your Own Search Engine Using Embedding
title: 'How to Make Your Own Search Engine: Semantic Search With LLM Embeddings'
date: 2023-08-11 09:40:00 Z
categories:
- Tech
@@ -108,7 +108,7 @@ Large Language Models (LLMs) are machine learning models that have been trained

Vectors are lists of values, where the length of the list is the dimension of the vector, so a 3D vector has 3 values. Lists of numbers are often not easy to see patterns in, so we visualise them by showing the vector spatially, interpreting each value in the vector as a coordinate in space.

Embeddings can contain different amounts of context; from sentence embedding which represents the meaning of a whole sentence, to word embeddings that represent the meaning of individual words independent of their context. In semantic search we want to take into account as much context as we can, therefore we will be using sentence embedding for this application. We can then combine several sentence embeddings . The sentence embedding vector contains many values, and each of these values represent the strength of a category that somehow represent the meaning of our sentence. This means the number of values in the vector are the number of categories it contains. The values represent the strength of that category in a range from 0 to 1, where 1 means our sentence is a perfect match for a certain category and 0 means the sentence doesn’t fit the category at all. These categories are decided by the LLM while it is being trained and aren’t obvious human categories, hence they can be tricky to interpret exactly, but an analogy would be categories like *positivity* or *isBangladeshiFood*.
Embeddings are vectors that can contain different amounts of context; from sentence embeddings, which represent the meaning of a whole sentence, to word embeddings, which represent the meaning of individual words independent of their context. In semantic search we want to take into account as much context as we can, so we will be using sentence embeddings for this application. We can then combine several sentence embeddings into a single summary embedding. The sentence embedding vector contains many values, and each of these values represents the strength of a category that somehow captures part of the meaning of our sentence. This means the number of values in the vector is the number of categories it contains. The values represent the strength of that category in a range from 0 to 1, where 1 means our sentence is a perfect match for a certain category and 0 means the sentence doesn’t fit the category at all. These categories are decided by the LLM while it is being trained and aren’t obvious human categories, hence they can be tricky to interpret exactly, but an analogy would be categories like *positivity* or *isBangladeshiFood*.
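As a toy illustration of reading an embedding as category strengths (the category names and values here are invented, since an LLM's real learned categories aren't human-readable):

```python
# Hypothetical sentence embedding: a fixed-length list of values,
# one per learned "category", each in the range 0 to 1.
embedding = [0.92, 0.10, 0.85]
categories = ["positivity", "isBangladeshiFood", "isAboutCooking"]

# The dimension of the vector is the number of categories it encodes.
assert len(embedding) == len(categories)

# Pair each value with its category to read off the strengths.
strengths = dict(zip(categories, embedding))
print(strengths["positivity"])  # 0.92 — a value near 1 means a strong match
```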

If our document or query contains many sentences, we will get a sentence embedding for each when we run our LLM’s encoding. We want the document and the query to each be represented by just one embedding vector: a document embedding vector and a query embedding vector. To achieve this we need to summarise our many sentence embeddings, which we can do by taking the average of each category across all the sentence embeddings. This gives us a summary embedding. This works because embedding vectors from the same LLM are consistent: they have the same categories and the same vector size.
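The averaging step above can be sketched with NumPy (a minimal sketch; the embedding values are made up, and a real model would produce far more than 4 dimensions):

```python
import numpy as np

# Three hypothetical sentence embeddings from the same LLM, so each
# has the same dimension (here, 4 categories).
sentence_embeddings = np.array([
    [0.9, 0.1, 0.4, 0.7],
    [0.8, 0.2, 0.5, 0.6],
    [0.7, 0.3, 0.6, 0.5],
])

# Average per category (column-wise) to collapse many sentence
# embeddings into one summary embedding for the whole document.
document_embedding = sentence_embeddings.mean(axis=0)
print(document_embedding)  # approximately [0.8, 0.2, 0.5, 0.6]
```

The same averaging is applied to a multi-sentence query, so the query and document end up as one vector each, ready to compare.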

