do-me/copernicus-services-semantic-search

Copernicus Services Semantic Search

Tutorial here with lots of tips

A basic semantic search app based on 834 entries from the Copernicus Services Catalogue. Each entry is chunked and indexed with all-MiniLM-L6-v2 (one mean embedding over all content chunks per entry), and the index is stored as a ~2.4 MB gzipped JSON file. Enter any query and hit submit or press Enter. The app loads ~27 MB of data and scripts. The ML model runs entirely in the browser thanks to transformers.js.
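
To illustrate how a query can be matched against the per-document mean embeddings, here is a minimal sketch. It assumes cosine similarity as the ranking measure and a hypothetical `docs` array of `{ title, embedding }` objects; the app's actual data layout and scoring code may differ.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank documents by similarity to the query embedding, best match first.
// `docs` is a hypothetical array of { title, embedding } objects.
function rankBySimilarity(queryEmbedding, docs, limit = 10) {
  return docs
    .map((doc) => ({
      title: doc.title,
      similarity: cosineSimilarity(queryEmbedding, doc.embedding),
    }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit);
}

// Toy 3-dimensional vectors (made up for illustration);
// all-MiniLM-L6-v2 produces 384-dimensional embeddings.
const docs = [
  { title: "Sea surface temperature", embedding: [0.9, 0.1, 0.0] },
  { title: "Land cover map", embedding: [0.0, 0.2, 0.9] },
  { title: "Ocean colour", embedding: [0.7, 0.3, 0.1] },
];
const results = rankBySimilarity([1, 0, 0], docs, 2);
console.log(results);
```

In the app, the query embedding would come from the model running in the browser, and `limit` corresponds to the number of results to display.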

Advanced search

If you'd like to search within the content of a result, consider installing the SemanticFinder Chrome extension (see its GitHub repo).

It finds the most relevant sections to your query in the actual content of the results by performing semantic search on the fly.

Data mining tutorial

The process of creating the data dump can be repeated with the included Jupyter Notebook. It includes the whole processing pipeline:

  • data mining with requests and beautifulsoup
  • preprocessing in pandas
  • chunking the document text in smaller paragraphs of the right size for the ML model
  • creating embeddings for each chunk
  • calculating the mean embedding for each document
  • saving as gzipped json (small file size & easy and fast to read in js with pako.js)
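
The chunking and mean-embedding steps above can be sketched as follows. This is not the notebook's code (the notebook uses Python); it is an illustrative JavaScript version that assumes a simple word-count chunker and uses made-up 2-dimensional vectors in place of real chunk embeddings. `chunkText` and `meanEmbedding` are hypothetical helper names.

```javascript
// Split text into chunks of at most `maxWords` words.
// A word-count limit is an assumption; the notebook may chunk differently.
function chunkText(text, maxWords = 100) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// Element-wise mean of equal-length vectors: one vector per document.
function meanEmbedding(vectors) {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const sum = new Array(vectors[0].length).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < v.length; i++) sum[i] += v[i];
  }
  return sum.map((s) => s / vectors.length);
}

const chunks = chunkText("one two three four five", 2);
console.log(chunks); // [ 'one two', 'three four', 'five' ]

// Stand-ins for the embeddings of three chunks of one document.
const docVector = meanEmbedding([[1, 0], [0, 1], [2, 2]]);
console.log(docVector); // [ 1, 1 ]
```

Storing a single mean vector per document keeps the index small, at the cost of losing chunk-level detail.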

You can re-run the process for updates (if you do so, please open a pull request for this repo or write to me so I can keep the data dump updated) or use other indexing models like the current MTEB leaders of the bge or gte family. You could also use a multilingual model to perform search queries in languages other than English. The current dump holds 834 entries from 21 October 2023.

Export all 834 entries for large LLM context

How to create this text file with JS:

  • Run a search and display all results (enter 1000 as limit). The results are ordered by similarity.
  • Open the browser console with F12
  • Use this JS and execute it:

// Remove every '.position-relative' element from the DOM
document.querySelectorAll('.position-relative').forEach(function (element) {
    element.remove();
});

function tableToText() {
    // Select the results table
    const table = document.getElementById('results-table');
    let resultText = '';

    // Loop through each row
    for (const row of table.rows) {
        // Loop through each cell in the row, excluding the "Similarity" column (index 5)
        for (let i = 0; i < row.cells.length; i++) {
            if (i === 5) continue; // Skip Similarity

            const cell = row.cells[i];

            // For columns 1 and 2, prefer the link target if the cell contains an anchor tag
            if (i === 1 || i === 2) {
                const link = cell.querySelector('a');
                if (link) {
                    // Use the href attribute of the anchor tag
                    resultText += link.href + '\n';
                } else {
                    resultText += cell.innerText + '\n';  // Fall back to normal text
                }
            } else {
                resultText += cell.innerText + '\n';  // For other columns, use innerText
            }
        }
        resultText += '\n\n';  // Add two line breaks between rows
    }

    console.log(resultText);  // Log the result to the console
}

// Call the function to convert the table to text and log it
tableToText();

  • Copy the output with the copy button (e.g. in Chrome) or select the whole text

Qdrant Instance

I provide a public Qdrant instance on Qdrant Cloud that you can use to create plots of the collection via dimensionality reduction or graph-based links. Access the collection with the API key A-KWBxWl_8G3cnXv3MlpCThEDTdS6FYnTzn-h9k9TE95f5cvMUAGbQ under:

https://8f35f088-fc2e-426e-92a3-4f4f26f64812.europe-west3-0.gcp.cloud.qdrant.io:6333/dashboard#/collections/Copernicus_Services/
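
Besides the dashboard, the collection can presumably also be read programmatically. The sketch below assumes the instance accepts Qdrant's REST scroll endpoint (`POST /collections/{name}/points/scroll`) with the key sent in the `api-key` header, and that the key permits read access; `buildScrollRequest` and `fetchPoints` are hypothetical helper names, and the payload fields returned are not documented here.

```javascript
// Host and collection name are taken from the dashboard URL above.
const QDRANT_URL =
  "https://8f35f088-fc2e-426e-92a3-4f4f26f64812.europe-west3-0.gcp.cloud.qdrant.io:6333";
const COLLECTION = "Copernicus_Services";

// Build the URL and fetch options for a scroll request (no network access here).
function buildScrollRequest(apiKey, limit = 10) {
  return {
    url: `${QDRANT_URL}/collections/${COLLECTION}/points/scroll`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json", "api-key": apiKey },
      // Request payloads only; set with_vector to true to also get embeddings.
      body: JSON.stringify({ limit, with_payload: true, with_vector: false }),
    },
  };
}

// Send the request and return the list of points from the response.
async function fetchPoints(apiKey, limit = 10) {
  const { url, options } = buildScrollRequest(apiKey, limit);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Qdrant request failed: ${response.status}`);
  }
  const data = await response.json();
  return data.result.points;
}
```

A call such as `fetchPoints("<API key from above>", 5).then(console.log)` would then print the first five points, assuming the instance is reachable.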

Scatterplot

Click on visualize or access this link. Then enter this code and hit RUN:

{
  "limit": 5000,
  "color_by": "Copernicus_Service"
}

Graph

Click on graph or access this link. Then hit RUN.

Download Qdrant Snapshot

You can download the snapshot here and run it locally too.

If you like this project, ⭐ the repo or give a shoutout on social media. Let me know if you build something cool with it!
