Copernicus Services Semantic Search

App here

Tutorial here with lots of tips

A basic semantic search app based on 834 entries from Copernicus Services Catalogue chunked and indexed (mean embedding of all content chunks) in a ~2.4MB gzipped json with all-MiniLM-L6-v2. Enter any query and hit submit or enter. App loads ~27Mb of resources of data and scripts. The ML model runs entirely in the browser thanks to transformers.js.

Advanced search

If you'd like to search within the result's content, consider installing the Chrome extension of SemanticFinder, GitHub repo.

It finds the most relevant sections to your query in the actual content of the results by performing semantic search on the fly.

Data mining tutorial

The process of creating the data dump includingcan be repeated with the included Jupyter Notebook. It includes the whole processing pipeline:

data mining with requests and beautifulsoup
preprocessing in pandas
chunking the document text in smaller paragraphs of the right size for the ML model
creating embeddings for each chunk
calculating the mean embedding for each document
saving as gzipped json (small file size & easy and fast to read in js with pako.js)

You can re-run the process for updates (if you do so, please open a pull request for this repo or write so I can keep the data dump updated) or use other indexing models like the current MTEB leaders of the bge or gte family. You could also use a multilingual model to perform search queries in other languages than English. The current dump holds 834 entries from 21 October 2023.

Export all 834 entries for large LLM context

Just use this plain text file for copy & paste: https://raw.githubusercontent.com/do-me/copernicus-services-semantic-search/main/copernicus-services.txt.
In Gemini (https://aistudio.google.com), this text counts roughly 1.5 Mio tokens, so you can still add large prompts within the 2 Mio context window.

How to create this text file with JS

- Run a search and display all results (enter 1000 as limit). The results are ordered by similarity. - Open the browser console with F12 - Use this JS and execute it:

 document.querySelectorAll('.position-relative').forEach(function(element) {
    // Remove each element from the DOM
    element.remove();
  });

function tableToText() {
            // Select the table
            const table = document.getElementById('results-table');
            let resultText = '';

            // Loop through each row
            for (let row of table.rows) {
                // Loop through each cell in the row, excluding the "Similarity" column (index 2)
                for (let i = 0; i < row.cells.length; i++) {
                    if (i === 5) continue; // Skip Similarity

                    const cell = row.cells[i];

                    // For the first two columns, check if there are anchor tags
                    if ((i === 1) | (i===2)) {
                        const link = cell.querySelector('a');
                        if (link) {
                            // Use the href attribute of the anchor tag
                            resultText += link.href + '\n';
                        } else {
                            resultText += cell.innerText + '\n';  // Fallback to normal text
                        }
                    } else {
                        resultText += cell.innerText + '\n';  // For other columns, use innerText
                    }
                }
                resultText += '\n\n';  // Add two line breaks between rows
            }

            console.log(resultText);  // Log the result to the console
        }

// Call the function to convert table to text and log it
tableToText();

Copy the output with the copy button (e.g. in Chrome or select the whole text)

Qdrant Instance

I provide a public Qdrant instance over Qdrant Cloud that you can access to create nice plots for the collection via dimensionality reduction or graph-based links. Access the collection with the API key A-KWBxWl_8G3cnXv3MlpCThEDTdS6FYnTzn-h9k9TE95f5cvMUAGbQ under:

https://8f35f088-fc2e-426e-92a3-4f4f26f64812.europe-west3-0.gcp.cloud.qdrant.io:6333/dashboard#/collections/Copernicus_Services/

Scatterplot

Click on visualize or access this link. Then enter this code an hit RUN:

{
  "limit": 5000,
  "color_by": "Copernicus_Service"
}

Graph

Click on graph or access this link. Then hit RUN.

Download Qdrant Snapshot

You can download the snapshot here and run it locally too.

If you like this project, ⭐ the repo or give a shoutout on social media. Let me know if you build something cool with it!

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github/workflows		.github/workflows
Cop_shorted_Gemini_Flash_1_5_summary_llmlingua_0_8.txt		Cop_shorted_Gemini_Flash_1_5_summary_llmlingua_0_8.txt
LICENSE		LICENSE
README.md		README.md
copernicus-services-df.png		copernicus-services-df.png
copernicus-services-semantic-search-interface-dark.png		copernicus-services-semantic-search-interface-dark.png
copernicus-services.txt		copernicus-services.txt
copernicus_services_embeddings.json.gz		copernicus_services_embeddings.json.gz
copernicus_services_embeddings_bge-base.json.gz		copernicus_services_embeddings_bge-base.json.gz
copernicus_services_miner.ipynb		copernicus_services_miner.ipynb
index.html		index.html
main.css		main.css
main.js		main.js
semantic-finder-results.png		semantic-finder-results.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Copernicus Services Semantic Search

App here

Tutorial here with lots of tips

Advanced search

Data mining tutorial

Export all 834 entries for large LLM context

Qdrant Instance

Scatterplot

Graph

Download Qdrant Snapshot

About

Releases

Packages

Languages

License

do-me/copernicus-services-semantic-search

Folders and files

Latest commit

History

Repository files navigation

Copernicus Services Semantic Search

App here

Tutorial here with lots of tips

Advanced search

Data mining tutorial

Export all 834 entries for large LLM context

Qdrant Instance

Scatterplot

Graph

Download Qdrant Snapshot

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages