UserWarning: n_neighbors is larger than the dataset size #2070

gandrenacci · 2024-11-07T14:47:28Z


Hi,
When running a task with Giskard, the following warning is generated:

2024-11-07 15:35:25,209 pid:460389 MainThread giskard.rag  INFO     Finding topics in the knowledge base.
/envs/rag_test/lib/python3.9/site-packages/umap/umap_.py:2462: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
2024-11-07 15:35:25,554 pid:460389 MainThread giskard.rag  INFO     Found 1 topics in the knowledge base.

This warning occurs while finding topics in the knowledge base. It seems that the n_neighbors parameter is set to a value larger than the dataset size, causing UMAP to adjust it automatically to avoid errors.

Could you clarify if this is expected behavior, or if there’s a recommended parameter setting to avoid this warning?

Thanks

Answered by mattbit


mattbit · 2024-11-15T09:07:36Z

Maintainer

Hi @gandrenacci, in principle there is nothing to worry about: this happens when the knowledge base is small.

When finding topics in a knowledge base, we compute an embedding for each document chunk and then use UMAP to reduce the dimensionality of those embeddings before applying the HDBSCAN clustering algorithm. We do this step because HDBSCAN does not work well on high-dimensional data.
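A rough sketch of that two-step pipeline, assuming the `umap-learn` and `scikit-learn` packages are installed. The function name, `n_components=2`, and `min_cluster_size=2` are illustrative assumptions; only `n_neighbors=50` comes from this thread, and this is not Giskard's actual implementation.

```python
def find_topic_labels(embeddings, n_neighbors=50, min_cluster_size=2):
    """Reduce embeddings with UMAP, then cluster the result with HDBSCAN.

    `embeddings` is expected to be a 2D array-like of shape
    (n_chunks, embedding_dim). Returns one integer label per chunk;
    HDBSCAN uses -1 for points it treats as noise.
    """
    # Imported lazily so the sketch can be read without the packages installed.
    import umap
    from sklearn.cluster import HDBSCAN

    # Step 1: project the high-dimensional embeddings to a few dimensions.
    # n_components=2 is an assumption made for this sketch.
    reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=2)
    reduced = reducer.fit_transform(embeddings)

    # Step 2: density-based clustering on the reduced points.
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced)
```

Calling this on fewer rows than `n_neighbors` is what produces the warning quoted in the question.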

The UMAP parameters are hard-coded. We chose them empirically, based on what we observed to work on common knowledge bases; in particular, we set n_neighbors=50. This parameter controls the tradeoff between local and global structure, and since topic clustering is mostly concerned with coarse structure, we set it to a relatively large value.

When your knowledge base has fewer than 50 documents, UMAP raises this warning and truncates n_neighbors to the dataset size minus one, as the message says. UMAP then operates entirely on the global structure, potentially losing some local detail. For topic clustering this is fine, and you can safely ignore the warning.
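To make the truncation concrete, here is a small sketch that mirrors the warning text (`X.shape[0] - 1`) and shows one way to silence this specific message with Python's standard `warnings` module. The helper names are hypothetical, and the arithmetic is inferred from the warning message rather than copied from UMAP's source.

```python
import warnings

UMAP_N_NEIGHBORS = 50  # value quoted in this thread


def effective_n_neighbors(n_samples, requested=UMAP_N_NEIGHBORS):
    """Neighbor count after truncation to n_samples - 1, per the warning text."""
    return min(requested, max(n_samples - 1, 0))


def silence_small_kb_warning():
    """Ignore only the UMAP truncation warning; other warnings still show."""
    warnings.filterwarnings(
        "ignore",
        # Matched as a regex against the start of the warning message.
        message="n_neighbors is larger than the dataset size",
        category=UserWarning,
    )
```

For example, `effective_n_neighbors(12)` gives 11, while a knowledge base of 500 chunks keeps the requested 50. Call `silence_small_kb_warning()` before building the knowledge base if the message is just noise in your logs.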

It is worth double-checking whether your knowledge base is expected to be that small. Note that when you create the knowledge base with RAGET, it's better to provide already chunked documents than raw text.

For example, you may use langchain text splitters or any other method used by your RAG to obtain the chunks, and initialize the KnowledgeBase entity with those.
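As a minimal sketch of that flow: the splitter below is a simple fixed-size stand-in for whatever splitter your RAG uses (it is not a langchain API), and the `"text"` column name is an assumption. `KnowledgeBase.from_pandas` is used as shown in the Giskard RAGET docs, so check it against your installed version.

```python
def split_text(text, chunk_size=200, overlap=20):
    """Split text into fixed-size character chunks that overlap slightly."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is never pure overlap.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def build_knowledge_base(chunks):
    """Wrap pre-chunked text in a Giskard KnowledgeBase (one row per chunk)."""
    # Imported lazily; requires pandas and giskard to be installed.
    import pandas as pd
    from giskard.rag import KnowledgeBase

    df = pd.DataFrame({"text": chunks})  # "text" is an assumed column name
    return KnowledgeBase.from_pandas(df, columns=["text"])
```

Passing chunks rather than whole documents also means the knowledge base has more rows, which makes the n_neighbors warning less likely to appear.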

2 replies

gandrenacci Nov 15, 2024
Author

Thanks for your reply. Maybe this isn’t the most effective way to handle it, but I’ve been sending one document at a time so that I can adjust the number of questions based on the document's length. My entire knowledge base is small, fewer than 100 docs.

Yes, I am using a text splitter for chunking the documents.

mattbit Nov 15, 2024
Maintainer

I see. In that case the topic clustering is clearly negligible, and you can ignore the warning.

We don't have a way to generate a specific number of questions per document, but your workaround of using one document at a time should work well. On the other hand, we guarantee that when you pass multiple documents we generate roughly the same number of questions per document (so if you pass 10 docs and ask for 10 questions, we will generate 1 question per doc to ensure maximal coverage).
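To illustrate the arithmetic, here is a sketch with two hypothetical helpers: one spreads a question budget evenly across documents (the behavior described above, not Giskard's actual code), and one sizes the budget from a document's length for the one-document-at-a-time workaround. The `chars_per_question` threshold is an arbitrary assumption.

```python
import math


def even_split(num_questions, num_docs):
    """Spread num_questions across num_docs as evenly as possible."""
    if num_docs <= 0:
        raise ValueError("num_docs must be positive")
    base, extra = divmod(num_questions, num_docs)
    # The first `extra` documents get one additional question.
    return [base + 1 if i < extra else base for i in range(num_docs)]


def questions_for_length(n_chars, chars_per_question=2000, minimum=1):
    """Pick a question count for one document from its length in characters."""
    return max(minimum, math.ceil(n_chars / chars_per_question))
```

With 10 documents and 10 questions, `even_split(10, 10)` gives one question per document, matching the example above.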
