UserWarning: n_neighbors is larger than the dataset size #2070
Hi,
This warning occurs while finding topics in the knowledge base. It seems that the n_neighbors parameter is set to a value larger than the dataset size, causing UMAP to adjust it automatically to avoid errors. Could you clarify if this is expected behavior, or if there’s a recommended parameter setting to avoid this warning? Thanks
Hi @gandrenacci, in principle there is nothing to worry about: this happens when the knowledge base is small.
When finding topics in a knowledge base, we calculate the embedding for each document chunk and then use UMAP to reduce the dimensionality of the embeddings before applying the HDBSCAN clustering algorithm. We do this step because HDBSCAN does not work well on high-dimensional data.
The UMAP parameters are hard-coded, chosen empirically based on what we observed to work on common knowledge bases. In particular, we set n_neighbors=50. This parameter defines the tradeoff between local and global structure, and since in topic clustering we are mostly interested in coarse structures we set it to a rela…
When your knowledge base has fewer than 50 documents, UMAP raises this warning and resets the parameter. In this case UMAP operates fully on the global structure, potentially losing some local detail. For topic clustering this is OK, and you can safely ignore the warning.
It is worth double-checking whether you expect your knowledge base to be that small. Note that when you create the knowledge base with RAGET, it is better to provide already-chunked documents than raw text. For example, you may use langchain text splitters, or whatever method your RAG uses to obtain the chunks, and initialize the …
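For anyone who wants to see the effect in isolation, here is a minimal stdlib-only sketch. It does not call UMAP or Giskard. It assumes that the reducer falls back to n_samples - 1 neighbors whenever the dataset is not larger than the requested n_neighbors, and that the warning text starts with the message in this thread's title. The helper names and the exact wording after the semicolon are hypothetical, for illustration only.

```python
import warnings

# Assumption: mirrors the n_neighbors=50 mentioned in this thread; not read from the library.
REQUESTED_N_NEIGHBORS = 50

# Start of the warning text, as reported in this thread's title.
WARNING_TEXT = "n_neighbors is larger than the dataset size"


def effective_n_neighbors(n_documents: int, requested: int = REQUESTED_N_NEIGHBORS) -> int:
    """Hypothetical helper: neighborhood size actually usable for n_documents points.

    Assumes the reducer warns and falls back to n_documents - 1 when the
    dataset is not larger than the requested value.
    """
    if n_documents <= requested:
        warnings.warn(f"{WARNING_TEXT}; truncating to {n_documents - 1}", UserWarning)
        return n_documents - 1
    return requested


def quiet_effective_n_neighbors(n_documents: int) -> int:
    """Same computation, with only this specific warning filtered out."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", message=WARNING_TEXT, category=UserWarning)
        return effective_n_neighbors(n_documents)
```

With 20 document chunks the first helper warns and returns 19; with 200 chunks it returns 50 silently. If the warning is just noise in your logs, the same warnings.filterwarnings call can be placed around your own topic-finding step, since it only hides messages that start with that text.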