UserWarning: n_neighbors is larger than the dataset size #2070

Answered by mattbit
gandrenacci asked this question in Q&A
Hi @gandrenacci, in principle there is nothing to worry about: this warning appears when the knowledge base is small.

When finding topics in a knowledge base, we compute an embedding for each document chunk, then use UMAP to reduce the dimensionality of those embeddings before applying the HDBSCAN clustering algorithm. We add this reduction step because HDBSCAN does not work well on high-dimensional data.

The UMAP parameters are hard-coded, chosen empirically based on what we observed to work on common knowledge bases. In particular, we set n_neighbors=50. This parameter controls the tradeoff between local and global structure, and since topic clustering is mostly concerned with coarse structure, we set it to a rela…
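
To make the warning concrete, here is a minimal sketch of the fallback that the warning text describes: when the requested n_neighbors is at least the number of samples, the value is truncated to one less than the dataset size. This is not the library's code. The helper name `effective_n_neighbors` and the constant `DEFAULT_N_NEIGHBORS` are hypothetical, and the exact truncation rule is my reading of UMAP's behavior, so treat it as an assumption.

```python
import warnings

# Value quoted in the answer above; the constant name itself is hypothetical.
DEFAULT_N_NEIGHBORS = 50


def effective_n_neighbors(n_samples: int, requested: int = DEFAULT_N_NEIGHBORS) -> int:
    """Return the neighbor count that can actually be used for n_samples points.

    Assumption: mirrors the truncation described by the UMAP warning,
    i.e. fall back to n_samples - 1 when the request is too large.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to have any neighbors")
    if requested >= n_samples:
        warnings.warn(
            f"n_neighbors={requested} is larger than the dataset size; "
            f"truncating to {n_samples - 1}",
            UserWarning,
        )
        return n_samples - 1
    return requested
```

With a knowledge base of 20 chunks, `effective_n_neighbors(20)` emits a UserWarning and returns 19, while `effective_n_neighbors(500)` returns 50 silently. In other words, the warning only signals that a smaller neighborhood was used because the knowledge base has fewer chunks than the default neighborhood size.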

Answer selected by alexcombessie