How can you use this to cluster textual data #7
Comments
The usual way to cluster text documents is: first transform the documents into TF-IDF vectors, then run a clustering algorithm over those vectors using the cosine distance. As far as the TF-IDF transformation goes, you have to implement it yourself or look for existing implementations. For DBSCAN, you can use my fork (https://github.com/speedymrk9/spark_dbscan), which extends @alitouka's repo to allow the use of the cosine distance measure, which is the one you need. I hope I've been clear; I'm at your disposal for any doubts.
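The TF-IDF step mentioned above can be sketched in plain Python. This is a minimal illustration of the transformation, not the Spark implementation from either repository; the function name `tfidf_vectors` and the sparse dict representation are illustrative choices:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn a list of tokenized documents into TF-IDF vectors,
    each represented as a sparse {term: weight} dict."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by inverse document frequency.
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors
```

These sparse vectors are the "coordinates" a distance-based clusterer would consume; terms that occur in every document get weight zero, since log(n/n) = 0.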
Hey, thanks for the quick reply. But if all pairwise cosine distances are computed, the memory complexity would be O(n^2), right? What I meant is: is there any way to convert the text data into the coordinates your package requires, so that I can use your efficient implementation?
Sorry, but I have not understood what you said. Could you explain? The way to convert text documents into coordinates is the TF-IDF transformation.
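The cosine distance measure discussed in this thread can be sketched over the same sparse {term: weight} representation. This is a minimal pure-Python illustration of the metric, not the fork's actual code:

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two sparse
    vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention: an all-zero vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Note that DBSCAN itself only needs neighborhood queries, so the full n-by-n distance matrix never has to be materialized at once; distances can be computed on demand per point.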
Hello,
I'm interested to know whether there is any provision for clustering text data. I see that you take coordinates as input. Can you explain how to extend it to cluster a set of text documents?