How can you use this to cluster textual data #7
Comments
The usual way to cluster text documents is: first transform the documents into TF-IDF vectors, then run a clustering algorithm over those vectors using the cosine distance. As far as the TF-IDF transformation goes, you have to implement it yourself or look for existing implementations. For DBSCAN, you can use my fork (https://github.com/speedymrk9/spark_dbscan), which extends @alitouka's repo to allow the use of the cosine distance measure, which is the one you need. I hope I've been clear; I'm at your disposal for any doubts.
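The TF-IDF step mentioned above can be sketched in plain Python. This is a minimal illustration of the transformation, not the Spark implementation from either repository; the function name `tfidf_vectors` and the sparse dict representation are illustrative choices:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn a list of tokenized documents into TF-IDF vectors,
    each represented as a sparse {term: weight} dict."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by inverse document frequency.
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors
```

These sparse vectors are the "coordinates" a distance-based clusterer would consume; terms that occur in every document get weight zero, since log(n/n) = 0.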
Hey, thanks for the quick reply. But if all pairwise cosine distances are computed, the memory complexity would be O(n^2), right? What I meant is: is there any way to convert the text data into the coordinates your package requires, so that I can use your efficient implementation?
Sorry, but I have not understood what you said. Could you explain? The way to convert text documents into coordinates is the TF-IDF transformation.
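The cosine distance measure discussed in this thread can be sketched over the same sparse {term: weight} representation. This is a minimal pure-Python illustration of the metric, not the fork's actual code:

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two sparse
    vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention: an all-zero vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Note that DBSCAN itself only needs neighborhood queries, so the full n-by-n distance matrix never has to be materialized at once; distances can be computed on demand per point.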
Hello,
I'm interested to know whether there is any provision for clustering text data. I see that you take coordinates as input. Can you explain how to extend it to cluster a set of text documents?