wordcloud comparison behaviour needs possible rethink #1

kbenoit · 2020-10-21T07:50:49Z

Until now I had not fully realized the implications of our reliance on code from wordcloud::comparison.cloud(), which according to its man page does the following:

Let p_{i,j} be the rate at which word i occurs in document j, and p_j be the average across documents(∑_ip_{i,j}/ndocs). The size of each word is mapped to its maximum deviation ( max_i(p_{i,j}-p_j) ), and its angular position is determined by the document where that maximum occurs.

So words that occur at the same rate across partitions are not mapped, and each word is mapped to only one partition. For instance, if we compare three groups where two talk a lot about "x" and the third about "y", then group three will have "y" plotted for it, but only one of groups one and two will have "x". And if they use "x" at the same rates, neither will have it plotted.
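To make the mapping concrete, here is a minimal base R sketch of the computation as the man page describes it, applied to the same counts as the example dfm in this issue. It is an illustration of the description only, not the actual source of wordcloud or quanteda; the names `counts`, `rates`, `dev`, `size` and `group` are made up for this sketch.

```r
# Base R only; follows the man page wording, not necessarily the package source.
counts <- matrix(
  c(1, 2, 3, 2, 1,
    3, 2, 1, 2, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), letters[1:5])
)

# Rate at which each word occurs within each document (each row sums to 1)
rates <- counts / rowSums(counts)

# Average rate of each word across documents
mean_rates <- colMeans(rates)

# Deviation of each document's rate from the cross-document average
dev <- sweep(rates, 2, mean_rates)

# Size: the largest deviation per word; group: the document where it occurs
size <- apply(dev, 2, max)
group <- rownames(dev)[apply(dev, 2, which.max)]

round(size, 3)
group
```

Under these assumptions each word ends up in exactly one group: "b", "c" and "d" go to d1, "a" and "e" go to d2, and the size reflects only how far the rate in that document sits above the average.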

I can think of many reasons why we would want to change this behaviour, or at least provide alternative options.

library("quanteda")
## Package version: 2.1.2

dfmat <- as.dfm(
  matrix(c(
    1, 2, 3, 2, 1,
    3, 2, 1, 2, 3
  ),
  nrow = 2,
  dimnames = list(c("d1", "d2"), letters[1:5]), byrow = TRUE
  )
)
dfmat
## Document-feature matrix of: 2 documents, 5 features (0.0% sparse).
##     features
## docs a b c d e
##   d1 1 2 3 2 1
##   d2 3 2 1 2 3

# all are same size
textplot_wordcloud(dfmat, min_count = 1)

# three different sizes
textplot_wordcloud(dfmat[1, ], min_count = 1)

# empty because there is no "maximum deviation" across documents
textplot_wordcloud(dfmat[c(1, 1), ], min_count = 1, comparison = TRUE)
## Error in graphics::strwidth(word[i], cex = size[i]): invalid 'cex' value

# was this what we were expecting?
textplot_wordcloud(dfmat, min_count = 1, comparison = TRUE)
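
The `invalid 'cex' value` error for the two identical documents is consistent with every deviation being zero. The base R sketch below reproduces that arithmetic. The final rescaling step (dividing by the largest size) is an assumption about how sizes get normalised before plotting, not something confirmed from the package source, and all object names are hypothetical.

```r
# Two identical documents, as in dfmat[c(1, 1), ]
counts <- matrix(
  rep(c(1, 2, 3, 2, 1), times = 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d1_copy"), letters[1:5])
)

# Within-document rates, then deviations from the cross-document average
rates <- counts / rowSums(counts)
dev <- sweep(rates, 2, colMeans(rates))

# Largest deviation per word: zero everywhere, since the rows are identical
size <- apply(dev, 2, max)

# ASSUMPTION: sizes are rescaled relative to the largest one before plotting.
# With all sizes equal to zero this is 0 / 0, which is NaN in R.
scaled <- size / max(size)

size
scaled
```

If the plotting code does rescale this way, a `NaN` size would be passed on as `cex`, which would explain the error rather than an empty plot.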

kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020
