wordcloud comparison behaviour needs possible rethink #1

Open
kbenoit opened this issue Oct 21, 2020 · 0 comments

kbenoit commented Oct 21, 2020

Until now I had not fully realized the implications of our reliance on code from wordcloud::comparison.cloud(), which according to its man page does the following:

Let p_{i,j} be the rate at which word i occurs in document j, and p_j be the average across documents(∑_ip_{i,j}/ndocs). The size of each word is mapped to its maximum deviation ( max_i(p_{i,j}-p_j) ), and its angular position is determined by the document where that maximum occurs.

So words that occur at the same rate across partitions are not mapped, and each word is mapped to only one partition. For instance, when comparing three groups where two talk a lot about "x" and the third about "y", group three will have "y" plotted for it, but only one of groups one and two will have "x". And if they use "x" at the same rates, neither will have it plotted.
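
To make the quoted rule concrete, here is a minimal base-R sketch of it, applied to the same counts as the example matrix used further down in this issue. It is an illustration of the man page description, not the code of wordcloud or quanteda: the helper name `max_deviation()` is hypothetical, and it assumes that the "average across documents" is taken per word and that the maximum is taken over documents.

```r
# Hypothetical helper illustrating the man-page rule; not package code.
# Rows are documents and columns are words, as in a dfm.
counts <- matrix(
  c(1, 2, 3, 2, 1,
    3, 2, 1, 2, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), letters[1:5])
)

max_deviation <- function(x) {
  # rate at which each word occurs within each document
  rates <- x / rowSums(x)
  # deviation of each rate from that word's mean rate across documents
  dev <- sweep(rates, 2, colMeans(rates))
  data.frame(
    word = colnames(x),
    # word size: the largest deviation over documents
    size = apply(dev, 2, max),
    # partition: the (first) document where that largest deviation occurs
    partition = rownames(x)[apply(dev, 2, which.max)]
  )
}

max_deviation(counts)
max_deviation(counts[c(1, 1), ])  # two identical documents
```

Under these assumptions, the first call gives sizes of 8/99 for "a" and "e" (assigned to d2), 2/99 for "b" and "d", and 12/99 for "c" (all assigned to d1). The second call gives a size of zero for every word, which is consistent with the invalid 'cex' error in the example below.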

I can think of many reasons why we would want to change this behaviour, or at least provide alternative options.

library("quanteda")
## Package version: 2.1.2

dfmat <- as.dfm(
  matrix(c(
    1, 2, 3, 2, 1,
    3, 2, 1, 2, 3
  ),
  nrow = 2,
  dimnames = list(c("d1", "d2"), letters[1:5]), byrow = TRUE
  )
)
dfmat
## Document-feature matrix of: 2 documents, 5 features (0.0% sparse).
##     features
## docs a b c d e
##   d1 1 2 3 2 1
##   d2 3 2 1 2 3

# all are same size
textplot_wordcloud(dfmat, min_count = 1)

# three different sizes
textplot_wordcloud(dfmat[1, ], min_count = 1)

# errors because there is no "maximum deviation" across identical documents
textplot_wordcloud(dfmat[c(1, 1), ], min_count = 1, comparison = TRUE)
## Error in graphics::strwidth(word[i], cex = size[i]): invalid 'cex' value

# was this what we were expecting?
textplot_wordcloud(dfmat, min_count = 1, comparison = TRUE)
kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020