I noticed that the Spanish NRC dictionary contains multiple entries for the same word under the same sentiment. I looked into the structure of the original translated dictionaries at http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm and found that the same Spanish word is used to translate several English words.
For instance, the word "asesino" serves as the translation for "assassin", "cutthroat", "murderer", "murderous" and "slayer". Because of how the data is structured, each of those entries is counted separately, so the package's functions return inflated values.
The output for get_nrc_sentiment(char_v = c("mira, un asesino"), language = "spanish")
is anger = 5, disgust = 3, fear = 5, sadness = 4, surprise = 2 and negative = 5; while the output for get_nrc_sentiment(char_v = c("look, an assassin"), language = "english") is anger = 1, fear = 1, sadness = 1 and negative = 1.
I think it'd be good if you could fix this, as the package is quite useful. I would expect the same issue to appear in the other translations.
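To make the mechanism concrete, here is a minimal sketch in base R. It assumes a toy table, `toy_dict`, built only from the counts reported above; it is not the real NRC dictionary, and the column names `word`, `sentiment` and `value` are taken from the snippets in this thread rather than checked against the package internals.

```r
# Toy stand-in for the Spanish NRC rows for "asesino" (NOT the real dictionary).
# The number of rows per sentiment mirrors the output reported above.
counts <- c(anger = 5, disgust = 3, fear = 5,
            sadness = 4, surprise = 2, negative = 5)

toy_dict <- data.frame(
  word = "asesino",
  sentiment = rep(names(counts), times = counts),
  value = 1,
  stringsAsFactors = FALSE
)

# Rows that repeat an earlier (word, sentiment) pair are the duplicates.
is_dup <- duplicated(toy_dict[, c("word", "sentiment")])
n_duplicates <- sum(is_dup)

# Summing per sentiment counts every duplicate row once more.
summed <- aggregate(value ~ sentiment, data = toy_dict, FUN = sum)
print(summed)
```

In this toy setting a single occurrence of the word yields a score equal to the number of duplicate rows for each sentiment, which matches the pattern in the Spanish output.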
That is an interesting bug and not easily dealt with since it is a problem of the dictionary having multiple instances of the word. The get_nrc_sentiment() function implements get_nrc_values() and that seems to be where the issue is most easily seen. E.g.
get_nrc_values("asesino", "spanish")
produces the results you describe. To see more precisely what is going on:
library(syuzhet)
library(dplyr) # needed for filter() on a data frame
sp_dict <- get_sentiment_dictionary(dictionary="nrc",language="spanish")
filter(sp_dict, word == "asesino")
Since the goal of syuzhet is not translation, I think the solution is to simply deduplicate the dictionary and only allow 1 instance of, for example, "asesino" as "fear" instead of the five instances that exist now. But rather than tweak the dictionary, I think a simple solution is to use mean instead of sum in the get_nrc_values() function. Instead of the line:
data <- dplyr::summarise_at(data, "value", sum)
use:
data <- dplyr::summarise_at(data, "value", mean)
Which has the same effect as this:
sp_dict <- get_sentiment_dictionary(dictionary="nrc",language="spanish")
filter(sp_dict, word == "asesino") %>%
group_by(sentiment) %>%
summarize(mean(value))
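The two options discussed above (averaging the duplicates versus deduplicating the dictionary) can be compared on a toy table. This is only a sketch in base R under stated assumptions: `toy_dict` is a hypothetical stand-in built from the counts reported in the issue, not the real dictionary, and `aggregate()` is used here in place of the dplyr calls so the example has no dependencies.

```r
# Toy stand-in for the Spanish NRC rows for "asesino" (NOT the real dictionary).
counts <- c(anger = 5, disgust = 3, fear = 5,
            sadness = 4, surprise = 2, negative = 5)

toy_dict <- data.frame(
  word = "asesino",
  sentiment = rep(names(counts), times = counts),
  value = 1,
  stringsAsFactors = FALSE
)

# Option A: average the duplicate rows per sentiment (mean instead of sum).
by_mean <- aggregate(value ~ sentiment, data = toy_dict, FUN = mean)

# Option B: drop exact duplicate rows first, then sum as before.
deduped <- unique(toy_dict)
by_dedup_sum <- aggregate(value ~ sentiment, data = deduped, FUN = sum)

print(by_mean)
print(by_dedup_sum)
```

For a single word both options give 1 per sentiment. They would diverge if rows for several distinct words were pooled in one group: a mean would average across those words, while deduplicating and then summing would still count each distinct word once. Whether that matters depends on how the function groups its rows, which this thread does not show.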