Problem with translated dictionaries (spanish) #31

fronoso · 2019-10-11T14:48:06Z

I realized that the Spanish NRC dictionary has multiple entries for the same word under the same sentiment. I looked into the structure of the original translated dictionaries at http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm and figured out that the same Spanish word is used to translate several English words.

For instance, the word "asesino" serves as the translation for the words "assassin", "cutthroat", "murderer", "murderous" and "slayer", which in turn, due to the data structure, returns incorrect data when used with the package's functions.

The output for get_nrc_sentiment(char_v = c("mira, un asesino"), language = "spanish") is anger = 5, disgust = 3, fear = 5, sadness = 4, surprise = 2 and negative = 5, while the output for get_nrc_sentiment(char_v = c("look, an assassin"), language = "english") is anger = 1, fear = 1, sadness = 1 and negative = 1.
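
To make the mechanism concrete, here is a minimal base-R sketch. The `toy_dict` data frame is a hypothetical stand-in built only from the counts reported above, not the real lexicon, and the filter-then-sum step is an assumption about how the package aggregates, not its actual code.

```r
# Toy stand-in for the Spanish NRC lexicon (hypothetical rows, not the real data).
# Each English source word contributes its own row, so "asesino" repeats
# several times under the same sentiment.
toy_dict <- data.frame(
  word      = "asesino",
  sentiment = rep(c("anger", "disgust", "fear", "sadness", "surprise", "negative"),
                  times = c(5, 3, 5, 4, 2, 5)),
  value     = 1
)

# Keep the rows matching the tokens of the sentence, then sum per sentiment.
tokens <- c("mira", "un", "asesino")
hits   <- toy_dict[toy_dict$word %in% tokens, ]
summed <- tapply(hits$value, hits$sentiment, sum)

# Every duplicate row is counted, so one occurrence of "asesino" scores
# far more than 1 per sentiment.
print(summed)
```

With a dictionary shaped like this, a single token is counted once per duplicate row, which would explain the inflated scores.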

I think it'd be great if you could fix this, since the package is quite useful. I would expect this same issue to appear in the other translations.


mjockers · 2019-10-23T13:25:10Z

That is an interesting bug, and not easily dealt with, since the problem is that the dictionary itself has multiple instances of the word. The get_nrc_sentiment() function calls get_nrc_values(), and that seems to be where the issue is most easily seen. E.g.

get_nrc_values("asesino", "spanish")

produces the results you describe. To see more precisely what is going on:

library(syuzhet)
library(dplyr)
sp_dict <- get_sentiment_dictionary(dictionary = "nrc", language = "spanish")
filter(sp_dict, word == "asesino")

Since the goal of syuzhet is not translation, one solution is to simply deduplicate the dictionary and only allow 1 instance of, for example, "asesino" as "fear" instead of the five instances that exist now. But rather than tweak the dictionary, I think a simpler solution is to use mean instead of sum in the get_nrc_values() function. Instead of the line:

data <- dplyr::summarise_at(data, "value", sum)

use:

data <- dplyr::summarise_at(data, "value", mean)

This has the same effect as:

sp_dict <- get_sentiment_dictionary(dictionary = "nrc", language = "spanish")
filter(sp_dict, word == "asesino") %>%
  group_by(sentiment) %>%
  summarize(mean(value))
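
As a rough way to compare the two options discussed here (mean versus deduplication), the following base-R sketch uses a hypothetical toy lexicon. The rows, the word "odio" and the `aggregate_hits()` helper are made up for illustration, and the sketch assumes the aggregation runs over all matched words at once, which may not match the package internals.

```r
# Toy lexicon (hypothetical rows): "asesino" is duplicated, "odio" is not.
toy_dict <- data.frame(
  word      = c(rep("asesino", 10), "odio"),
  sentiment = c(rep("anger", 5), rep("fear", 5), "anger"),
  value     = 1
)

# Hypothetical helper: keep rows for the given words, aggregate per sentiment.
aggregate_hits <- function(dict, words, fun) {
  hits <- dict[dict$word %in% words, ]
  tapply(hits$value, hits$sentiment, fun)
}

words <- c("asesino", "odio")

# Current behaviour: duplicates inflate the totals.
by_sum <- aggregate_hits(toy_dict, words, sum)

# Proposed change: every sentiment that matches at all collapses to 1,
# even when two different words carry it.
by_mean <- aggregate_hits(toy_dict, words, mean)

# Alternative: drop duplicate (word, sentiment, value) rows, then sum,
# so each distinct word still counts once.
by_dedup <- aggregate_hits(unique(toy_dict), words, sum)

print(rbind(sum = by_sum, mean = by_mean, dedup = by_dedup))
```

Under these assumptions, mean and deduplication agree for a single word but diverge when two distinct words share a sentiment, which may be worth checking when testing the change.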

I'll need to test this solution.
