Problem with translated dictionaries (spanish) #31

Open
fronoso opened this issue Oct 11, 2019 · 1 comment

fronoso commented Oct 11, 2019

I noticed that the Spanish NRC dictionary contains multiple entries for the same word under the same sentiment. I looked into the structure of the original translated dictionaries at http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm and found that a single Spanish word is often used as the translation of several English words.

For instance, the word "asesino" serves as the translation for "assassin", "cutthroat", "murderer", "murderous" and "slayer". Because the dictionary keeps one row per English source word, the Spanish word ends up with several rows per sentiment, and the package's functions return inflated values.

The output for get_nrc_sentiment(char_v = c("mira, un asesino"), language = "spanish") is anger = 5, disgust = 3, fear = 5, sadness = 4, surprise = 2 and negative = 5, while the output for get_nrc_sentiment(char_v = c("look, an assassin"), language = "english") is anger = 1, fear = 1, sadness = 1 and negative = 1.
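To make the mechanism concrete, here is a minimal base R sketch of how duplicated rows inflate a summed score. The data frame is a toy stand-in, not the real lexicon: the rows, the `source_en` column and the names `toy_dict`, `hits` and `fear_sum` are assumptions made up for illustration.

```r
# Toy stand-in for the Spanish NRC lexicon (hypothetical rows, not the real data).
# Each English source word contributes its own row for the same Spanish word.
toy_dict <- data.frame(
  word      = rep("asesino", 5),
  source_en = c("assassin", "cutthroat", "murderer", "murderous", "slayer"),
  sentiment = rep("fear", 5),
  value     = rep(1, 5),
  stringsAsFactors = FALSE
)

# Look up the word, then sum the values for one sentiment.
hits <- toy_dict[toy_dict$word == "asesino", ]
fear_sum <- sum(hits$value[hits$sentiment == "fear"])

# Every duplicated row is counted, so one occurrence of the word scores 5.
print(fear_sum)
```

In this toy setup a single occurrence of "asesino" scores 5 for fear, which matches the fear = 5 reported above.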

I think it'd be good if you could fix this, since the package is quite useful. I would expect the same issue to appear in the other translations.

mjockers (Owner) commented

That is an interesting bug, and not an easy one to deal with, since the problem is that the dictionary contains multiple instances of the same word. The get_nrc_sentiment() function calls get_nrc_values(), and that seems to be where the issue is most easily seen. E.g.

get_nrc_values("asesino", "spanish")

produces the results you describe. To see more precisely what is going on:

library(syuzhet)
library(dplyr)  # provides filter()
sp_dict <- get_sentiment_dictionary(dictionary="nrc",language="spanish")
filter(sp_dict, word == "asesino")

Since the goal of syuzhet is not translation, one solution would be to deduplicate the dictionary and allow only one instance of, for example, "asesino" as "fear" instead of the five instances that exist now. But rather than tweak the dictionary, I think a simpler solution is to use mean instead of sum in the get_nrc_values() function. Instead of the line:

data <- dplyr::summarise_at(data, "value", sum)

the function would use:

data <- dplyr::summarise_at(data, "value", mean)

This has the same effect as the following (with dplyr loaded):

sp_dict <- get_sentiment_dictionary(dictionary="nrc",language="spanish")
filter(sp_dict, word == "asesino") %>%
  group_by(sentiment) %>%
  summarize(mean(value))

I'll need to test this solution.
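As a rough starting point for that test, the sketch below compares three aggregations on a toy dictionary in base R: sum, mean, and sum after dropping duplicate rows. Everything here is an assumption for illustration: the rows are invented (including the "miedo" entry), and the code does not reproduce the internals of get_nrc_values().

```r
# Toy dictionary (hypothetical rows): "asesino" is duplicated five times,
# "miedo" appears once. Both are tagged "fear" with value 1.
toy_dict <- data.frame(
  word      = c(rep("asesino", 5), "miedo"),
  sentiment = rep("fear", 6),
  value     = rep(1, 6),
  stringsAsFactors = FALSE
)

# Words of a hypothetical input sentence that match the dictionary.
words <- c("asesino", "miedo")
hits  <- toy_dict[toy_dict$word %in% words, ]

# 1. Sum per sentiment: duplicated rows are all counted.
by_sum <- aggregate(value ~ sentiment, data = hits, FUN = sum)

# 2. Mean per sentiment: duplicates no longer inflate the score,
#    but distinct matching words are averaged together as well.
by_mean <- aggregate(value ~ sentiment, data = hits, FUN = mean)

# 3. Drop duplicate (word, sentiment, value) rows first, then sum.
dedup        <- unique(hits[, c("word", "sentiment", "value")])
by_dedup_sum <- aggregate(value ~ sentiment, data = dedup, FUN = sum)

print(by_sum)
print(by_mean)
print(by_dedup_sum)
```

In this toy case the three approaches give 6, 1 and 2 for fear. The gap between the mean and the deduplicated sum suggests that inputs containing more than one matching word are worth including in any test of the change.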
