
issues with get_nrc_sentiment #36

Open

sutravekruttika opened this issue Jan 9, 2021 · 6 comments
@sutravekruttika

Hi,
I am trying to perform sentiment analysis on Twitter data using the NRC lexicon; however, when I use get_nrc_sentiment it takes very long to compute. I do have a huge dataset.

How can I reduce the computation time?
Please advise. Also, I am new to R.
Thank you.

@FelixPeckitt

Hi, thanks for raising your issue, and welcome to R! Do you have a sample of code that shows the issue you are facing?

@sutravekruttika

sutravekruttika commented Jan 9, 2021

I am using the following code. I have about a million tweets.
f_clean_tweets <- function(tweets) {
  # drop encoding residue such as <U+2026> first, while the angle brackets still exist
  clean_tweets <- gsub('<.*?>', '', enc2native(tweets))
  # strip retweet markers ("RT @user", "via @user")
  clean_tweets <- gsub('(RT|via)((?:\\b\\W*@\\w+)+)', '', clean_tweets)
  # strip remaining @mentions
  clean_tweets <- gsub('@\\w+', '', clean_tweets)
  # strip punctuation and digits
  clean_tweets <- gsub('[[:punct:]]', '', clean_tweets)
  clean_tweets <- gsub('[[:digit:]]', '', clean_tweets)
  # strip URLs; punctuation is already gone, so e.g. "httpstcoxyz" matches http\\w+
  clean_tweets <- gsub('http\\w+', '', clean_tweets)
  # collapse runs of spaces/tabs to a single space, not '', which would glue words together
  clean_tweets <- gsub('[ \t]{2,}', ' ', clean_tweets)
  # trim leading/trailing whitespace
  clean_tweets <- gsub('^\\s+|\\s+$', '', clean_tweets)
  tolower(clean_tweets)
}

library(syuzhet)   # provides get_nrc_sentiment

text_data <- df_new$text
clean_tweets <- f_clean_tweets(text_data)
emotions <- get_nrc_sentiment(clean_tweets)

@FelixPeckitt

Thanks for the code sample. I'm assuming the cleansing is working fine and it's get_nrc_sentiment that is taking up most of the time. Is that correct, and can you run the code on a subset of your million tweets?

Depending on what machine you are running your code on, you could partition the tweets into groups, perhaps by starting letter or a range of letters, and then process the partitions in parallel; see https://www.r-bloggers.com/2017/10/running-r-code-in-parallel/ and the sketch below.
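
A minimal sketch of that idea using the base parallel package, assuming clean_tweets is the character vector from the code above (splitting by index rather than by starting letter; the core count is illustrative):

library(parallel)
library(syuzhet)

n_cores <- max(1, detectCores() - 1)
# split the vector into one roughly equal chunk per core
chunks <- split(clean_tweets, cut(seq_along(clean_tweets), n_cores, labels = FALSE))

cl <- makeCluster(n_cores)
clusterEvalQ(cl, library(syuzhet))      # load syuzhet on each worker
results <- parLapply(cl, chunks, get_nrc_sentiment)
stopCluster(cl)

emotions <- do.call(rbind, results)     # one row per tweet, in original order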
Alternatively, if you are struggling to find the hardware, the simple approach would be to run one partition at a time, saving the results to file or to the workspace, and then combine them afterwards (sketched below). This has the advantage of letting you verify that your code is running, but it requires more effort on your part.
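
A rough sketch of that chunk-at-a-time approach; the chunk size and file naming are illustrative:

chunk_size <- 10000
starts <- seq(1, length(clean_tweets), by = chunk_size)

for (i in seq_along(starts)) {
  idx <- starts[i]:min(starts[i] + chunk_size - 1, length(clean_tweets))
  res <- get_nrc_sentiment(clean_tweets[idx])
  saveRDS(res, sprintf("nrc_chunk_%04d.rds", i))   # one file per chunk
}

# later: read the chunk files back in order and stack them
files <- sort(list.files(pattern = "^nrc_chunk_\\d+\\.rds$"))
emotions <- do.call(rbind, lapply(files, readRDS))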

Apart from running this code on a more powerful cloud instance, all I can suggest is leaving it to run overnight.

I hope this helps!

@sutravekruttika

Yes, the cleansing is fine. get_nrc_sentiment took hours to complete on a subset of my data (~200k tweets); I got the results when I left it running for a couple of hours. It looks like I will just repeat this process on small chunks of data. Thank you for pointing me in the right direction.

@FelixPeckitt
Copy link

No problem. If you come up with something that helps, do post a snippet back so it can help others.

@luisignaciomenendez

> Yes, the cleansing is fine. get_nrc_sentiment took hours to complete on a subset of my data (~200k tweets); I got the results when I left it running for a couple of hours. It looks like I will just repeat this process on small chunks of data. Thank you for pointing me in the right direction.

Hello

I am facing the same issue here. My data consists of approximately 340,000 tweets, and I am trying to use get_nrc_sentiment on it. I left it running overnight but it didn't finish, so I timed the code to estimate the total runtime, and the estimate comes out to about 14 days in my case.

Since you mentioned in your comment that it took some hours, I wondered whether something is wrong on my end, or whether someone has come up with a solution (also, has anyone tried parallelisation successfully?). Is it normal for it to take this long?

This is my current code: emotions = get_nrc_sentiment(blm2$stripped_text), which works out to about 3.6 seconds per tweet.
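
For what it's worth, the 14-day figure is consistent with that per-tweet rate; one quick way to estimate it is to time a small sample and extrapolate (the sample size here is arbitrary):

library(syuzhet)

sample_n <- 100
elapsed <- system.time(get_nrc_sentiment(blm2$stripped_text[1:sample_n]))["elapsed"]
per_tweet <- elapsed / sample_n

# projected total runtime in days
per_tweet * length(blm2$stripped_text) / 86400
# at 3.6 s/tweet: 340000 * 3.6 = 1224000 s, i.e. roughly 14 days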

Thanks in advance
