
issues with get_nrc_sentiment #36

Open

sutravekruttika opened this issue Jan 9, 2021 · 6 comments
@sutravekruttika

Hi,
I am trying to perform sentiment analysis on Twitter data using the NRC lexicon; however, when I use get_nrc_sentiment it takes very long to compute. I do have a huge dataset.

How can I reduce the computation time?
Please advise. Also, I am new to R.
Thank you.

@FelixPeckitt

Hi, thanks for raising your issue, and welcome to R! Do you have a sample of code that shows the issue you are facing?

@sutravekruttika

sutravekruttika commented Jan 9, 2021

I am using the following code. I have about a million tweets.
f_clean_tweets <- function(tweets) {
  # drop encoding residue such as <U+2026> first, while the angle brackets still exist
  clean_tweets <- gsub('<.*?>', '', enc2native(tweets))
  # strip retweet markers ("RT @user", "via @user")
  clean_tweets <- gsub('(RT|via)((?:\\b\\W*@\\w+)+)', '', clean_tweets)
  # strip remaining @mentions
  clean_tweets <- gsub('@\\w+', '', clean_tweets)
  # strip punctuation and digits
  clean_tweets <- gsub('[[:punct:]]', '', clean_tweets)
  clean_tweets <- gsub('[[:digit:]]', '', clean_tweets)
  # strip URLs; punctuation is already gone, so e.g. "httpstcoxyz" matches http\\w+
  clean_tweets <- gsub('http\\w+', '', clean_tweets)
  # collapse runs of spaces/tabs to a single space, not '', which would glue words together
  clean_tweets <- gsub('[ \t]{2,}', ' ', clean_tweets)
  # trim leading/trailing whitespace
  clean_tweets <- gsub('^\\s+|\\s+$', '', clean_tweets)
  tolower(clean_tweets)
}

library(syuzhet)   # provides get_nrc_sentiment

text_data <- df_new$text
clean_tweets <- f_clean_tweets(text_data)
emotions <- get_nrc_sentiment(clean_tweets)

@FelixPeckitt

Thanks for the code sample. I'm assuming the cleansing is working fine and it's get_nrc_sentiment that is taking up most of the time. Is that correct, and can you run the code on a subset of your million tweets?

Depending on what machine you are running your code on, you could partition the tweets into groups, perhaps by starting letter or a range of letters, and then process the partitions in parallel; see https://www.r-bloggers.com/2017/10/running-r-code-in-parallel/ and the sketch below.
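
A minimal sketch of that idea using the base parallel package, assuming clean_tweets is the character vector from the code above (splitting by index rather than by starting letter; the core count is illustrative):

library(parallel)
library(syuzhet)

n_cores <- max(1, detectCores() - 1)
# split the vector into one roughly equal chunk per core
chunks <- split(clean_tweets, cut(seq_along(clean_tweets), n_cores, labels = FALSE))

cl <- makeCluster(n_cores)
clusterEvalQ(cl, library(syuzhet))      # load syuzhet on each worker
results <- parLapply(cl, chunks, get_nrc_sentiment)
stopCluster(cl)

emotions <- do.call(rbind, results)     # one row per tweet, in original order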
Alternatively, if you are struggling to find the hardware, the simple approach would be to run one partition at a time, saving the results to file or to the workspace, and then combine them afterwards (sketched below). This has the advantage of letting you verify that your code is running, but it requires more effort on your part.
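
A rough sketch of that chunk-at-a-time approach; the chunk size and file naming are illustrative:

chunk_size <- 10000
starts <- seq(1, length(clean_tweets), by = chunk_size)

for (i in seq_along(starts)) {
  idx <- starts[i]:min(starts[i] + chunk_size - 1, length(clean_tweets))
  res <- get_nrc_sentiment(clean_tweets[idx])
  saveRDS(res, sprintf("nrc_chunk_%04d.rds", i))   # one file per chunk
}

# later: read the chunk files back in order and stack them
files <- sort(list.files(pattern = "^nrc_chunk_\\d+\\.rds$"))
emotions <- do.call(rbind, lapply(files, readRDS))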

Apart from running this code on a more powerful cloud instance, all I can suggest is leaving it to run overnight.

I hope this helps!

@sutravekruttika

Yes, the cleansing is fine. get_nrc_sentiment took hours to complete on a subset of my data (~200k tweets); I got the results when I left it running for a couple of hours. It looks like I will just repeat this process on small chunks of data. Thank you for pointing me in the right direction.

@FelixPeckitt
Copy link

No problem. If you come up with something that helps, do post a snippet back so it can help others.

@luisignaciomenendez

> Yes, the cleansing is fine. get_nrc_sentiment took hours to complete on a subset of my data (~200k tweets); I got the results when I left it running for a couple of hours. It looks like I will just repeat this process on small chunks of data. Thank you for pointing me in the right direction.

Hello

I am facing the same issue here. My data consists of approximately 340,000 tweets, and I am trying to use get_nrc_sentiment on it. I left it running overnight but it didn't finish, so I timed the code to estimate the total runtime, and the estimate comes out to about 14 days in my case.

Since you mentioned in your comment that it took some hours, I wondered whether something is wrong on my end, or whether someone has come up with a solution (also, has anyone tried parallelisation successfully?). Is it normal for it to take this long?

This is my current code: emotions = get_nrc_sentiment(blm2$stripped_text), which works out to about 3.6 seconds per tweet.
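
For what it's worth, the 14-day figure is consistent with that per-tweet rate; one quick way to estimate it is to time a small sample and extrapolate (the sample size here is arbitrary):

library(syuzhet)

sample_n <- 100
elapsed <- system.time(get_nrc_sentiment(blm2$stripped_text[1:sample_n]))["elapsed"]
per_tweet <- elapsed / sample_n

# projected total runtime in days
per_tweet * length(blm2$stripped_text) / 86400
# at 3.6 s/tweet: 340000 * 3.6 = 1224000 s, i.e. roughly 14 days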

Thanks in advance
