This repository has been archived by the owner on Jul 8, 2024. It is now read-only.

HTTP Error, Gives 404 but the URL is working #98

Open
sagefuentes opened this issue Sep 18, 2020 · 144 comments

Comments

@sagefuentes

Hi, I had a script running over the past weeks and earlier today it stopped working. I keep receiving HTTPError 404, but the provided link in the errors still brings me to a valid page.
My code is below (all mentioned variables are set, and debugging shows the error happens specifically in the TweetManager call):
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(term)\
    .setMaxTweets(max_count)\
    .setSince(begin_timeframe)\
    .setUntil(end_timeframe)
scraped_tweets = got.manager.TweetManager.getTweets(tweetCriteria)

The error message for this is the standard 404 error
"An error occured during an HTTP request: HTTP Error 404: Not Found
Try to open in browser:" followed by the valid link

As I have changed nothing in the project folder, I am wondering whether something has changed in my configuration more than anything else, but also whether others are experiencing this.

@alberto-valdes

alberto-valdes commented Sep 18, 2020

Hello @sagefuentes, I'm dealing with the exact same issue. I have also been downloading tweets for the past weeks, and it suddenly stopped working, giving me error 404 with a valid link.

I've no idea what might be the cause...

@caiyishu

Same for me. I also suddenly ran into this problem today, but everything worked fine yesterday.

@taoyudong

I am dealing with the same issue here. This is something new today, caused by some change or bug on the Twitter server side. If you run the command with debug=True, you can see that the URL used to get tweets is no longer available. Looking for a solution now.
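For anyone who wants to see the failing request from Python rather than the CLI, a minimal sketch (assuming the setup from the first comment; if I'm reading the 0.0.11 source right, debug is a keyword argument of getTweets):

import GetOldTweets3 as got

tweetCriteria = got.manager.TweetCriteria().setQuerySearch('test').setMaxTweets(1)
# With debug=True the request URL and headers are printed before the HTTPError surfaces
tweets = got.manager.TweetManager.getTweets(tweetCriteria, debug=True)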

@mwaters166

Also started having the same issue today.

@MithilaGuha

MithilaGuha commented Sep 18, 2020

I'm having the same issue as well! Does anyone have a solution for it?

@baraths92

Yes, I am having the same issue. I guess everyone is having it.

@alastairrushworth

I'm not sure if it is related to this issue, but some of the user_agents seem to be out of date

    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    ]

@stevedwards

same

@Sebastokratos42

Seems to be a "bigger" problem? Other scrapers are having trouble too.
twintproject/twint#915 (comment)

@Daviey

Daviey commented Sep 18, 2020

Here is the output with debug enabled. It shows the actual URL being called, and it seems that Twitter has removed the /i/search/timeline endpoint. :(

https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Host: twitter.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
X-Requested-With: XMLHttpRequest
Referer: https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Connection: keep-alive
An error occured during an HTTP request: HTTP Error 404: Not Found
Try to open in browser: https://twitter.com/search?q=%20from%3AREDACTED&src=typd
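You can confirm the endpoint itself is gone with a couple of lines of standard library, independent of GOT3 (a sketch; any query string behaves the same for me):

import urllib.request, urllib.error

try:
    urllib.request.urlopen('https://twitter.com/i/search/timeline')
except urllib.error.HTTPError as e:
    print(e.code)  # 404 - the endpoint itself is gone, not just our query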

@danielo93

Same problem, damn

@inactivist

inactivist commented Sep 18, 2020

I'm not sure if it is related to this issue, but some of the user_agents seem to be out of date

I forked and created a branch to allow a user-specified UA; using samples from my current browser doesn't fix the problem.

I notice the search and referrer URL shown in --debug output (https://twitter.com/i/search/timeline) returns a 404 error:

$ GetOldTweets3 --username twitter --debug 
/home/inactivist/.local/bin/GetOldTweets3 --username twitter --debug
GetOldTweets3 0.0.11
Downloading tweets...
https://twitter.com/i/search/timeline?f=tweets&vertical=news&q=from%3Atwitter&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Host: twitter.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
X-Requested-With: XMLHttpRequest
Referer: https://twitter.com/i/search/timeline?f=tweets&vertical=news&q=from%3Atwitter&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Connection: keep-alive
An error occured during an HTTP request: HTTP Error 404: Not Found
Try to open in browser: https://twitter.com/search?q=%20from%3Atwitter&src=typd
$ curl -I https://twitter.com/i/search/timeline
HTTP/2 404 
[snip]

EDIT: The URL used for the internal search and the one shown in the exception message aren't the same...

@baraths92

I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. The 404 error is gone, but now there is a 400 Bad Request error.

@herdemo

herdemo commented Sep 18, 2020

Unfortunately I have the same problem. I hope we find a solution as soon as possible.

@inactivist

inactivist commented Sep 18, 2020

I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. The 404 error is gone, but now there is a 400 Bad Request error.

Switching to mobile.twitter.com/search and using a modern User-Agent header seems to get us past the 400 Bad Request error, but then we get Error parsing JSON...
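Roughly the probe I mean, for anyone who wants to experiment (a sketch, not a fix; the UA string is just an example of a current browser):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'}
r = requests.get('https://mobile.twitter.com/search',
                 params={'q': 'from:twitter'}, headers=headers)
print(r.status_code)                  # gets past the 400 with a modern UA
print(r.headers.get('content-type'))  # but the body is HTML, not the JSON GOT3 expects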

@rennanharo

Same thing for me. I get an error 404 but the URL is working.

@shelu16

shelu16 commented Sep 19, 2020

I have the same issue.

@maldil

maldil commented Sep 19, 2020

I am experiencing the same issue. Are there any plans to fix it?

@alifzl

alifzl commented Sep 19, 2020

Same issue here; can somebody help?

@chinmuxmaximus

Same issue. The same code was working a day ago; now it's giving error 404 with a valid link.

@GabrielEspeschit

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

@rsafa

rsafa commented Sep 19, 2020

I am having the same issue. It was more robust than Tweepy. I hope we find a solution as soon as possible.

@herdemo

herdemo commented Sep 19, 2020

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

Unfortunately the Twitter API does not fully meet our needs, because we need full-history search without any limitations. You can search only 5000 tweets a month with the Twitter API.
I hope GetOldTweets starts working again as soon as possible, otherwise I cannot complete my master's thesis.

@Fiyinfoluwa6

I have the same issue. Need some help here.

@GabrielEspeschit

GabrielEspeschit commented Sep 20, 2020

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

Unfortunately the Twitter API does not fully meet our needs, because we need full-history search without any limitations. You can search only 5000 tweets a month with the Twitter API.
I hope GetOldTweets starts working again as soon as possible, otherwise I cannot complete my master's thesis.

I see! I'm fairly new to scraping, but I'm working on an end-of-course thesis about sentiment analysis and could really use some newer tweets to help me out.

I've been tinkering with GOT3's code a bit and got it to read the HTML of the search timeline; however, it's mostly unformatted. Like I said, I have little experience with scraping, so I'm really struggling to parse it correctly. I will note my changes below, for reference and for someone with more experience to pick up if they so wish:

  • updated user_agents (updated with the ones used by TWINT);

  • updated endpoint (/search?)

  • some updates to the URL structure:

url = "https://twitter.com/search?"

url += ("q=%%20%s&src=typd%s"
        "&include_available_features=1&include_entities=1&max_position=%s"
        "&reset_error_state=false")

if not tweetCriteria.topTweets:
    url += "&f=live"

Edit: Forgot to say this. Sometimes the application gives me a 400: Bad Request; I run it again and it outputs the HTML as described above.

@burakoglakci

burakoglakci commented Nov 14, 2020

Thank you so much @sufyanhamid, I'm happy it helped.
As far as I know, the bounding-box query cannot be run on snscrape as it can in the Twitter Stream API. You can use the geocode query instead, as in the Twitter REST API.
Ex.

import snscrape.modules.twitter as sntwitter

maxTweets = 3000
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('geocode:40.682299,-73.944852,5mi + since:2020-10-31 until:2020-11-03 -filter:links -filter:replies').get_items()):
    if i > maxTweets:
        break

With this query you can collect tweets within 5 miles of the point coordinate you specify. As far as I know, you can go up to 15 miles.

@sufyanhamid

@sbif

Hi guys!
I'm totally lost: how can I use snscrape to extract tweets from a user in a specific time frame?
I'm a beginner with Python, and I have to do this for my thesis. I've been trying to extract this data for three weeks without success; I tried with tweepy and then with GetOldTweets3, and I've just discovered this new Twitter API limit...
Can somebody help me please?

Use this query with snscrape:

import snscrape.modules.twitter as sntwitter
import csv
maxTweets = 3000

csvFile = open('place_result.csv', 'a', newline='', encoding='utf8')

csvWriter = csv.writer(csvFile)
csvWriter.writerow(['id','date','tweet',])

for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:@billgates + since:2015-12-02 until:2020-11-05 -filter:links -filter:replies').get_items()):
    if i > maxTweets:
        break
    csvWriter.writerow([tweet.id, tweet.date, tweet.content])
csvFile.close()

@burakoglakci Thanks for the reply. One more thing: what is the query to fetch the number of comments, retweets, and likes? Also, where can I learn how to write queries using snscrape? Kindly share that as well.

@iuserea

iuserea commented Nov 15, 2020

@burakoglakci Thank you for sharing your code! But when I run it, the error below happens. My computer is in China, and I can only reach Twitter by using a VPN. Could you help me figure it out?

Error retrieving https://twitter.com/search?f=live&lang=en&q=deprem+%2B+place%3A5e02a0f0d91c76d2+%2B+since%3A2020-10-31+until%3A2020-11-03+-filter%3Alinks+-filter%3Areplies&src=spelling_expansion_revert_click: ConnectTimeout(MaxRetryError("HTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: /search?f=live&lang=en&q=deprem+%2B+place%3A5e02a0f0d91c76d2+%2B+since%3A2020-10-31+until%3A2020-11-03+-filter%3Alinks+-filter%3Areplies&src=spelling_expansion_revert_click (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000020FFB1C8D30>, 'Connection to twitter.com timed out. (connect timeout=10)'))")), retrying

@Woolwit

Woolwit commented Nov 15, 2020

Anyone have a tip for getting all the tweets in an individual's timeline? I have managed to get the user's own tweets (thank you @burakoglakci for your example) but would like to get the tweets the user retweets as well (tweet.retweetedTweet didn't get it). And for any other noobish coders out there, just in case this helps:

import snscrape.modules.twitter as sntwitter

maxTweets = 10

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:@TwitterSupport').get_items()) :
        if i > maxTweets :
            break  
        print(f"the date is {tweet.date}")  
        print(f"the user name is {tweet.user.username}")
        print(f"the tweet content is {tweet.content}")
        print(f"the tweet rendered content is {tweet.renderedContent}")
        print(f"the outlinks are {tweet.outlinks}")  
        print(f"the tco outlinks are {tweet.tcooutlinks}") 
        print(f"the url is {tweet.url}")
        print(f"the retweeted tweet is  {tweet.retweetedTweet}")        
        print(f"the quoted tweet is  {tweet.quotedTweet}") 

@lorenzopetra96

@TamiresMonteiroCD @WelXingz @ahsanspark @Atoxal @SophieChowZZY
I think I solved the problem. I made a few changes to the lines. I collect tweets using a word and location filter. I'm using Python 3.8.6 on Windows 10 and it works fine right now.

import snscrape.modules.twitter as sntwitter
import csv
maxTweets = 3000

#keyword = 'deprem'
#place = '5e02a0f0d91c76d2' #This geo_place string corresponds to İstanbul, Turkey on twitter.

#keyword = 'covid'
#place = '01fbe706f872cb32' #This geo_place string corresponds to Washington DC on twitter.

#Open/create a file to append data to
csvFile = open('place_result.csv', 'a', newline='', encoding='utf8')

#Use csv writer
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['id','date','tweet',]) 

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('deprem + place:5e02a0f0d91c76d2 + since:2020-10-31 until:2020-11-03 -filter:links -filter:replies').get_items()):
        if i > maxTweets :
            break  
        csvWriter.writerow([tweet.id, tweet.date, tweet.content])
csvFile.close()

@burakoglakci thanks for sharing your experience and work with us!! It's really appreciable and it helps me a lot.
I want to ask what the query string (using snscrape) would be if we want to get tweets by longitude and latitude, and also how we can find the geo-location of any city/country on Twitter.
Thanks in advance :)

Hi!
First of all, I am so grateful for all the support, thank you.
I have a problem with place IDs. I need the Arizona and Florida place IDs but I cannot find them. Can anyone tell me how I can find these (and other place IDs), please?
Thanks in advance <3

@J-t-p

J-t-p commented Nov 16, 2020

I found another simple alternative in case people are having trouble with snscrape. It involves the requests and bs4 (Beautiful Soup) libraries:
import requests
from bs4 import BeautifulSoup

contents = requests.get("https://mobile.twitter.com/username")
soup = BeautifulSoup(contents.text, "html.parser")

tweets = soup.find_all("tr", {"class":"tweet-container"})
latest = tweets[0]
print(latest.text)

This will give you a list with the HTML for, if my counting is correct, the last 20 tweets from that account. Obviously this will not be very useful if you need more than that, but if you don't, then this should work until GOT3 is fixed.

A few things to note: 1. You have to use the mobile link; it does not work with the normal link. (This code can still be run on a desktop computer even with the mobile link.) 2. You can use .text to print/store the tweet in a variable without all the HTML code.

As you can see, this code is very bare bones, so feel free to play around with it and add anything I missed or that you think would be useful.
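If you want all of them rather than just the latest, the tweets list from find_all can be looped over directly (continuing the snippet above):

for tweet in tweets:
    print(tweet.text)  # plain text of each of the ~20 tweets on the first mobile page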

@sufyanhamid

Anyone have a tip for getting all the tweets in an individual's timeline? I have managed to get the user's own tweets (thank you @burakoglakci for your example) but would like to get the tweets the user retweets as well (tweet.retweetedTweet didn't get it). And for any other noobish coders out there, just in case this helps:

import snscrape.modules.twitter as sntwitter

maxTweets = 10

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:@TwitterSupport').get_items()) :
        if i > maxTweets :
            break  
        print(f"the date is {tweet.date}")  
        print(f"the user name is {tweet.user.username}")
        print(f"the tweet content is {tweet.content}")
        print(f"the tweet rendered content is {tweet.renderedContent}")
        print(f"the outlinks are {tweet.outlinks}")  
        print(f"the tco outlinks are {tweet.tcooutlinks}") 
        print(f"the url is {tweet.url}")
        print(f"the retweeted tweet is  {tweet.retweetedTweet}")        
        print(f"the quoted tweet is  {tweet.quotedTweet}") 

@Woolwit Thanks for sharing more of a tweet's attributes. Kindly also share the code/query for how we can get the number of likes, retweets, and comments.
Thanks in advance.

@burakoglakci

@TamiresMonteiroCD @WelXingz @ahsanspark @Atoxal @SophieChowZZY I think I solved the problem. I made a few changes to the lines. I collect tweets using a word and location filter. I'm using Python 3.8.6 on Windows 10 and it works fine right now.

import snscrape.modules.twitter as sntwitter
import csv
maxTweets = 3000

#keyword = 'deprem'
#place = '5e02a0f0d91c76d2' #This geo_place string corresponds to İstanbul, Turkey on twitter.

#keyword = 'covid'
#place = '01fbe706f872cb32' #This geo_place string corresponds to Washington DC on twitter.

#Open/create a file to append data to
csvFile = open('place_result.csv', 'a', newline='', encoding='utf8')

#Use csv writer
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['id','date','tweet',]) 

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('deprem + place:5e02a0f0d91c76d2 + since:2020-10-31 until:2020-11-03 -filter:links -filter:replies').get_items()):
        if i > maxTweets :
            break  
        csvWriter.writerow([tweet.id, tweet.date, tweet.content])
csvFile.close()

@burakoglakci thanks for sharing your experience and work with us!! It's really appreciable and it helps me a lot.
I want to ask what the query string (using snscrape) would be if we want to get tweets by latitude and longitude, and also how we can find the geo-location of any city/country on Twitter.
Thanks in advance :)

Hi!
First of all, I am so grateful for all the support, thank you.
I have a problem with place IDs. I need the Arizona and Florida place IDs but I cannot find them. Can anyone tell me how I can find these (and other place IDs), please?
Thanks in advance <3

Arizona USA id: a612c69b44b2e5da

Florida USA id: 4ec01c9dbc693497
To find these IDs, you have to run a geocode query on Twitter, e.g. geocode:34.684879,-111.699645,1mi. These coordinates let you search a point location in Arizona; you can use any map service to get coordinates. Then click on the content of a tweet that appears as a result of this query: you will see Arizona, USA as the place name on that tweet (if not, check another tweet). After clicking on the place name, you will see the place ID in the link in the search bar.
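Putting those IDs to use, a sketch in the same pattern as the earlier place: queries (the keyword is only a placeholder):

import snscrape.modules.twitter as sntwitter

# a612c69b44b2e5da is the Arizona place ID above; 'election' is just an example keyword
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('election place:a612c69b44b2e5da since:2020-10-31 until:2020-11-03').get_items()):
    if i > 100:
        break
    print(tweet.id, tweet.date, tweet.content)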

@csbhakat

What is the code to get a tweet's like and retweet counts?
I have tried tweet.favorite_count and tweet.retweet_count, but no luck.

@jscas88

jscas88 commented Nov 19, 2020

Hello all!
I am a beginner with Python and coding in general.
Do you think GOT will be updated anytime soon so that timeline scraping can resume?
Also, how can we get more information out of the tweets currently extractable thanks to @burakoglakci and the use of snscrape? Is it possible to get the number of likes, replies, etc. for tweets, for example?
I used the following code and it works fine; thanks to all of you who offered an alternative to continue scraping Twitter 👍

import snscrape.modules.twitter as sntwitter
import csv

csvFile = open('place_result.csv', 'a', newline='', encoding='utf8')
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['id','date','tweet',])
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:@username + since:2009-01-01 until:2020-11-05 -filter:links -filter:replies').get_items()):
    csvWriter.writerow([tweet.id, tweet.date, tweet.content])
csvFile.close()

Change from:@username to a keyword or #hashtag to search by keyword as opposed to username.

Thanks to all who made this code available! Smooth program and helpful for my current project!

@burakoglakci

@csbhakat

https://medium.com/@jcldinco/downloading-historical-tweets-using-tweet-ids-via-snscrape-and-tweepy-5f4ecbf19032 you can get any tweet objects you want using the method described here. I created a script for my own work, and I share it below. I hope it's useful :) You must have a Twitter developer account to use this method.

import pandas as pd
import tweepy
import csv

consumer_key = "aaaaaaaaaaaaaaaaaaaaa" 
consumer_secret = "aaaaaaaaaaaaaaaaaaaaaaaa" 
access_token = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" 
access_token_secret = "aaaaaaaaaaaaaaaaaaaaaaaaaa"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

tweet_url = pd.read_csv("Your_Text_File.txt", index_col= None,
header = None, names = ["links"])

af = lambda x: x["links"].split("/")[-1]
tweet_url['id'] = tweet_url.apply(af, axis=1)
tweet_url.head()

ids = tweet_url['id'].tolist()
total_count = len(ids)
chunks = (total_count - 1) // 50 + 1

def fetch_tw(ids):
    list_of_tw_status = api.statuses_lookup(ids, tweet_mode= "extended")
    empty_data = pd.DataFrame()
    for status in list_of_tw_status:
            tweet_elem = {"date": status.created_at,
                     "tweet_id":status.id,
                     "tweet":status.full_text,
                     "User location":status.user.location,
                     "Retweet count":status.retweet_count,
                     "Like count":status.favorite_count,
                     "Source":status.source}
            empty_data = empty_data.append(tweet_elem, ignore_index = True)
    empty_data.to_csv("new_tweets.csv", mode="a")

for i in range(chunks):
        batch = ids[i*50:(i+1)*50]
        result = fetch_tw(batch)
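A side note for anyone adapting this: if I remember the tweepy docs correctly, statuses_lookup accepts at most 100 tweet IDs per call, so the batches of 50 used here stay safely under that limit; you could raise the chunk size to 100 to halve the number of API calls.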

@csbhakat

@burakoglakci:
Thanks for sharing this

@csbhakat
https://medium.com/@jcldinco/downloading-historical-tweets-using-tweet-ids-via-snscrape-and-tweepy-5f4ecbf19032 you can get any tweet objects you want using the method described here. I created a script for my own work, and I share it below. I hope it's useful :) You must have a Twitter developer account to use this method.

import pandas as pd
import tweepy
import csv

consumer_key = "aaaaaaaaaaaaaaaaaaaaa" 
consumer_secret = "aaaaaaaaaaaaaaaaaaaaaaaa" 
access_token = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" 
access_token_secret = "aaaaaaaaaaaaaaaaaaaaaaaaaa"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

tweet_url = pd.read_csv("Your_Text_File.txt", index_col= None,
header = None, names = ["links"])

af = lambda x: x["links"].split("/")[-1]
tweet_url['id'] = tweet_url.apply(af, axis=1)
tweet_url.head()

ids = tweet_url['id'].tolist()
total_count = len(ids)
chunks = (total_count - 1) // 50 + 1

def fetch_tw(ids):
    list_of_tw_status = api.statuses_lookup(ids, tweet_mode= "extended")
    empty_data = pd.DataFrame()
    for status in list_of_tw_status:
            tweet_elem = {"date": status.created_at,
                     "tweet_id":status.id,
                     "tweet":status.full_text,
                     "User location":status.user.location,
                     "Retweet count":status.retweet_count,
                     "Like count":status.favorite_count,
                     "Source":status.source}
            empty_data = empty_data.append(tweet_elem, ignore_index = True)
    empty_data.to_csv("new_tweets.csv", mode="a")

for i in range(chunks):
        batch = ids[i*50:(i+1)*50]
        result = fetch_tw(batch)

@burakoglakci:
For this code, I need to get all the links and store them in the "Your_Text_File.txt" file, and based on those links the code will scrape the tweets, right?
Suppose I want to get all tweets from March 2020 to October 2020 for #amazon; how can I do that? Does your code help in that case?

@burakoglakci

@burakoglakci:
Thanks for sharing this

@csbhakat
https://medium.com/@jcldinco/downloading-historical-tweets-using-tweet-ids-via-snscrape-and-tweepy-5f4ecbf19032 you can get any tweet objects you want using the method described here. I created a script for my own work, and I share it below. I hope it's useful :) You must have a Twitter developer account to use this method.

import pandas as pd
import tweepy
import csv

consumer_key = "aaaaaaaaaaaaaaaaaaaaa" 
consumer_secret = "aaaaaaaaaaaaaaaaaaaaaaaa" 
access_token = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" 
access_token_secret = "aaaaaaaaaaaaaaaaaaaaaaaaaa"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

tweet_url = pd.read_csv("Your_Text_File.txt", index_col= None,
header = None, names = ["links"])

af = lambda x: x["links"].split("/")[-1]
tweet_url['id'] = tweet_url.apply(af, axis=1)
tweet_url.head()

ids = tweet_url['id'].tolist()
total_count = len(ids)
chunks = (total_count - 1) // 50 + 1

def fetch_tw(ids):
    list_of_tw_status = api.statuses_lookup(ids, tweet_mode= "extended")
    empty_data = pd.DataFrame()
    for status in list_of_tw_status:
            tweet_elem = {"date": status.created_at,
                     "tweet_id":status.id,
                     "tweet":status.full_text,
                     "User location":status.user.location,
                     "Retweet count":status.retweet_count,
                     "Like count":status.favorite_count,
                     "Source":status.source}
            empty_data = empty_data.append(tweet_elem, ignore_index = True)
    empty_data.to_csv("new_tweets.csv", mode="a")

for i in range(chunks):
        batch = ids[i*50:(i+1)*50]
        result = fetch_tw(batch)

@burakoglakci:
For this code, I need to get all the links and store them in the "Your_Text_File.txt" file, and based on those links the code will scrape the tweets, right?
Suppose I want to get all tweets from March 2020 to October 2020 for #amazon; how can I do that? Does your code help in that case?

First, use snscrape to collect the tweets you want, including the tweet IDs and links; you can save them to a CSV or TXT file.

Then collect the tweet objects using this code.
The code I shared here is based on tweepy: it queries by tweet ID and collects the objects you want (likes, retweets).
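In other words, the snscrape step only has to write one tweet link per line, so the tweepy script can split the ID off the end of each URL. A minimal sketch of that first step (the file name matches the placeholder in the script above, and the #amazon query is from your question):

import snscrape.modules.twitter as sntwitter

# One tweet URL per line; the tweepy script derives the ID from the last path segment
with open('Your_Text_File.txt', 'w', encoding='utf8') as f:
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper('#amazon since:2020-03-01 until:2020-10-31').get_items()):
        if i > 1000:
            break
        f.write(tweet.url + '\n')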

@DV777

DV777 commented Nov 20, 2020

@DV777 Hi!

https://medium.com/@jcldinco/downloading-historical-tweets-using-tweet-ids-via-snscrape-and-tweepy-5f4ecbf19032 you can get any tweet objects you want using the method described here. I created a script for my own work, and I share it below. I hope it's useful :) You must have a Twitter developer account to use this method.

import pandas as pd
import tweepy
import csv

consumer_key = "aaaaaaaaaaaaaaaaaaaaa" 
consumer_secret = "aaaaaaaaaaaaaaaaaaaaaaaa" 
access_token = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" 
access_token_secret = "aaaaaaaaaaaaaaaaaaaaaaaaaa"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

tweet_url = pd.read_csv("Your_Text_File.txt", index_col= None,
header = None, names = ["links"])

af = lambda x: x["links"].split("/")[-1]
tweet_url['id'] = tweet_url.apply(af, axis=1)
tweet_url.head()

ids = tweet_url['id'].tolist()
total_count = len(ids)
chunks = (total_count - 1) // 50 + 1

def fetch_tw(ids):
    list_of_tw_status = api.statuses_lookup(ids, tweet_mode= "extended")
    empty_data = pd.DataFrame()
    for status in list_of_tw_status:
            tweet_elem = {"date": status.created_at,
                     "tweet_id":status.id,
                     "tweet":status.full_text,
                     "User location":status.user.location,
                     "Retweet count":status.retweet_count,
                     "Like count":status.favorite_count,
                     "Source":status.source}
            empty_data = empty_data.append(tweet_elem, ignore_index = True)
    empty_data.to_csv("new_tweets.csv", mode="a")

for i in range(chunks):
        batch = ids[i*50:(i+1)*50]
        result = fetch_tw(batch)

Thanks for your help @burakoglakci, I'd be lost without this.
The thing is, when collecting a timeline I do not get the retweets, replies and likes made by the account I am scraping, and I guess those parameters apply to tweets that have already been scraped. I tried to find a way to scrape the full activity of an account, but it seems quite hard. For example, even by using the following code:

import snscrape.modules.twitter as sntwitter
import csv
csvFile = open('place_result.csv', 'a', newline='', encoding='utf8')
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['id','date','tweet',]) 
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:@username + since:2009-01-01 until:2020-11-05 -filter:links -filter:replies').get_items()):
    csvWriter.writerow([tweet.id, tweet.date, tweet.content])
csvFile.close()

I do not get the retweets / replies / likes made by the account, only its own created tweets. Is there a way to scrape the whole thing? Would you have a list of the additional parameters I could add to the scrape?
Also, I do have Twitter API keys; the problem is that tweepy and the Twitter API only let me collect 3000 tweets maximum when scraping an account's timeline, at least when I was using it in 2019. Is this still the case?

@burakoglakci

@DV777 Yes, the parameters attached to tweepy apply to tweets that have already been scraped.

On snscrape, if you remove the filter:replies parameter you can get replies. You can also collect retweets by removing the filter:links parameter, but that mostly collects the links of the main tweet. I don't know if there's a way to get the number of likes with snscrape.
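For example, the earlier timeline query with both filters dropped (a sketch; as noted, retweets mostly come back as links to the main tweet):

import snscrape.modules.twitter as sntwitter

# The same from: query as before, but without -filter:links -filter:replies
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:@username + since:2009-01-01 until:2020-11-05').get_items()):
    if i > 100:
        break
    print(tweet.id, tweet.date, tweet.content)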

@sufyanhamid

@burakoglakci is there any way to find the longitude and latitude of tweets using snscrape?

@elizabethsong

I just used snscrape to get tweets for individual user accounts, filtering by like count. See code here: https://github.com/elizabethhh/Twitter-Data-Mining-Astro/blob/main/testastroold.py.

@DiameterEffect

I just used snscrape to get tweets for individual user accounts, filtering by like count. See code here: https://github.com/elizabethhh/Twitter-Data-Mining-Astro/blob/main/testastroold.py.

can it get over 200k tweets?

@vinaigre552

I just used snscrape to get tweets for individual user accounts, filtering by like count. See the code here: https://github.com/elizabethhh/Twitter-Data-Mining-Astro/blob/main/testastroold.py

Error retrieving https://twitter.com/search?f=live&lang=en&q=from%3A%40GeminiTerms+%2B+since%3A2015-12-02+until%3A2020-11-10-filter%3Areplies&src=spelling_expansion_revert_click: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='twitter.com', port=443): Read timed out. (read timeout=10)")), retrying
Have you encountered this problem? If so, how did you solve it?

@MartinBeckUT

I don't recommend using Tweepy with snscrape; it's not really efficient, since you're basically scraping twice. When you scrape with snscrape, there's a tweet object you can interact with that has a lot of information and will cover most use cases. I wouldn't recommend using tweepy's api.statuses_lookup unless you need specific information only offered through tweepy.

For those still unsure about using snscrape, I wrote an article on scraping with snscrape that I hope clears up any confusion about using that library; there are also Python scripts and Jupyter notebooks I've created to build off of. I also have a picture in the article showing all the information accessible in snscrape's tweet object.
https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af
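On the likes/retweets question that keeps coming up in this thread: the development version of snscrape exposes engagement counts directly on the tweet object, if I'm reading it right (the field names below are from the current dev version and may change):

import snscrape.modules.twitter as sntwitter

for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:@TwitterSupport').get_items()):
    if i > 10:
        break
    # likeCount / retweetCount / replyCount / quoteCount exist on the dev-version tweet object
    print(tweet.id, tweet.likeCount, tweet.retweetCount, tweet.replyCount, tweet.quoteCount)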

@Woolwit

Woolwit commented Dec 3, 2020

Brilliant, thank you Martin!

@axaygaid

axaygaid commented Dec 5, 2020

Is there any way to get historical tweets for a hashtag? Like the most popular hashtag for the word ripple, as an example, from 2015?

Tweepy has a limit of one week of depth, and I tried GOT but got the same 404 issue as here. Does anyone have another solution for building a database of historical tweets? :)

Thanks!

@MartinBeckUT

MartinBeckUT commented Dec 7, 2020

Is there any way to get historical tweets for a hashtag? Like the most popular hashtag for the word ripple, as an example, from 2015?

Tweepy has a limit of one week of depth, and I tried GOT but got the same 404 issue as here. Does anyone have another solution for building a database of historical tweets? :)

Thanks!

Yes, refer to my article mentioned above, where I cover the basics of using snscrape instead, since GetOldTweets3 is basically obsolete due to changes in Twitter's API: https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af

As for your specific use case: with snscrape you just put whatever query you want inside the quotes in the TwitterSearchScraper call and adjust the since and until operators to whatever time range you want. I created a code snippet for you below. You can take out the i>500 check if you don't want to restrict the number of tweets and just want every single one.

import snscrape.modules.twitter as sntwitter
import pandas as pd

tweets_list2 = []

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('#ripple since:2015-01-01 until:2016-01-01').get_items()):
    if i>500:
        break
    tweets_list2.append([tweet.date, tweet.id, tweet.content, tweet.user.username])
   
tweets_df2 = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])

@axaygaid

axaygaid commented Dec 7, 2020

Is there any way to get historical tweets for a hashtag? Like the most popular hashtag for the word ripple, as an example, from 2015?
Tweepy has a limit of one week of depth, and I tried GOT but got the same 404 issue as here. Does anyone have another solution for building a database of historical tweets? :)
Thanks!

Yes, refer to my article mentioned above, where I cover the basics of using snscrape instead, since GetOldTweets3 is basically obsolete due to changes in Twitter's API: https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af

As for your specific use case: with snscrape you just put whatever query you want inside the quotes in the TwitterSearchScraper call and adjust the since and until operators to whatever time range you want. I created a code snippet for you below. You can take out the i>500 check if you don't want to restrict the number of tweets and just want every single one.

import snscrape.modules.twitter as sntwitter
import pandas as pd

tweets_list2 = []

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('#ripple since:2015-01-01 until:2016-01-01').get_items()):
    if i>500:
        break
    tweets_list2.append([tweet.date, tweet.id, tweet.content, tweet.user.username])
   
tweets_df2 = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])

Hello,

Thanks for your precious answer! :) I tried your code and I still get an error, but now it seems to be related to my internet config? Do you have an idea how to fix it?

The error msg:
Error retrieving https://twitter.com/search?f=live&lang=en&q=%23ripple+since%3A2015-01-01+until%3A2016-01-01&src=spelling_expansion_revert_click: ConnectTimeout(MaxRetryError("HTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: /search?f=live&lang=en&q=%23ripple+since%3A2015-01-01+until%3A2016-01-01&src=spelling_expansion_revert_click (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7ffb694b28d0>, 'Connection to twitter.com timed out. (connect timeout=10)'))")), retrying

Also, when I tried this code on another laptop it works, even though it's the same config.

Thanks a lot!

@stefanocortinovis

stefanocortinovis commented Dec 7, 2020

Hey! For the ones struggling to use snscrape, I put together a little library to download tweets using snscrape/tweepy according to customizable queries. Although it's still a work in progress, check this repo if you want to give it a try :)

@jis0324

jis0324 commented Feb 8, 2022

Hello there,
So what is the final solution to avoid the 404 error status?
Until yesterday, this page worked with Python requests, but since today it no longer works for me and returns a 404 error status.

import requests

headers = {
    'Connection': 'keep-alive',
    'rtt': '300',
    'downlink': '0.4',
    'ect': '3g',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'en-US,en;q=0.9,ko;q=0.8',
}

response = requests.get(
    'https://www.amazon.de/sp?marketplaceID=A1PA6795UKMFR9&seller=A135E02VGPPVQ&isAmazonFulfilled=1&ref=dp_merchant_link',
    headers=headers
)
print(response.status_code) # 404

I would really appreciate any help you can offer.
Regards.

@DiameterEffect

Hey! For the ones struggling to use snscrape, I put together a little library to download tweets using snscrape/tweepy according to customizable queries. Although it's still a work in progress, check this repo if you want to give it a try :)
Hello, does this one get images and videos?

@jajalipiao

I am having the same issue; does anyone have a solution for it?

@libbyseline

I am having Twitter API errors today, though the usernames I'm searching for appear to be working. Any solutions? I work in R/rtweet, specifically using the tweetbotornot2 package.
