Continuously maintained Sina Weibo crawler 🚀🚀🚀
UPDATE: The keyword search interface of weibo.cn has expired (2021-06-06)
The project has 2 branches to meet different needs:
Branch | Features | Magnitude of the crawled data
---|---|---
master | Single account, single IP, single machine | Hundreds of thousands
senior | Account pool, IP pool, Docker | Hundreds of millions (theoretically unlimited)
- User information
- Tweets posted by a user (all, or within a specific period)
- A user's social relationships (fans/followers)
- Comments on tweets
- Tweets matching keywords within a time period
- Retweets of a tweet
The spider is based on weibo.cn, and the crawled fields are rich. For more detail, see the Data Structure Description.
Note that the required Python version is 3.6.
```bash
git clone git@github.com:nghuyong/WeiboSpider.git --depth 1 --no-single-branch
cd WeiboSpider
pip install -r requirements.txt
```
In addition, you need to install MongoDB.
Visit https://weibo.cn/, log in, open the browser's developer tools, and refresh the page.
Then copy the cookie value from the weibo.cn request in the Network panel.
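The value you copy from the browser is a single `key=value; key=value` string. As a quick sanity check, a small helper (not part of the project; `parse_cookie_string` is a hypothetical name) can split it into a dict so you can confirm that login fields such as `SUB` are present:

```python
def parse_cookie_string(raw: str) -> dict:
    """Split a browser 'k=v; k2=v2' cookie string into a dict."""
    pairs = (item.split("=", 1) for item in raw.split(";") if "=" in item)
    return {k.strip(): v.strip() for k, v in pairs}

# A logged-in weibo.cn cookie should at least contain a SUB field.
cookie = parse_cookie_string("SUB=_2A25zj...; _T_WM=76982927613")
assert "SUB" in cookie
```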
Edit weibospider/settings.py:
```python
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0',
    'Cookie': 'SCF=AlvwCT3ltiVc36wsKpuvTV8uWF4V1tZ17ms9t-bZCAuiVJKpCsgvvmSdylNE6_4GbqwA_MWvxNgoc0Ks-qbZStc.; OUTFOX_SEARCH_USER_ID_NCOO=1258151803.428431; SUB=_2A25zjTjHDeRhGeBN6VUX9SvEzT-IHXVQjliPrDV6PUJbkdANLUvskW1NRJ24IEPNKfRaplNknl957NryzKEwBmhJ; SUHB=0ftpSdul-YZaMk; _T_WM=76982927613'
}
```
Replace the Cookie field with your own cookie.
If the crawler starts receiving 403 or 302 responses, the account is blocked or the cookie has expired.
To crawl through proxies, rewrite the fetch_proxy function.
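A minimal sketch of what a rewritten fetch_proxy might look like, assuming you maintain your own proxy pool (the PROXY_POOL list below is a placeholder, not part of the project, and the actual signature expected by the project may differ):

```python
import random

# Placeholder pool; in practice this would come from your own proxy source,
# e.g. an HTTP proxy-pool API or a database of purchased proxies.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]

def fetch_proxy():
    """Return one proxy URL for the next request; None disables proxying."""
    if not PROXY_POOL:
        return None
    return random.choice(PROXY_POOL)
```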
You can rewrite the start_requests functions in ./weibospider/spiders/* to change the crawl targets.
```bash
cd weibospider
python run_spider.py user
python run_spider.py fan
python run_spider.py follow
python run_spider.py comment
```
To crawl tweets by user IDs, build urls with init_url_by_user_id() in the start_requests function of ./weibospider/spiders/tweet.py, then run:

```bash
python run_spider.py tweet
```
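As an illustration, an initial-URL builder of this kind might look like the sketch below; the weibo.cn profile-page pattern is an assumption, not copied from the project, so check tweet.py for the real implementation:

```python
def init_url_by_user_id(user_ids):
    """Build the first tweet-list page per user (URL pattern assumed)."""
    return [f"https://weibo.cn/{uid}/profile?page=1" for uid in user_ids]

# e.g. two placeholder user ids
urls = init_url_by_user_id(["1087770692", "1699432410"])
```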
To crawl a user's tweets within a date range, build urls with init_url_by_user_id_and_date() in the start_requests function of ./weibospider/spiders/tweet.py, then run:

```bash
python run_spider.py tweet
```
To crawl tweets by keywords and time period, build urls with init_url_by_keywords_and_date() in the start_requests function of ./weibospider/spiders/tweet.py, then run:

```bash
python run_spider.py tweet
```
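A sketch of the keyword-and-date variant, splitting the range into one search URL per keyword per day; the weibo.cn advanced-search URL format below is an assumption (and, per the update note above, this interface may no longer work), so treat it as illustrative only:

```python
from datetime import date, timedelta
from urllib.parse import urlencode

def init_url_by_keywords_and_date(keywords, start, end):
    """One search URL per keyword per day between start and end (inclusive)."""
    urls = []
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        for kw in keywords:
            # Query-parameter names are assumed from weibo.cn's search form.
            query = urlencode({
                "keyword": kw,
                "advancedfilter": 1,
                "starttime": stamp,
                "endtime": stamp,
                "page": 1,
            })
            urls.append(f"https://weibo.cn/search/mblog?{query}")
        day += timedelta(days=1)
    return urls
```

Splitting per day keeps each result list small enough to page through, since the search interface only exposes a limited number of result pages per query.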
To crawl the retweets of a tweet, run:

```bash
python run_spider.py repost
```
Based on this project, I have crawled data on millions of active Weibo users and built several Weibo public opinion datasets: weibo-public-opinion-datasets.
If you run into any problems using the project, feel free to open an issue to discuss.
If you have good ideas on social media computing or public opinion analysis, feel free to email me: [email protected]