中文说明 | English

# WeiboSpider


Continuously maintained Sina Weibo crawler 🚀🚀🚀

UPDATE: The keyword search interface of weibo.cn has expired (2021.6.6)

## Introduction

### Branches

The project has two branches to meet different needs:

| Branch | Features | Magnitude of the crawled data |
|--------|----------|-------------------------------|
| master | Single account, single IP, single machine | Hundreds of thousands |
| senior | Account pool, IP pool, Docker | Hundreds of millions (theoretically unlimited) |

### Supported crawling types

- User information
- Tweets posted by a user (all / within a specific period)
- A user's social relationships (fans / followers)
- Comments on a tweet
- Tweets matching keywords within a time period
- Retweets/reposts of a tweet

## Data Structure

The spider is based on weibo.cn, and the crawled fields are rich. For more detail, see the Data Structure Description.

## Get Started

### Pull the project && install dependencies

Note that the required Python version is 3.6.

```bash
git clone git@github.com:nghuyong/WeiboSpider.git --depth 1 --no-single-branch
cd WeiboSpider
pip install -r requirements.txt
```

In addition, you need to install MongoDB.
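Once MongoDB is running, you can verify it is reachable with a minimal check. This is only a convenience sketch, assuming the `pymongo` package is available and MongoDB listens on the default local port:

```python
# Sketch: verify MongoDB is reachable (assumes pymongo is installed
# and MongoDB listens on the default localhost:27017).
from pymongo import MongoClient

client = MongoClient('localhost', 27017, serverSelectionTimeoutMS=2000)
# server_info() raises ServerSelectionTimeoutError if MongoDB is unreachable.
print(client.server_info()['version'])
```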

### Replace Cookies

Visit https://weibo.cn/ and log in.

Open the browser's developer tools and refresh the page.

In the Network panel, copy the Cookie value from the weibo.cn request.

Edit weibospider/settings.py:

```python
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0',
    'Cookie': 'SCF=AlvwCT3ltiVc36wsKpuvTV8uWF4V1tZ17ms9t-bZCAuiVJKpCsgvvmSdylNE6_4GbqwA_MWvxNgoc0Ks-qbZStc.; OUTFOX_SEARCH_USER_ID_NCOO=1258151803.428431; SUB=_2A25zjTjHDeRhGeBN6VUX9SvEzT-IHXVQjliPrDV6PUJbkdANLUvskW1NRJ24IEPNKfRaplNknl957NryzKEwBmhJ; SUHB=0ftpSdul-YZaMk; _T_WM=76982927613'
}
```

Replace the Cookie field with your own cookie.

If the crawler receives 403/302 responses, the account is blocked or the cookie has expired. A quick way to check a cookie before running is sketched below.
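This sanity check replays a request with your cookie using the `requests` package. It is not part of the project's own flow, just a convenience sketch:

```python
# Sketch: check whether a weibo.cn cookie is still valid.
# (Uses the requests package; not part of the project itself.)
import requests

COOKIE = 'paste-your-weibo.cn-cookie-here'

resp = requests.get(
    'https://weibo.cn/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Cookie': COOKIE,
    },
    allow_redirects=False,
)
# 200 suggests the cookie works; 302 (redirect to login) or 403 means
# it is invalid or the account is blocked.
print(resp.status_code)
```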

### Add proxy IP (optional)

Rewrite the fetch_proxy function so that it supplies proxies from your own pool; a sketch follows.
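A minimal sketch of such a rewrite. The exact return contract of fetch_proxy depends on your checkout, so consult the original implementation; the proxy-pool endpoint below is a hypothetical placeholder:

```python
# Sketch of a fetch_proxy override (paste into the spider class that
# defines it). The return contract depends on the project version --
# check the original implementation in your checkout.
# http://your-proxy-pool/get is a hypothetical endpoint of your own
# proxy service that returns one "ip:port" per request.
import requests

def fetch_proxy(self):
    resp = requests.get('http://your-proxy-pool/get', timeout=5)
    ip_port = resp.text.strip()  # e.g. "1.2.3.4:8080"
    return f'http://{ip_port}'
```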

### Run the program

You can customize the start_requests function in each spider under ./weibospider/spiders/, for example as in the sketch below.
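A self-contained sketch of a spider with a customized start_requests. The user IDs and URL pattern are placeholder assumptions, not the project's exact code:

```python
# Sketch: customizing start_requests in a spider under
# ./weibospider/spiders/. The IDs and URL pattern are placeholders.
import scrapy

class UserSpiderSketch(scrapy.Spider):
    name = 'user_sketch'

    def start_requests(self):
        user_ids = ['1087770692', '1699432410']  # hypothetical target IDs
        for user_id in user_ids:
            yield scrapy.Request(f'https://weibo.cn/{user_id}/info',
                                 callback=self.parse)

    def parse(self, response):
        self.logger.info('fetched %s', response.url)
```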

#### Crawl User Info

```bash
cd weibospider
python run_spider.py user
```

#### Crawl Fans List

```bash
python run_spider.py fan
```

#### Crawl Followers List

```bash
python run_spider.py follow
```

#### Crawl Comments on Tweets

```bash
python run_spider.py comment
```

#### Crawl Tweets of Users (All)

In start_requests in ./weibospider/spiders/tweet.py, build urls with init_url_by_user_id() (see the sketch after the command).

```bash
python run_spider.py tweet
```
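What that selection might look like inside start_requests, as a sketch; the helper's exact signature may differ in your checkout:

```python
# Inside start_requests of ./weibospider/spiders/tweet.py (sketch;
# the exact signature of init_url_by_user_id may differ):
user_ids = ['1087770692']  # placeholder target IDs
urls = [self.init_url_by_user_id(user_id) for user_id in user_ids]
```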

#### Crawl Tweets of Users (Specific Period)

In start_requests in ./weibospider/spiders/tweet.py, build urls with init_url_by_user_id_and_date() instead (sketched below).

```bash
python run_spider.py tweet
```
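The period-based variant, again as a sketch; the date parameters and their format are assumptions:

```python
# Inside start_requests of ./weibospider/spiders/tweet.py (sketch;
# the date parameters and their format are assumptions).
user_ids = ['1087770692']  # placeholder target IDs
start_date, end_date = '2020-01-01', '2020-12-31'
urls = [self.init_url_by_user_id_and_date(user_id, start_date, end_date)
        for user_id in user_ids]
```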

#### Crawl Tweets by Keywords and Time Period (Expired)

In start_requests in ./weibospider/spiders/tweet.py, build urls with init_url_by_keywords_and_date(). Note that this interface has expired (see the update above).

```bash
python run_spider.py tweet
```

#### Crawl Retweets/Reposts

```bash
python run_spider.py repost
```

## Last But Not Least

Based on this project, I have crawled data on millions of active Weibo users and built several Weibo public opinion datasets: weibo-public-opinion-datasets.

If you run into any problems using the project, feel free to open an issue to discuss.

If you have good ideas about social media computing or public opinion analysis, feel free to email me: [email protected]