Releases: echen102/COVID-19-TweetIDs

Release v2.6

30 Jul 07:05

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 7/24/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v2.6)

Number of Tweets : 360,594,376

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 239,831,253 66.51%
Spanish es 45,999,640 12.76%
Portuguese pt 14,204,685 3.94%
Indonesian in 9,695,719 2.69%
Undefined und 8,955,350 2.48%
French fr 7,679,924 2.13%
Japanese ja 5,956,904 1.65%
Thai th 4,293,378 1.19%
Hindi hi 3,887,356 1.08%
Italian it 3,208,807 0.89%
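
The percentage column can be reproduced from the counts in the table and the total number of Tweets. The snippet below is a minimal sketch of that arithmetic using three rows of the v2.6 table; the names `TOTAL_TWEETS`, `LANGUAGE_COUNTS` and `share_percent` are illustrative and not part of the repository.

```python
# Counts copied from the v2.6 statistics summary.
TOTAL_TWEETS = 360_594_376
LANGUAGE_COUNTS = {
    "en": 239_831_253,
    "es": 45_999_640,
    "pt": 14_204_685,
}


def share_percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


for iso, count in LANGUAGE_COUNTS.items():
    print(f"{iso}: {share_percent(count, TOTAL_TWEETS):.2f}%")
```

The same function can be applied to the counts of any other release by swapping in that release's total.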

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC
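
When working with timestamps from this period, it may help to flag those that fall inside a known gap. The sketch below encodes the windows listed above under a few assumptions: dates are read as month/day/year, each window is treated as half-open (start included, end excluded), and the 3/2/2020 entry is left out because no time window is given for it. The helper names `utc` and `in_known_gap` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc(month: int, day: int, hour: int) -> datetime:
    """Build a 2020 UTC datetime; hour may be 24 to mean the end of the day."""
    return datetime(2020, month, day, tzinfo=timezone.utc) + timedelta(hours=hour)


# Windows copied from the Known Gaps table (3/2/2020 omitted: no window given).
KNOWN_GAPS = [
    (utc(2, 1, 4), utc(2, 1, 9)),
    (utc(2, 8, 6), utc(2, 8, 7)),
    (utc(2, 22, 21), utc(2, 22, 24)),
    (utc(2, 23, 0), utc(2, 23, 24)),
    (utc(2, 24, 0), utc(2, 24, 4)),
    (utc(2, 25, 0), utc(2, 25, 3)),
    (utc(5, 14, 7), utc(5, 14, 8)),
]


def in_known_gap(timestamp: datetime) -> bool:
    """Return True if the UTC timestamp falls inside any listed gap window."""
    return any(start <= timestamp < end for start, end in KNOWN_GAPS)


print(in_known_gap(utc(2, 23, 12)))  # midday on 2/23/2020
print(in_known_gap(utc(2, 24, 5)))   # one hour after the 2/24/2020 window
```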

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v2.5

20 Jul 07:50

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 7/17/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v2.5)

Number of Tweets : 330,683,492

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 219,846,790 66.48%
Spanish es 41,951,902 12.69%
Portuguese pt 12,771,927 3.86%
Indonesian in 8,912,870 2.70%
Undefined und 8,127,946 2.46%
French fr 7,225,169 2.18%
Japanese ja 5,629,746 1.70%
Thai th 4,093,084 1.24%
Hindi hi 3,517,176 1.06%
Italian it 3,018,193 0.91%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v2.4

13 Jul 08:56

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 7/10/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v2.4)

Number of Tweets : 302,377,492

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 199,789,377 66.07%
Spanish es 38,477,687 12.73%
Portuguese pt 11,865,812 3.92%
Indonesian in 8,480,320 2.80%
Undefined und 7,354,640 2.43%
French fr 6,826,740 2.26%
Japanese ja 5,327,080 1.76%
Thai th 3,633,528 1.20%
Hindi hi 3,213,590 1.06%
Turkish tr 2,839,437 0.94%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v2.3

06 Jul 07:15

We have migrated our data collection to AWS, with upgraded computation and network specifications. This has enabled us to collect significantly more Tweets every hour, and the number of Tweet-IDs we will be uploading each week from release v2.0 onward will be greater than the number of Tweet-IDs we have been able to collect in previous releases. Please see our notes section in the README for further details.
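
To illustrate the change in collection volume, the sketch below computes how many Tweet IDs each release added over the previous one, using the totals stated in the statistics summaries of releases v1.10 through v2.3. The names `RELEASE_TOTALS` and `release_increments` are illustrative, and the differences are only a rough proxy for weekly collection volume, since the releases are not spaced exactly one week apart.

```python
# Total Tweet counts as stated in each release's statistics summary.
RELEASE_TOTALS = [
    ("v1.10", 137_339_309),
    ("v1.11", 144_747_801),
    ("v1.12", 152_862_137),
    ("v2.0", 183_011_739),
    ("v2.1", 212_978_935),
    ("v2.2", 242_400_994),
    ("v2.3", 272_346_129),
]


def release_increments(totals):
    """Return (release, added IDs) pairs relative to the preceding release."""
    return [
        (later_name, later_total - earlier_total)
        for (_, earlier_total), (later_name, later_total) in zip(totals, totals[1:])
    ]


for release, added in release_increments(RELEASE_TOTALS):
    print(f"{release}: +{added:,}")
```

With these totals, the releases before v2.0 each add roughly 7 to 8 million Tweet IDs, while v2.0 and later each add roughly 30 million.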

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 7/03/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v2.3)

Number of Tweets : 272,346,129

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 178,749,702 65.63%
Spanish es 34,749,653 12.76%
Portuguese pt 10,497,310 3.85%
Indonesian in 7,936,066 2.91%
Undefined und 6,531,125 2.40%
French fr 6,368,869 2.34%
Japanese ja 5,018,438 1.84%
Thai th 3,508,870 1.29%
Hindi hi 2,971,606 1.09%
Turkish tr 2,690,310 0.99%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v2.2

29 Jun 07:34

We have migrated our data collection to AWS, with upgraded computation and network specifications. This has enabled us to collect significantly more Tweets every hour, and the number of Tweet-IDs we will be uploading each week from release v2.0 onward will be greater than the number of Tweet-IDs we have been able to collect in previous releases. Please see our notes section in the README for further details.

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 6/26/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v2.2)

Number of Tweets : 242,400,994

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 157,266,557 64.88%
Spanish es 31,245,760 12.89%
Portuguese pt 9,418,332 3.89%
Indonesian in 7,379,473 3.04%
French fr 5,970,986 2.46%
Undefined und 5,726,108 2.36%
Japanese ja 4,706,208 1.94%
Thai th 3,348,912 1.38%
Hindi hi 2,656,781 1.10%
Turkish tr 2,509,649 1.04%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v2.1

23 Jun 12:05

We have migrated our data collection to AWS, with upgraded computation and network specifications. This has enabled us to collect significantly more Tweets every hour, and the number of Tweet-IDs we will be uploading each week from release v2.0 onward will be greater than the number of Tweet-IDs we have been able to collect in previous releases. Please see our notes section in the README for further details.

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 6/19/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v2.1)

Number of Tweets : 212,978,935

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 137,344,992 64.48%
Spanish es 27,035,278 12.69%
Portuguese pt 8,193,574 3.85%
Indonesian in 6,777,050 3.18%
French fr 5,504,403 2.58%
(undefined) und 5,003,877 2.35%
Japanese ja 4,384,617 2.06%
Thai th 3,266,392 1.53%
Hindi hi 2,349,801 1.10%
Italian it 2,291,748 1.08%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v2.0

15 Jun 08:09

We have migrated our data collection to AWS, with upgraded computation and network specifications. This has enabled us to collect significantly more Tweets every hour, and the number of Tweet-IDs we will be uploading each week from release v2.0 onward will be greater than the number of Tweet-IDs we have been able to collect in previous releases. Please see our notes section in the README for further details.

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 6/12/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v2.0)

Number of Tweets : 183,011,739

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 117,868,338 64.40%
Spanish es 22,395,793 12.24%
Portuguese pt 6,900,098 3.77%
Indonesian in 6,124,708 3.35%
French fr 4,917,804 2.69%
(undefined) und 4,242,198 2.32%
Japanese ja 4,061,424 2.22%
Thai th 3,154,030 1.72%
Italian it 2,089,938 1.14%
Hindi hi 2,008,659 1.10%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v1.12

08 Jun 11:05

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 6/05/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v1.12)

Number of Tweets : 152,862,137

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 99,753,283 65.26%
Spanish es 17,678,687 11.57%
Indonesian in 5,133,446 3.36%
Portuguese pt 4,850,362 3.17%
French fr 4,393,918 2.87%
Japanese ja 3,670,726 2.40%
(undefined) und 3,451,912 2.26%
Thai th 2,991,427 1.96%
Italian it 1,849,528 1.21%
Turkish tr 1,577,658 1.03%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v1.11

01 Jun 09:56

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 5/29/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
JMIR Public Health Surveill 2020;6(2):e19273
DOI: 10.2196/19273
PMID: 32427106

Statistics Summary (v1.11)

Number of Tweets : 144,747,801

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 94,369,403 65.20%
Spanish es 16,588,272 11.46%
Indonesian in 4,914,741 3.40%
Portuguese pt 4,522,335 3.12%
French fr 4,241,157 2.93%
Japanese ja 3,537,748 2.44%
(undefined) und 3,279,442 2.27%
Thai th 2,924,431 2.02%
Italian it 1,782,514 1.23%
Turkish tr 1,507,370 1.04%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Release v1.10

25 May 07:58

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

This release contains Tweet IDs collected from 1/21/20 - 5/22/20.

Please refer to the README for more details regarding data, data organization and data usage agreement.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset. arXiv:cs.SI/2003.07372, 2020

Statistics Summary (v1.10)

Number of Tweets : 137,339,309

Language breakdown of top 10 most prevalent languages :

Language ISO No. tweets % total Tweets
English en 89,446,467 65.13%
Spanish es 15,651,897 11.40%
Indonesian in 4,703,023 3.42%
Portuguese pt 4,228,772 3.08%
French fr 4,100,588 2.99%
Japanese ja 3,365,432 2.45%
(undefined) und 3,094,598 2.25%
Thai th 2,887,217 2.10%
Italian it 1,726,624 1.26%
Turkish tr 1,452,652 1.06%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.