Cybersecurity Dataset

A dataset consisting of 4.3 million entries of Twitter, Blogs, Paper, and CVEs related to the cybersecurity domain

#Tokens	Min	Max	Sum	Median	Mean	Entries
Blogs	44	0.1M	169M	710	1.1k	151k
arXiv	533	0.7M	167M	8.2k	9.9k	16k
CVE	5	1.9k	12M	58	71	171k
Twitter	1	500	179M	39	45	4M
Total	1	0.7M	528M	40	122	4.3M

Table: Statistics of the number of tokens and entries of the dataset.

Blogs: 38 different blogs, like troyhunt.com, darkreading.com, schneier.com, and krebsonsecurity.com | Filtered duplicates, non-english texts and instances shorter than 300 characters | Extraction: trafilatura

arXiv: Papers from the category Cryptography and Security | Extraction: opendetex for papers in tex format and PyPDF2 for papers in pdf format | Filtered paper with lower length than 3000 characters

CVEs: NVD entries till 2022-03-15 21:38:23 | No filtering

Twitter: 1. Dataset: (infosec OR security OR threat OR vulnerability OR cyber OR cybersec OR infrasec OR netsec OR hacking OR siem OR soc OR offsec OR osing OR bugbounty) | 2. Dataset: Data breaches, as, for example, the Microsoft Exchange Server Data Breach

Usage

The dataset contains only the references to the data due to data storage/publication restrictions. The data instances can be gathered via the scripts in this repository or external tools.

Blog Crawling: blog_crawling.py -- Requirements: Installing requests pip install trafilatura

ArXiv Crawling: arxiv_crawling.py -- Requirements: Installing requests, PyPDF2, beautifulsoup4 pip install requests beautifulsoup4 PyPDF2

NVD Crawling: nvd_crawling.py -- Requirements: Installing requests pip install requests

Twitter Crawling: twitter_crawling.py -- Requirements: Installing Tweepy pip install tweepy | For gathering the Tweets you need to insert your API-Key, API-Secret-Key, Access-Token and Access-Token-Secret of the Twitter-API into the script before you run it.

Citation

When you use the dataset in your research, please cite:

Bayer, M., Kuehn, P., Shanehsaz, R., & Reuter, C. (2022). CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain. ArXiv, abs/2212.02974.

https://arxiv.org/abs/2212.02974

@misc{bayer2022cysecbert,
      title={CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain}, 
      author={Markus Bayer and Philipp Kuehn and Ramin Shanehsaz and Christian Reuter},
      year={2022},
      eprint={2212.02974},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}

Language Model

This is the dataset constitutes the training dataset of the CySecBERT model.

Acknowledgements

This research work has been funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE and by the German Federal Ministry for Education and Research~(BMBF) in the project CYWARN~(13N15407). The calculations for this research were conducted on the Lichtenberg high performance computer of the TU Darmstadt.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Cybersecurity Dataset

A dataset consisting of 4.3 million entries of Twitter, Blogs, Paper, and CVEs related to the cybersecurity domain

Usage

Citation

Language Model

Acknowledgements

Files

README.md

Latest commit

History

README.md

File metadata and controls

Cybersecurity Dataset

A dataset consisting of 4.3 million entries of Twitter, Blogs, Paper, and CVEs related to the cybersecurity domain

Usage

Citation

Language Model

Acknowledgements