A BERT-base model pre-trained on a large corpus of computer networking text (~23GB).
NetBERT demonstrates clear improvements over BERT on the following two representative text mining tasks:
- Computer Networking Text Classification (0.9% F1 improvement);
- Computer Networking Information Retrieval (12.3% improvement on a custom information retrieval score).
Additional experiments on Word Similarity and Word Analogy tend to show that NetBERT captures more meaningful semantic properties and relations between networking concepts than BERT does. For more information, you can download my thesis.
You can use NetBERT with the 🤗 transformers library as follows:
import torch
from transformers import BertTokenizer, BertForMaskedLM
# Load pretrained model and tokenizer
model = BertForMaskedLM.from_pretrained("antoiloui/netbert")
tokenizer = BertTokenizer.from_pretrained("antoiloui/netbert")
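You can then sanity-check the model by predicting a masked token. The sentence below is purely illustrative and not taken from the original documentation:
# Illustrative networking sentence with a masked token
text = "The router forwards packets based on its routing [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Locate the [MASK] position and print the 5 most likely replacements
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))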
The computer networking corpus was collected by scraping all the text content from cisco.com, which resulted in about 30GB of uncleaned text from 442,028 web pages in total. Pre-processing the original corpus yielded a cleaned dataset of about 170.7M sentences, for a total size of 22.7GB.
The following section describes how to run the cleaning scripts located in 'scripts/data_cleaning/'.
The following command cleans a dataset of documents stored in a json file:
python cleanup_dataset.py --data_dir=<data_dir> --infile=<infile>
where --data_dir indicates the path of the directory containing the json files to clean, and --infile indicates the name of the json file to clean. Note that one can clean all the json files present in /<data_dir> at once by running:
python cleanup_dataset.py --all=True
This script will clean the original dataset as follows (a rough code sketch is given after the list):
- applying the fix_text function from ftfy;
- replacing two or more spaces with one;
- removing sequences of special characters;
- removing small documents (less than 128 tokens);
- removing non-English documents with langdetect.
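For reference, here is a minimal sketch of what these steps amount to; it is not the actual cleanup_dataset.py code, and the regular expressions and special-character threshold are assumptions:
import re
from ftfy import fix_text
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def clean_document(text, min_tokens=128):
    # Illustrative cleaning pass mirroring the steps listed above
    text = fix_text(text)                     # fix broken unicode / mojibake
    text = re.sub(r" {2,}", " ", text)        # replace two or more spaces with one
    text = re.sub(r"[^\w\s]{4,}", "", text)   # remove sequences of special characters (assumed length)
    if len(text.split()) < min_tokens:        # drop small documents (less than 128 tokens)
        return None
    try:
        if detect(text) != "en":              # drop non-English documents
            return None
    except LangDetectException:
        return None
    return text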
The following command presplits each document stored in a json file into sentences:
python presplit_sentences_json.py --data_dir=<data_dir> --infile=<infile>
where --data_dir indicates the path of the directory containing the json files to presplit, and --infile indicates the name of the json file to presplit. Note that one can presplit all the json files present in /<data_dir> at once by running:
python presplit_sentences_json.py --all=True
This script will pre-split each document in the given json file and perform additional cleaning on the individual sentences (see the sketch after the list), namely:
- If a sentence begins with a number, remove the number;
- If a line begins with a single special character, remove that character;
- Keep only sentences with more than 2 words and fewer than 200 words.
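As an illustration, the per-sentence processing could be sketched as follows; this is not the actual presplit_sentences_json.py code, and the NLTK sentence splitter and the regular expressions are assumptions:
import re
import nltk

def presplit_document(document):
    # Illustrative splitting and filtering mirroring the steps listed above
    kept = []
    for sent in nltk.sent_tokenize(document):    # assumed splitter (requires the punkt model)
        sent = re.sub(r"^\d+\s*", "", sent)      # remove a leading number
        sent = re.sub(r"^[^\w\s]\s*", "", sent)  # remove a single leading special character
        if 2 < len(sent.split()) < 200:          # keep sentences with 3 to 199 words
            kept.append(sent)
    return kept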
The following command creates the train/dev/test data in json form:
python create_train_dev_test_json.py --input_files <in1> <in2> --output_dir=<output_dir> --test_percent <%_train> <%_dev> <%_test>
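For instance, a 90/5/5 split over two cleaned corpora might look like the following; the file names and the way the percentages are expressed are assumptions:
python create_train_dev_test_json.py --input_files corpus_part1.json corpus_part2.json --output_dir=data/ --test_percent 0.90 0.05 0.05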
The following command converts the json file containing all documents into raw text:
python json2text.py --json_file=<json_file> --output_file=<output_file>
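For example, with hypothetical file names:
python json2text.py --json_file=train.json --output_file=train.txt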
[coming up...]