A BERT-base model pre-trained on a large corpus of computer networking text (~23GB).
NetBERT demonstrates clear improvements over BERT on the following two representative text mining tasks:
- Computer Networking Text Classification (0.9% F1 improvement);
- Computer Networking Information Retrieval (12.3% improvement on a custom information retrieval score).
Additional experiments on Word Similarity and Word Analogy tend to show that NetBERT captures more meaningful semantic properties and relations between networking concepts than BERT does. For more information, you can download my thesis.
You can use NetBERT with the 🤗 transformers library as follows:
import torch
from transformers import BertTokenizer, BertForMaskedLM
# Load pretrained model and tokenizer
model = BertForMaskedLM.from_pretrained("antoiloui/netbert")
tokenizer = BertTokenizer.from_pretrained("antoiloui/netbert")
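You can then sanity-check the model by predicting a masked token. The sentence below is purely illustrative and not taken from the original documentation:
# Illustrative networking sentence with a masked token
text = "The router forwards packets based on its routing [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Locate the [MASK] position and print the 5 most likely replacements
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))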
The computer networking corpus was collected by scraping all the text content from cisco.com, which resulted in about 30GB of uncleaned text from 442,028 web pages in total. Pre-processing the original corpus yielded a cleaned dataset of about 170.7M sentences, for a total size of 22.7GB.
The following section describes how to run the cleaning scripts located in 'scripts/data_cleaning/'.
The following command cleans a dataset of documents stored in a json file:
python cleanup_dataset.py --data_dir=<data_dir> --infile=<infile>
where --data_dir indicates the path of the directory containing the json files to clean, and --infile indicates the name of the json file to clean. Note that one can clean all the json files present in /<data_dir> at once by running:
python cleanup_dataset.py --all=True
This script will clean the original dataset as follows (a rough code sketch is given after the list):
- applying the fix_text function from ftfy;
- replacing two or more spaces with one;
- removing sequences of special characters;
- removing small documents (less than 128 tokens);
- removing non-English documents with langdetect.
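For reference, here is a minimal sketch of what these steps amount to; it is not the actual cleanup_dataset.py code, and the regular expressions and special-character threshold are assumptions:
import re
from ftfy import fix_text
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def clean_document(text, min_tokens=128):
    # Illustrative cleaning pass mirroring the steps listed above
    text = fix_text(text)                     # fix broken unicode / mojibake
    text = re.sub(r" {2,}", " ", text)        # replace two or more spaces with one
    text = re.sub(r"[^\w\s]{4,}", "", text)   # remove sequences of special characters (assumed length)
    if len(text.split()) < min_tokens:        # drop small documents (less than 128 tokens)
        return None
    try:
        if detect(text) != "en":              # drop non-English documents
            return None
    except LangDetectException:
        return None
    return text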
The following command presplits each document stored in a json file into sentences:
python presplit_sentences_json.py --data_dir=<data_dir> --infile=<infile>
where --data_dir indicates the path of the directory containing the json files to presplit, and --infile indicates the name of the json file to presplit. Note that one can presplit all the json files present in /<data_dir> at once by running:
python presplit_sentences_json.py --all=True
This script will pre-split each document in the given json file and perform additional cleaning on the individual sentences (see the sketch after the list), namely:
- If a sentence begins with a number, remove the number;
- If a line begins with a single special character, remove that character;
- Keep only sentences with more than 2 words and fewer than 200 words.
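As an illustration, the per-sentence processing could be sketched as follows; this is not the actual presplit_sentences_json.py code, and the NLTK sentence splitter and the regular expressions are assumptions:
import re
import nltk

def presplit_document(document):
    # Illustrative splitting and filtering mirroring the steps listed above
    kept = []
    for sent in nltk.sent_tokenize(document):    # assumed splitter (requires the punkt model)
        sent = re.sub(r"^\d+\s*", "", sent)      # remove a leading number
        sent = re.sub(r"^[^\w\s]\s*", "", sent)  # remove a single leading special character
        if 2 < len(sent.split()) < 200:          # keep sentences with 3 to 199 words
            kept.append(sent)
    return kept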
The following command creates the train/dev/test data in json form:
python create_train_dev_test_json.py --input_files <in1> <in2> --output_dir=<output_dir> --test_percent <%_train> <%_dev> <%_test>
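For instance, a 90/5/5 split over two cleaned corpora might look like the following; the file names and the way the percentages are expressed are assumptions:
python create_train_dev_test_json.py --input_files corpus_part1.json corpus_part2.json --output_dir=data/ --test_percent 0.90 0.05 0.05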
The following command converts the json file containing all documents into raw text:
python json2text.py --json_file=<json_file> --output_file=<output_file>
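For example, with hypothetical file names:
python json2text.py --json_file=train.json --output_file=train.txt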
[coming up...]