Given a JSON input such as:
{
  "input": {
    "title": "hello my name is",
    "subtitle": "another title",
    "address": {
      "road": "Corner Frome Road and, North Terrace",
      "city": "Adelaide",
      "state": "SA",
      "postcode": 5000
    }
  },
  "label": 1
}
To preprocess the input, the key-value resnet model first flattens it into a list of single-pair objects with dotted keys:
[
  { "title": "hello my name is" },
  { "subtitle": "another title" },
  { "address.road": "Corner Frome Road and, North Terrace" },
  { "address.city": "Adelaide" },
  { "address.state": "SA" },
  { "address.postcode": "5000" }
]
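The flattening step could be sketched as follows. This is a hypothetical helper (`flatten` is not a name from the repo), written to match the behaviour shown above: nested keys are joined with dots and values are stringified.

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into a list of single-pair dicts with dotted keys."""
    items = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            items.extend(flatten(value, path))
        else:
            # values are stringified, e.g. the postcode 5000 becomes "5000"
            items.append({path: str(value)})
    return items

example = {
    "title": "hello my name is",
    "address": {"city": "Adelaide", "postcode": 5000},
}
flatten(example)
# -> [{"title": "hello my name is"}, {"address.city": "Adelaide"}, {"address.postcode": "5000"}]
```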
Each flattened key is then treated as a single token, while the value is tokenised character by character:
tokens = ["title", "h", "e", "l", ..., "address.postcode", "5", "0", "0", "0"]
token_ids: torch.LongTensor = tokeniser.convert_tokens_to_ids(tokens)
logits = key_value_resnet(token_ids)
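A minimal sketch of that tokenisation, assuming a `tokenise` helper that does not exist in the repo: each dotted key is emitted whole, then the characters of its value.

```python
def tokenise(flat_pairs):
    """Emit each dotted key as one token, then each character of its value."""
    tokens = []
    for pair in flat_pairs:
        for key, value in pair.items():
            tokens.append(key)
            tokens.extend(value)  # iterating a string yields its characters
    return tokens

tokens = tokenise([{"title": "hi"}, {"address.postcode": "5000"}])
# -> ["title", "h", "i", "address.postcode", "5", "0", "0", "0"]
```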
- The model's vocabulary is set up to contain all keys plus every printable rune from string.printable, which is the combination of digits, ascii_letters, punctuation, and whitespace.
- Keys need to have been added to the model's vocabulary; any unknown key is assigned the "UNK" token.
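The vocabulary construction and UNK fallback described above might look like this. The "PAD"/"UNK" special tokens and their ordering are assumptions (the source only mentions "UNK"); `build_vocab` is a hypothetical name.

```python
import string

def build_vocab(keys):
    """Special tokens first, then all dotted keys, then every printable rune."""
    tokens = ["PAD", "UNK"] + sorted(keys) + list(string.printable)
    return {tok: i for i, tok in enumerate(tokens)}

def convert_tokens_to_ids(vocab, tokens):
    """Map tokens to ids, falling back to the UNK id for unseen keys."""
    unk = vocab["UNK"]
    return [vocab.get(tok, unk) for tok in tokens]

vocab = build_vocab({"title", "address.postcode"})
ids = convert_tokens_to_ids(vocab, ["title", "5", "unknown.key"])
```

Keys are added as whole tokens, so "address.postcode" gets a single id while any character of a value is always in-vocabulary via string.printable.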
# first generate a schema for the model's vocabulary & the model's nn.Embedding; we'll put the schema in the dataset folder
# this only needs to be run once
python -m src.dataset.json_files --train_data_path 'data' --test_data_path 'data' --schema_path 'data/schema.json' --write_to_path True
# then run the training script
python -m train --experiment_name "local_json_1d_resnet"
MVP
- start the readme
- an example with a homemade JSON dataset
- make a run_experiments/local_dataset_bench.sh
Extra
- look into pytorch-lightning-transformers?
- bench Yahoo! Answers because it has fastText results for comparison
- bench the IMDB sequence classification problem, make a run_experiments/imdb.sh, and post results