Given a JSON input such as:
{
  "input": {
    "title": "hello my name is",
    "subtitle": "another title",
    "address": {
      "road": "Corner Frome Road and, North Terrace",
      "city": "Adelaide",
      "state": "SA",
      "postcode": 5000
    }
  },
  "label": 1
}
To preprocess the input, the key-value resnet model first flattens it into a list of single-pair objects with dotted keys:
[
  { "title": "hello my name is" },
  { "subtitle": "another title" },
  { "address.road": "Corner Frome Road and, North Terrace" },
  { "address.city": "Adelaide" },
  { "address.state": "SA" },
  { "address.postcode": "5000" }
]
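The flattening step could be sketched as follows. This is a hypothetical helper (`flatten` is not a name from the repo), written to match the behaviour shown above: nested keys are joined with dots and values are stringified.

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into a list of single-pair dicts with dotted keys."""
    items = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            items.extend(flatten(value, path))
        else:
            # values are stringified, e.g. the postcode 5000 becomes "5000"
            items.append({path: str(value)})
    return items

example = {
    "title": "hello my name is",
    "address": {"city": "Adelaide", "postcode": 5000},
}
flatten(example)
# -> [{"title": "hello my name is"}, {"address.city": "Adelaide"}, {"address.postcode": "5000"}]
```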
Each flattened key is then treated as a single token, while the value is tokenised character by character:
tokens = ["title", "h", "e", "l", ..., "address.postcode", "5", "0", "0", "0"]
token_ids: torch.LongTensor = tokeniser.convert_tokens_to_ids(tokens)
logits = key_value_resnet(token_ids)
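A minimal sketch of that tokenisation, assuming a `tokenise` helper that does not exist in the repo: each dotted key is emitted whole, then the characters of its value.

```python
def tokenise(flat_pairs):
    """Emit each dotted key as one token, then each character of its value."""
    tokens = []
    for pair in flat_pairs:
        for key, value in pair.items():
            tokens.append(key)
            tokens.extend(value)  # iterating a string yields its characters
    return tokens

tokens = tokenise([{"title": "hi"}, {"address.postcode": "5000"}])
# -> ["title", "h", "i", "address.postcode", "5", "0", "0", "0"]
```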
- The model's vocabulary is set up to contain all keys plus every printable rune from string.printable, which is the combination of digits, ascii_letters, punctuation, and whitespace.
- Keys need to have been added to the model's vocabulary; any unknown key is assigned the "UNK" token.
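The vocabulary construction and UNK fallback described above might look like this. The "PAD"/"UNK" special tokens and their ordering are assumptions (the source only mentions "UNK"); `build_vocab` is a hypothetical name.

```python
import string

def build_vocab(keys):
    """Special tokens first, then all dotted keys, then every printable rune."""
    tokens = ["PAD", "UNK"] + sorted(keys) + list(string.printable)
    return {tok: i for i, tok in enumerate(tokens)}

def convert_tokens_to_ids(vocab, tokens):
    """Map tokens to ids, falling back to the UNK id for unseen keys."""
    unk = vocab["UNK"]
    return [vocab.get(tok, unk) for tok in tokens]

vocab = build_vocab({"title", "address.postcode"})
ids = convert_tokens_to_ids(vocab, ["title", "5", "unknown.key"])
```

Keys are added as whole tokens, so "address.postcode" gets a single id while any character of a value is always in-vocabulary via string.printable.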
# first generate a schema for the model's vocabulary & the model's nn.Embedding; we'll put the schema in the dataset folder
# this only needs to be run once
python -m src.dataset.json_files --train_data_path 'data' --test_data_path 'data' --schema_path 'data/schema.json' --write_to_path True
# then run the training script
python -m train --experiment_name "local_json_1d_resnet"
MVP
- start the readme
- an example with a homemade JSON dataset
- make a run_experiments/local_dataset_bench.sh
Extra
- look into pytorch-lightning-transformers?
- bench Yahoo! Answers because it has fastText results for comparison
- bench the IMDB sequence classification problem, make a run_experiments/imdb.sh, and post results