Project Report
The project report (as a blogpost) can be found at project-spelling-error-detection.md
Steps to reproduce:
- Clone the GitHub or SVN repo (the SVN repo is for members of the chair of Prof. Bast)
git clone https://github.com/georgestanley/Spelling-Error
svn co https://ad-svn.informatik.uni-freiburg.de/student-projects/stanley-george
- Go to the respective working branch (only if you have been specifically asked to do so).
- Create a Docker image with the following steps:
- Build the image:
docker build -t stanley-george-project .
- Run the container. The datasets folder also has to be mounted if you want to train on the complete dataset:
docker run -it -p 8050:8050 --gpus=all --name stanley-george-project -v /nfs/students/stanley-george/data:/app/data/ stanley-george-project
Note: To remove an existing container with the same name, use the command:
docker rm stanley-george-project
- Now that the code and data are in place, let's try them:
- Train a spell classifier (from the application folder). For example, to train the LSTM spell classifier with context and semi-character encoding, execute the command:
python -m application.lstm_spell_classifier_w_context --data_folder=data
The input parameters that can be provided are listed in the table in Section xx.
- Evaluate on the test dataset. Execute the same file as in the training step, but set the mode to test:
python -m application.lstm_spell_classifier_w_context --mode=test
- Run all the unit tests. Execute the shell script that triggers all unit-test files:
./application/run_tests.sh
- Run individual unit tests. All the unit tests are in the application/tests folder. You can trigger them from the parent directory as follows:
python -m application.tests.test_lstm_spell_classifier_w_context
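The training and evaluation commands above differ only in their flags. As a minimal sketch, the helper below (the name build_command is hypothetical and not part of the repository) assembles such a command line from a module name and keyword parameters. It only uses flags that appear in this README, and it prints the commands instead of running them.

```python
import shlex
import sys


def build_command(module: str, **params) -> list[str]:
    """Build a `python -m <module> --key=value ...` argument list."""
    command = [sys.executable, "-m", module]
    for name, value in params.items():
        command.append(f"--{name}={value}")
    return command


MODULE = "application.lstm_spell_classifier_w_context"

# Training run: only the data folder is given, all other values stay at their defaults.
train_cmd = build_command(MODULE, data_folder="data")
# Evaluation run: same module, mode switched to test.
test_cmd = build_command(MODULE, mode="test")

# Show the commands in shell form without executing them.
print("python", shlex.join(train_cmd[1:]))
print("python", shlex.join(test_cmd[1:]))
```

To actually launch a run, the returned list could be passed to subprocess.run from the repository root, so that the application package is importable.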
Makefile
- Type make to see the possible options
- Type make start_app_console to start the app in console mode
- Type make start_app_webapp to start the app as a web app built with Dash (go to localhost:8050)
- Type make start_app_file_eval to start the app and test a text file
- Type make run_unit_tests to run the unit tests
- Type make checkstyle to run the PEP8 code style check
Name | Datatype | Default Values Semi-Character with Context | Default Values Semi-Character without Context | Default Values One-hot encoded with context |
---|---|---|---|---|
data_folder | String | data | data | data |
output_root | String | results | results | results |
input_file | String | dev_10.jsonl | top_100_words.json | dev_10.jsonl |
val_file | String | bea60k.repaired.val/bea60_sentences_val_truth_and_false.json | bea60k.repaired.val/bea60_words_val_truth_and_false.json | bea60k.repaired.val/bea60_sentences_val_truth_and_false.json |
epochs | int | 10 | 10 | 10 |
lr | float | 0.001 | 0.01 | 0.001 |
bs | int | 1000 | 1000 | 32 |
hidden_dim | int | 100 | 100 | 100 |
hidden_layers | int | 2 | 2 | 2 |
max_len | int | NA | NA | 60 |
lower_case_mode | bool | False | NA | False |
mode | String | train | train | train |
eval_model_path | String | trained_models/semi_character_w_context.pth | trained_models/semi_character_wo_context.pth | trained_models/onehot_w_context.pth |
eval_file | String | bea60k.repaired.test/bea60_sentences_test_truth_and_false.json | bea60k.repaired.test//bea60_words_test_truth_and_false.json | bea60k.repaired.test//bea60_sentences_test_truth_and_false.json |
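The defaults in the table map naturally onto a command-line parser. The following is a minimal sketch, assuming the scripts use Python's argparse. The function name build_parser is hypothetical, and only the parameter names and the "Semi-Character with Context" defaults are taken from the table. It covers a subset of the parameters, and the repository's actual parser may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the 'Semi-Character with Context' column."""
    parser = argparse.ArgumentParser(description="LSTM spell classifier (sketch)")
    parser.add_argument("--data_folder", type=str, default="data")
    parser.add_argument("--output_root", type=str, default="results")
    parser.add_argument("--input_file", type=str, default="dev_10.jsonl")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--bs", type=int, default=1000)
    parser.add_argument("--hidden_dim", type=int, default=100)
    parser.add_argument("--hidden_layers", type=int, default=2)
    # Assumption: the mode is restricted to the two values used in this README.
    parser.add_argument("--mode", type=str, default="train", choices=["train", "test"])
    return parser


# Parsing an explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--mode=test", "--epochs", "5"])
print(args.mode, args.epochs, args.lr)
```

Any parameter that is not passed on the command line keeps its table default, which is why the training example earlier in this README only sets --data_folder.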
|-- Makefile
|-- README.md
|-- Dockerfile
|-- results
|-- runs
|-- application
| |-- Model.py
| |-- __init__.py
| |-- app.py
| |-- lstm_spell_classifier_w_context.py
| |-- lstm_spell_classifier_w_context_onehot.py
| |-- lstm_spell_classifier_wo_context.py
| |-- run_tests.sh
| |-- tests
| |-- utils
|-- bashrc
|-- data
|-- trained_models
`-- requirements.txt
Folder/file definitions
Folder | Description |
---|---|
results | The output folder for each experiment is generated here; it holds the saved models and log files |
runs | Contains the tensorboard metrics logs. |
application | Contains the main application code |
data | The data files that serve as input to the models |
trained_models | Contains the pretrained models, which can be used for evaluation. |
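Before training or evaluating, it can help to confirm that the folders described above are present. The snippet below is a minimal sketch of such a check; the helper missing_folders is hypothetical and not part of the repository, and the folder names are taken from the table above. The demonstration runs against a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path

# Top-level folders taken from the folder table in this README.
EXPECTED_FOLDERS = ["results", "runs", "application", "data", "trained_models"]


def missing_folders(root: Path) -> list[str]:
    """Return the expected folders that do not exist under `root`."""
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


# Demonstration on a temporary directory that contains only two of the folders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "application").mkdir()
    (root / "data").mkdir()
    demo_missing = missing_folders(root)

print(demo_missing)
```

In a checkout, calling missing_folders(Path(".")) from the repository root would list whichever of these folders are absent.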