This is the repository for the shared task on multilingual lexical normalization at WNUT 2021. More information can be found at noisy-text.github.io/2021/multi-lexnorm.html
This repository contains all data, pre-processed into the same tab-separated format (original word, then its normalization):
Most most
social social
pple people
r are
troublesome troublesome
Some of the languages include annotation for word splits and merges. When a word is split, the normalization column includes a whitespace character; with a merge, the normalization is only included for the first word:
if If
i i
have have
a a
head headache
ache
tomorro tomorrow
ima i'm going to
be be
pissed pissed
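To illustrate how these conventions play out, here is a minimal sketch that reconstructs the normalized sentence from (raw word, normalization) pairs; the function name and the input representation are illustrative assumptions, not part of the task data format.

```python
def reconstruct_normalization(pairs):
    """Join the normalization column of one sentence into a single string.

    `pairs` is assumed to be a list of (raw_token, normalization) tuples.
    A split is a normalization containing a space ("ima" -> "i'm going to");
    a merge leaves the normalization empty for every raw token after the
    first ("head" -> "headache", "ache" -> "").
    """
    # Empty normalizations belong to merged tokens and are skipped;
    # normalizations containing spaces already carry the split words.
    return " ".join(norm for _, norm in pairs if norm != "")


pairs = [("i", "i"), ("have", "have"), ("a", "a"),
         ("head", "headache"), ("ache", ""), ("tomorro", "tomorrow")]
print(reconstruct_normalization(pairs))  # i have a headache tomorrow
```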
To read this data properly in Python 3, do not use `.strip().split()`, but use `.strip('\n').split('\t')` instead.
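As a concrete example, the sketch below reads (word, normalization) pairs with an explicit tab split; a plain `.split()` would break multi-word normalizations apart and drop the empty normalization of merged words. The file name is a placeholder and the treatment of empty lines is an assumption, so adapt it to the actual files.

```python
pairs = []
with open("train.norm", encoding="utf-8") as data_file:  # placeholder file name
    for line in data_file:
        line = line.strip('\n')          # keep tabs and internal spaces intact
        if not line:                     # skip empty lines (possible sentence boundaries)
            continue
        columns = line.split('\t')
        raw = columns[0]
        # The normalization may be empty (merge) or contain spaces (split).
        norm = columns[1] if len(columns) > 1 else ""
        pairs.append((raw, norm))
```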
Contents:
- data: contains the data per language. Note that development data is only included for the larger datasets; for the smaller ones we suggest using k-fold cross-validation.
- scripts: baseline and evaluation scripts.
For any problems with the data (annotation), please post on https://groups.google.com/u/2/g/multilexnorm.