Lexical simplification aims to replace individual words with simpler synonyms. The MLSP shared task provides resources for lexical simplification across many languages. In this project, we use these resources to build our own pipelines. The project is a great opportunity to contribute your own ideas to current research problems!
- Ganahl, Carolin
- Lindenbauer, Tobias
- Wiedmann, Luis
We will use the material provided by the MLSP Shared Task. Although the task is already over, you can still create your own pipelines and compare them to the winning models.
Have a look at their task description and the provided material. Then come up with your own ideas for how to tackle the problem of lexical simplification.
- Datasets are small and of mixed quality
- Literature reports results on distinct datasets and different metrics, making comparability difficult
- Proprietary models exhibit SOTA performance with simple prompt engineering techniques, but their cost and privacy issues make their use in public domain projects difficult
We have decided to focus on the last two pain points and build a comprehensive benchmarking suite to speed up research by reducing the need to re-implement boilerplate code.
This empowers researchers to easily benchmark their approaches on common datasets and metrics, enabling comparability with a wide range of literature.
Furthermore, we evaluate existing approaches on open-source models to enable the transfer of SOTA techniques to production with free and open-source (FOSS) models. We aim to find a "working" approach that does not reach SOTA performance but still adds value to the community.
With our benchmarking suite in place, we investigate whether we can improve lexical simplification performance by ...
- Fine-tuning open-source models on lexical simplification datasets from other languages. We call this approach Leave-One-Out Cross-Validation (LOOCV) training.
- Enhancing the prompt with the topics or domain of the sentence to provide additional context.
- Evaluating the performance of GPT-4o, BERT-based models, and Phi-3 on the lexical simplification task.
You can find the results of our experiments and the datasets in the `data` directory. In the `notes` directory you can find our weekly meeting notes. The midterm presentation and the poster for the poster session are located in the `presentation` directory. All source code is located in the `src` directory. The report, including the final presentation, is located in the `report` directory.
We designed our system to run in Google Colab to enable the use of a GPU to speed up inference. If you wish to test our system, we provide a `main.py` script that demonstrates how to use the `BenchmarkSuite` class to evaluate a text-generation LLM using the `LLMLexicalSimplifier` wrapper. To run our system locally, change the paths to the datasets in the respective dataset classes to the local paths of the datasets.
Open the following notebook in Colab and follow its instructions to clone the project to your Google Drive:
Our system consists of three core components:
- the `LexicalSimplifier` interface
- the `DataProvider` interface
- the `BenchmarkSuite` class
`LexicalSimplifier` is an interface that defines the methods a lexical simplifier must implement. It is used to abstract the implementation of the simplification model from the evaluation pipeline. This allows us to easily swap out different models and compare their performance.
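The actual interface lives in `src`; the following is only a minimal sketch of what it could look like, assuming a single substitution-generation method (the method name and signature here are illustrative, not the real API):

```python
# Illustrative sketch only: the real interface in src may define
# different method names and signatures.
from abc import ABC, abstractmethod


class LexicalSimplifier(ABC):
    """Abstracts the simplification model from the evaluation pipeline."""

    @abstractmethod
    def generate_substitutions(self, sentence: str, complex_word: str) -> list[str]:
        """Return candidate simpler substitutions for complex_word in sentence."""
        raise NotImplementedError
```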
`DataProvider` is an interface that defines the methods a data provider must implement. It is used to abstract the data loading and preprocessing from the evaluation pipeline. This allows us to easily swap out different datasets and evaluate a model's performance in a unified manner across all provided datasets.
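Again as a hedged sketch (the actual interface in `src` may expose different methods), a data provider could look like this:

```python
# Illustrative sketch only: the real DataProvider interface may expose
# different methods for loading and preprocessing.
from abc import ABC, abstractmethod


class DataProvider(ABC):
    """Abstracts dataset loading and preprocessing from the evaluation pipeline."""

    @abstractmethod
    def load(self) -> list[dict]:
        """Return dataset instances, e.g. dicts holding a sentence,
        a complex word, and the gold substitutions."""
        raise NotImplementedError
```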
`BenchmarkSuite` is a class that is used to evaluate the performance of a lexical simplifier. The class constructor takes:

- a `LexicalSimplifier`, the `testee_model`
- a dictionary of `language_configurations`, whose keys are the languages the model is evaluated on (see the `Language` class) and whose values are dictionaries specifying the configuration. A configuration consists of a `pattern`, which the `LexicalSimplifier` uses to generate a `prompt`, and `exemplars`, which are prepended to the prompt for few-shot prompting.
- a boolean flag `should_pass_topc` to toggle topic-aware prompting. This is currently only supported for English.
To summarize, the `BenchmarkSuite` class is the controller of our system and supports language-specific prompt configurations, including exemplars. Currently, we benchmark a model on all available datasets of a language if that language is present in the configuration.
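Put together, constructing a suite could look roughly like this. Note this is a hedged illustration: the import paths, the `Language` member name, the `few_shot_examples` variable, and the `run()` entry point are assumptions; see `main.py` for the actual usage.

```python
# Assumed imports; adjust to the actual module layout under src/:
# from benchmark_suite import BenchmarkSuite, Language
# my_simplifier: any LexicalSimplifier implementation
# few_shot_examples: exemplars for few-shot prompting

suite = BenchmarkSuite(
    testee_model=my_simplifier,
    language_configurations={
        Language.ENGLISH: {
            "pattern": "Simplify '{complex_word}' in: {sentence}",
            "exemplars": few_shot_examples,  # prepended for few-shot prompting
        },
    },
    should_pass_topc=True,  # topic-aware prompting (English only)
)
results = suite.run()  # assumed entry point; see main.py for actual usage
```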
Supported languages and datasets are:
- English: BenchLS, LexMTurk, NNseval, TsarEN
- German: GermEval, our new synthetic dataset
- Spanish: Alexsis
- Portuguese: PorSimplesSent
Adding a new dataset
To add a new dataset to our evaluation framework, simply implement the `DataProvider` interface in your dataset class by inheriting from it. Then, add the dataset to the `_AVAILABLE_DATASETS` dictionary in the `BenchmarkSuite` class. The evaluation pipeline will automatically pick up the new dataset and evaluate the model on it. If your dataset is in a new language, you must also add the language to the `Language` class and a new language entry to the `_AVAILABLE_DATASETS` dictionary in the `BenchmarkSuite` class.
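As a hedged illustration, building on the `DataProvider` sketch above (the real interface methods and the exact structure of `_AVAILABLE_DATASETS` may differ):

```python
# Illustrative only: the method name follows the DataProvider sketch above,
# and the registration step depends on how _AVAILABLE_DATASETS is structured.
class MyNewDataset(DataProvider):
    def load(self) -> list[dict]:
        # Load and preprocess your data here; a trivial placeholder instance:
        return [{
            "sentence": "The cat exhibited peculiar behaviour.",
            "complex_word": "peculiar",
            "substitutions": ["strange", "odd"],
        }]

# Registration (assumed dictionary shape; see the BenchmarkSuite class in src):
# BenchmarkSuite._AVAILABLE_DATASETS[Language.ENGLISH].append(MyNewDataset)
```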
Adding a new model wrapper
To add a new model wrapper to our evaluation framework, for example to benchmark your new, custom architecture on the provided datasets, simply inherit from the `LexicalSimplifier` interface and implement the required methods. Then, pass an instance of your model to the `BenchmarkSuite` class with your desired configuration and run the benchmark suite.
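Again as a sketch under the same assumptions as the interface example above (illustrative method name, assumed constructor arguments):

```python
# Illustrative only: assumes the LexicalSimplifier sketch from above.
class MyCustomSimplifier(LexicalSimplifier):
    def generate_substitutions(self, sentence: str, complex_word: str) -> list[str]:
        # Call your custom model here; a trivial placeholder:
        return [complex_word.lower()]

# Pass it to the suite as the testee_model (constructor arguments as above):
suite = BenchmarkSuite(
    testee_model=MyCustomSimplifier(),
    language_configurations={Language.ENGLISH: {"pattern": "...", "exemplars": []}},
    should_pass_topc=False,
)
```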
We provide a `main.py` script that demonstrates how to use the `BenchmarkSuite` class to evaluate a text-generation LM locally using our `LLMLexicalSimplifier` wrapper. Note that if you want to evaluate a remote LLM, use the `GPTLexicalSimplifier` wrapper instead.
Since the script runs locally, we use DistilGPT2 in our example; the performance of this model is quite poor, which is expected. For a more realistic evaluation, we recommend running the script in Google Colab with a more capable model and a GPU.
The notebooks we provide are for documentation purposes only, in the context of the lab course in which this project was developed. Except for the following notebooks, which also serve as demonstrations for potential users, we make no guarantees about the stability of the implementation:
- `topic_aware_substitutions.ipynb`: Demonstrates how to use the topic-aware substitutions feature for English and shows that our system can dynamically toggle topic awareness between English and other languages.
- `evaluation_experimentation.ipynb`: Demonstrates how to use the `BenchmarkSuite` class to evaluate the Phi-3 LLM and a fill-mask BERT model.
- GermanBERT
- DistilGermanBERT: distilled version of the above model
- DistilBERT multilingual, fine-tuned on DE (LeiKo)
- DistilBERT multilingual, fine-tuned on EN, PT, ES
- DistilBERT multilingual, fine-tuned on EN, DE, PT
- DistilBERT multilingual, fine-tuned on EN, DE, ES
- DistilBERT multilingual, fine-tuned on DE, PT, ES
Generative AI assistants (e.g., GitHub Copilot, JetBrains AI Assistant) were used in the development of this project, but not to generate any core logic or algorithms. Their use was limited to the following tasks:
- Code completion and refactoring
- Generation of documentation templates