FinGEITje is a large open Dutch financial language model with 7 billion parameters, based on GEITje (itself based on Mistral 7B). It has been (further) trained on Dutch financial texts. As a result, it has learned better Dutch and has more knowledge about Dutch financial topics.
- Install Dependencies: Run `poetry install`. FinGEITje uses Poetry as its dependency manager; running this command creates a virtual environment and installs the necessary Python packages.
- Download the Original Dataset: Run `data_downloader` to download the original dataset.
- Translate the Dataset: Run `translation_service` to translate the original dataset.
- Post-Process the Translated Dataset: Run `post_process` to post-process the translated dataset.
- Format the Translation: Run `translation_formatter` to convert the translated data back into the original dataset format.
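The dataset-preparation steps above form a simple linear pipeline: download, translate, post-process, and reformat. A minimal Python sketch of that flow, using hypothetical stage functions as stand-ins for the repository's `data_downloader`, `translation_service`, `post_process`, and `translation_formatter` scripts, might look like:

```python
# Sketch of the dataset-preparation pipeline. All function bodies are
# illustrative stand-ins, not the repository's actual implementations.

def download(dataset_name: str) -> list[dict]:
    # Stand-in: would fetch the original (English) instruction dataset.
    return [{"instruction": "What is EBITDA?",
             "answer": "Earnings before interest, taxes, depreciation and amortization."}]

def translate(records: list[dict]) -> list[dict]:
    # Stand-in: would send each field to the translation service.
    return [{k: f"[nl] {v}" for k, v in r.items()} for r in records]

def post_process(records: list[dict]) -> list[dict]:
    # Stand-in: would clean up artifacts introduced by translation.
    return [{k: v.strip() for k, v in r.items()} for r in records]

def format_records(records: list[dict]) -> list[dict]:
    # Stand-in: would restore the original dataset schema.
    return records

dataset = format_records(post_process(translate(download("finance-instruct"))))
print(dataset[0]["instruction"])  # "[nl] What is EBITDA?"
```

Each stage consumes and produces the same record schema, which is what lets the final formatting step return the data to the original dataset layout.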
- Training Configuration: The training configuration can be found in `config_qlora`. It is a recipe in the format described in the Alignment Handbook, which we used to run the entire training pipeline.
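For readers unfamiliar with Alignment Handbook recipes, a fragment of such a QLoRA config might look roughly like the following. The field names follow the Handbook's recipe format, but the values (and the base model shown) are illustrative only; the actual settings live in the repository's `config_qlora` file.

```yaml
# Illustrative excerpt only -- see config_qlora for the real values.
model_name_or_path: Rijgersberg/GEITje-7B   # assumed base model, for illustration
load_in_4bit: true
use_peft: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
learning_rate: 2.0e-04
num_train_epochs: 1
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
```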
- Evaluation Package: The evaluation package can be found in `evaluation`. To evaluate the model, a set of metrics is defined per task; tasks are grouped per dataset so that every dataset is evaluated separately.
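One way to organize "metrics per task, tasks grouped per dataset" is a pair of lookup tables plus a grouping step. The sketch below is hypothetical (the task, metric, and dataset names are illustrative, not the actual FinGEITje evaluation API):

```python
# Hypothetical organization of the evaluation package: each task has its
# own metrics, and tasks are grouped by dataset so datasets can be
# evaluated independently. Names are illustrative.
from collections import defaultdict

TASK_METRICS = {
    "sentiment": ["accuracy", "f1"],
    "headline_classification": ["accuracy"],
    "ner": ["entity_f1"],
}

TASK_TO_DATASET = {
    "sentiment": "findutchbench",
    "headline_classification": "findutchbench",
    "ner": "findutchbench_ner",
}

def group_tasks_by_dataset() -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for task, dataset in TASK_TO_DATASET.items():
        groups[dataset].append(task)
    return dict(groups)

print(group_tasks_by_dataset())
```

Evaluating then becomes a loop over `group_tasks_by_dataset()`, computing each task's metrics from `TASK_METRICS` within its dataset group.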
- `data_processing.ipynb`: Contains the code showing the exact translations that can be performed before passing the data to our translation service.
- `combine_datasets.ipynb`: Contains the code showing how the translated instruction-tuning data is filtered by a Dutch language check and several predefined checks, then combined into a single instruction-tuning dataset.
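The Dutch language check mentioned above can be pictured as a simple filter over records. A real pipeline would use a proper language detector (e.g. fastText or langdetect); the self-contained stopword heuristic below is only a stand-in to show the filtering pattern:

```python
# Illustrative filter: keep only records whose response looks Dutch.
# The stopword heuristic is a stand-in for a real language detector.
DUTCH_STOPWORDS = {"de", "het", "een", "en", "van", "is", "niet", "dat"}

def looks_dutch(text: str, threshold: float = 0.1) -> bool:
    words = text.lower().split()
    if not words:
        return False
    hits = sum(w in DUTCH_STOPWORDS for w in words)
    return hits / len(words) >= threshold

records = [
    {"response": "Het bedrijf rapporteerde een stijging van de winst."},
    {"response": "The company reported an increase in profit."},
]
kept = [r for r in records if looks_dutch(r["response"])]
print(len(kept))  # 1
```

After this filter (and any other predefined checks), the surviving records from all source datasets can simply be concatenated into one instruction-tuning dataset.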
- `evaluation_nl.ipynb`: Contains the evaluation of `snoels/FinGEITje-7B-sft` on the Dutch financial benchmark `snoels/FinDutchBench`, with and without automated answer extraction.
- `evaluation_en.ipynb`: Contains the evaluation of `snoels/FinGEITje-7B-sft` on the English financial benchmark, with and without automated answer extraction.
This repository is based on the following paper:

[A Dutch Financial Large Language Model](https://arxiv.org/abs/2410.12835)
If you use FinGEITje in your work, please cite:
```bibtex
@article{FinGEITje2024,
  title={A Dutch Financial Large Language Model},
  author={Noels, Sander and De Blaere, Jorne and De Bie, Tijl},
  journal={arXiv preprint arXiv:2410.12835},
  year={2024},
  url={https://arxiv.org/abs/2410.12835}
}
```
Contributions are welcome! If you have ideas, suggestions, or find issues, please open an issue or submit a pull request. We appreciate your help in improving FinGEITje.
We would like to thank:
- Rijgersberg (GitHub) for creating one of the first Dutch foundation models, GEITje: a Dutch large open language model with 7 billion parameters, based on Mistral 7B and further trained on 10 billion tokens of Dutch text. The model can be found at Rijgersberg/GEITje-7B.
- Bram Vanroy (GitHub) for creating one of the first Dutch open-source chat models, GEITje-7B-ultra, and for open-sourcing its training, translation (dutch-instruction-datasets), and evaluation details.
- Silverfin for their collaboration in this research. Silverfin, a Belgian scale-up building an accountancy cloud service, provided valuable insights and resources that were instrumental in the development of FinGEITje. More about their work can be found at Silverfin.
We also extend our gratitude to the contributors of the Alignment Handbook for providing valuable resources that aided in the development of FinGEITje.
This project is licensed under the Apache License 2.0.
For any inquiries or questions, please contact Sander Noels.