Cadet: Asset Management for spaCy Language Models

What is Cadet?

Cadet is a web app for creating custom language objects for spaCy.

  • Goal: To provide an easy-to-use tool that enables non-technical users to start leveraging the power of natural language processing (NLP) in their research projects.
  • Context: CLS Infra + DARIAH-Princeton Workshop Series "NLP 4 New Languages" (funded by the NEH)

New Languages for spaCy?

  • Before you can train a model on annotated data, you need some data to begin with.
  • A spaCy language object contains multiple linguistic assets, not just an annotated corpus.
  • spaCy offers models for many languages, but starting from scratch with a new language is not easy.

Screenshot 2019-11-20 19.48.27

Why Cadet?

  • Accessibility: Makes the collection and processing of language assets accessible to humanists without a background in programming or data science.
  • Customization: Allows users to tailor language data to their specific needs and research domains.
  • Efficiency: Streamlines the process of creating and processing language assets for new spaCy language models.

Two flavors of Cadet

  • Stand-alone web app: User-friendly GUI with an intuitive design that simplifies model creation and customization.
  • Jupyter Notebook: More flexible than the stand-alone web app, but requires knowledge of Python.

How does it work?

  • It takes the user through the eight steps listed below.

1. Create a New Language Object

Building on spaCy's defaults, this step creates a new language object for your language.

2. Provide example sentences

3. Tokenization Check

4. Lookup Tables
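
To illustrate what a lookup table is, here is a minimal Python sketch. It assumes a plain form-to-lemma mapping stored as JSON; the sample entries and the `lemmatize` helper are hypothetical, and the files Cadet actually produces may be laid out differently.

```python
import json
from typing import Dict

# Hypothetical lemma lookup table: surface form -> lemma.
lemma_lookup: Dict[str, str] = {"cats": "cat", "ran": "run", "mice": "mouse"}


def lemmatize(token: str, table: Dict[str, str]) -> str:
    """Return the lemma for a token, falling back to the token itself."""
    return table.get(token.lower(), token)


# Round-trip through JSON, as one would when saving and reloading the table.
serialized = json.dumps(lemma_lookup, ensure_ascii=False, indent=2, sort_keys=True)
restored = json.loads(serialized)

print(lemmatize("Cats", restored))
print(lemmatize("dog", restored))
```

Tokens that are missing from the table are returned unchanged, so an incomplete table never blocks processing.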

5. Load texts for annotation

6. Frequent Tokens

Overview

Bulk Editing
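
The idea behind a frequent-token overview can be sketched in a few lines of Python. This is only an illustration: it splits on whitespace instead of using a spaCy tokenizer, and the sample texts and the `frequent_tokens` helper are hypothetical, not part of Cadet.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical sample texts standing in for the texts loaded in step 5.
texts = ["the cat sat on the mat", "the dog sat"]


def frequent_tokens(texts: List[str], n: int = 3) -> List[Tuple[str, int]]:
    """Count lowercased, whitespace-separated tokens and return the n most common."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(n)


top = frequent_tokens(texts, n=2)
print(top)
```

Starting with the most frequent tokens makes sense because each one recurs many times in the texts, so a single annotation covers many occurrences.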

7. Generate CoNLL-U Files for Export to INCEpTION
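
For orientation, a CoNLL-U file stores one token per line in ten tab-separated columns, uses `_` for unspecified values, and ends every sentence with a blank line. The following Python sketch writes such a file from (form, lemma) pairs. The `to_conllu` helper and the sample sentence are hypothetical, and Cadet's own export may fill in more columns.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (form, lemma); the lemma may be unknown


def to_conllu(sentences: List[List[Token]]) -> str:
    """Serialize sentences to CoNLL-U, leaving unannotated columns as "_"."""
    blocks = []
    for sent_id, tokens in enumerate(sentences, start=1):
        # Assumption: the sentence text is rebuilt by joining forms with spaces.
        text = " ".join(form for form, _ in tokens)
        lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
        for tok_id, (form, lemma) in enumerate(tokens, start=1):
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            columns = [str(tok_id), form, lemma or "_"] + ["_"] * 7
            lines.append("\t".join(columns))
        # Each sentence block is terminated by a blank line.
        blocks.append("\n".join(lines) + "\n\n")
    return "".join(blocks)


sample = [[("Cats", "cat"), ("sleep", None), (".", ".")]]
conllu_text = to_conllu(sample)
print(conllu_text)
```

The columns left as `_` (part of speech, features, dependencies) are the ones an annotator would then fill in with an annotation tool.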

8. Export model for training

Install and run with Docker

  1. Make sure you have Docker installed on your machine (including the docker command).
  2. After cloning this repository, navigate to the root of the repository. For example:
    git clone git@github.com:BCDH/cadet.git
    cd cadet
  3. Build the Docker image:
    docker build -t cadet .
  4. Run the Docker container:
    docker run -p 8000:8000 cadet

Repo template

How to use this template

  1. Click on the green button "Use this template"

  2. Create a new repository for your app. The name is entirely up to you.

  3. When your application is working and ready to deploy, enter the following URL in your browser:

    https://heroku.com/deploy?template=https://github.com/<your git account>/<your repo>/tree/master

Please note that you will be prompted to create a Heroku user account if you do not have one.

Deploy

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 101004984: CLS INFRA, as well as from the National Endowment for the Humanities via New Languages for NLP.