Cadet is a web app for creating custom language objects for spaCy.
- Goal: To provide an easy-to-use tool that enables non-technical users to start leveraging the power of natural language processing (NLP) in their research projects.
- Context: CLS Infra + DARIAH-Princeton Workshop Series "NLP 4 New Languages" (funded by the NEH)
- Before you can train a model on annotated data, you need some data to begin with
- A spaCy language object contains multiple linguistic assets, not just an annotated corpus
- spaCy offers models for many languages, but starting from scratch is not easy
- Accessibility: Makes the collection and processing of language assets accessible to humanists without a background in programming or data science.
- Customization: Allows users to tailor language data to their specific needs and research domains.
- Efficiency: Streamlines the process of creating and processing language assets for new spaCy language models.
- Stand-alone web app: User-friendly GUI with an intuitive design that simplifies model creation and customization.
- Jupyter Notebook: More flexible than the stand-alone web app, but requires some knowledge of Python.
- It takes the user through seven individual steps.
Building from spaCy's defaults, this will create a new language object for your language
- Make sure you have Docker installed on your machine (including the docker command).
- After cloning this repository, navigate to the root of the repository. For example:
git clone git@github.com:BCDH/cadet.git
cd cadet
- Build the Docker image
docker build -t cadet .
- Run the Docker container
docker run -p 8000:8000 cadet
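Because the command above publishes the container's port 8000 on the host, a quick way to confirm that the container is accepting connections is to try opening a TCP connection to that port. The following is a minimal sketch, not part of Cadet itself: the helper name is_port_open is hypothetical, and it assumes the container is reachable at 127.0.0.1. It only checks that something is listening on the port, not that the app has finished starting.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration; the default port matches the
    8000:8000 mapping used in the docker run command.
    """
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Port 8000 is accepting connections.")
    else:
        print("Nothing is listening on port 8000 yet.")
```

If the check fails, make sure the docker run command is still running and that no other process is already using port 8000.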
- Create a new repository for your app. The name is entirely up to you.
- When your application is working and ready to deploy, type the following in your browser:
https://heroku.com/deploy?template=https://github.com/<your git account>/<your repo>/tree/master
Please note that you will be prompted to create a Heroku user account if you do not have one.
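The deploy link is just the template shown above with your own account and repository names filled in. As a small illustration, the sketch below builds it from those two values. The function name heroku_deploy_url and the example values are hypothetical, and the branch defaults to master only because that is what the template above uses; change it if your repository uses a different branch.

```python
def heroku_deploy_url(account: str, repo: str, branch: str = "master") -> str:
    """Fill in the deploy-link template with a GitHub account and repository.

    Hypothetical helper for illustration; it only formats the string and
    does not check that the repository exists.
    """
    if not account or not repo:
        raise ValueError("account and repo must be non-empty")
    template = f"https://github.com/{account}/{repo}/tree/{branch}"
    return f"https://heroku.com/deploy?template={template}"


if __name__ == "__main__":
    # Example values only; replace them with your own account and repository.
    print(heroku_deploy_url("my-account", "my-cadet-app"))
```

Paste the printed address into your browser to start the deployment.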
This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 101004984: CLS INFRA, as well as from the National Endowment for the Humanities via New Languages for NLP.