The goal of this demo is to familiarize yourself with the Fondant framework.
In this demo, we'll be building a simple data preparation pipeline for a code assistant (like StarCoder or GitHub Copilot).
The pipeline consists of 3 components. We already implemented 2 of them; it's up to you to implement the one in the middle.
The first component loads a code dataset from the 🤗 hub, the next component filters this dataset (based on a metric of your choice, such as the number of lines of code), and the final component removes PII (Personally Identifiable Information) from the code.
This way, we end up with a higher quality, anonymized dataset on which we can train a language model.
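To make the last step more concrete, here is a minimal sketch of what removing PII from code can look like. It only masks email-like strings with a regular expression; the function name `redact_emails` and the `<EMAIL>` placeholder are assumptions made for illustration, and the PII component shipped with this demo may work differently.

```python
import re

# Simplified pattern for email-like strings. Real PII removal covers much more
# (names, keys, IP addresses) and is not limited to a single regex.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(code: str, placeholder: str = "<EMAIL>") -> str:
    """Replace every email-like substring in `code` with `placeholder`."""
    return EMAIL_PATTERN.sub(placeholder, code)


sample = "# Author: jane.doe@example.com\nprint('hello')\n"
print(redact_emails(sample))
```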
The following command installs Fondant from PyPI, along with the other requirements:
pip install -r requirements.txt
Start the Jupyter notebook server
jupyter notebook
We have provided some boilerplate for you to get started in components/custom_component.
- The name of the directory originally called custom_component. The build script will automatically use this to name the image of your component
- The CustomComponent class in components/custom_component/src/main.py
- In components/custom_component/fondant_component.yaml
Just as we develop our APIs spec-first, you should do the same for your Fondant component.
Head over to components/custom_component/fondant_component.yaml and specify:
- Which fields of the data you want to access
- Which arguments your component accepts
Fondant will use this information to call your component's transform method.
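The idea of a spec driving the call can be sketched in plain Python. This is not Fondant's implementation and does not use its spec schema: the `spec` dictionary, its keys, the `run_component` helper, and the `max_chars` argument are all hypothetical and only illustrate how declared fields and arguments could end up as inputs to `transform`.

```python
# Hypothetical, simplified stand-in for a component spec. The real
# fondant_component.yaml has its own schema; these keys are illustrative only.
spec = {
    "fields": ["content"],
    "args": {"max_chars": 20},
}


def transform(rows, *, max_chars):
    """Toy transform: keep rows whose `content` is at most `max_chars` long."""
    return [row for row in rows if len(row["content"]) <= max_chars]


def run_component(spec, rows):
    """Select only the declared fields, then pass the declared args as keywords."""
    selected = [{field: row[field] for field in spec["fields"]} for row in rows]
    return transform(selected, **spec["args"])


rows = [
    {"content": "a = 1", "license": "mit"},
    {"content": "x" * 50, "license": "mit"},
]
result = run_component(spec, rows)
print(result)
```

Writing the spec first fixes the signature of `transform` before any logic exists, which is the same benefit spec-first API design gives.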
Implement your component by implementing the transform method in components/custom_component/src/main.py.
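As a starting point for the filtering logic itself, the sketch below filters on the number of non-blank lines of code. It works on a plain list of dictionaries rather than on the data structure your transform method receives, and the `content` key, the helper names, and the default thresholds are assumptions chosen for illustration.

```python
def count_code_lines(code: str) -> int:
    """Count the non-blank lines in a piece of source code."""
    return sum(1 for line in code.splitlines() if line.strip())


def filter_by_line_count(records, min_lines=5, max_lines=1000):
    """Keep records whose `content` has between min_lines and max_lines lines."""
    return [
        record
        for record in records
        if min_lines <= count_code_lines(record["content"]) <= max_lines
    ]


records = [
    {"content": "x = 1\n"},                                         # too short
    {"content": "\n".join(f"v{i} = {i}" for i in range(10))},       # kept
    {"content": "\n\n\n\n\n\n"},                                    # only blank lines
]
kept = filter_by_line_count(records, min_lines=5, max_lines=100)
print(len(kept))
```

Inside your component, the same predicate would be applied to whichever field you declared in the component spec, with `min_lines` and `max_lines` exposed as component arguments.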
We provided a small script to test your component locally. In the future, Fondant will provide this functionality built-in.
Go to your component directory
cd components/custom_component
Install the requirements
pip install -r requirements.txt
And run the script
./run_locally.sh
You will have to add any arguments your component takes to the script as well.
Before you can use your component in your pipeline, you need to build its Docker image.
If you're not there yet, go to your component directory
cd components/custom_component
And run the script
./build_image.sh
Now you can add your component to your pipeline in pipeline.py.
Make sure to update your pipeline name!
You can now run your pipeline by moving back to the root directory and running:
python pipeline.py
You should see a URL in your terminal which brings you to your running pipeline.
Go back to your notebook to validate the results from your pipeline.