Thank you for choosing to contribute to AgML!
If you've found (or already have) a new dataset and you want to contribute it to AgML, the instructions below will help you format the data to the AgML standard.
Currently, we have image classification, object detection, and semantic segmentation datasets available in AgML. These sources are synthesized to standard annotation formats, namely the following:
- Image Classification: Image-To-Label-Number
- Object Detection: COCO JSON
- Semantic Segmentation: Dense Pixel-Wise
Image classification datasets are organized in the following directory tree:
<dataset name>
├── <label 1>
│ ├── image1.png
│ ├── image2.png
│ └── image3.png
└── <label 2>
  ├── image1.png
  ├── image2.png
  └── image3.png
The `AgMLDataLoader` generates a mapping between each of the label names ("label 1", "label 2", etc.) and a numerical value.
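As a rough illustration of the mapping described above, the sketch below derives a label-to-number mapping from the sub-directory names. It is not the `AgMLDataLoader` implementation: the function name `build_label_mapping`, the alphabetical ordering, and the starting index of 0 are all assumptions made for this example.

```python
from pathlib import Path


def build_label_mapping(dataset_root):
    """Map each label directory name to an integer (hypothetical helper).

    Assumption: labels are sorted alphabetically and numbered from 0;
    AgML itself may order or number them differently.
    """
    root = Path(dataset_root)
    # Each immediate, non-hidden sub-directory is treated as one class label.
    labels = sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and not p.name.startswith(".")
    )
    return {name: index for index, name in enumerate(labels)}
```

Running this on a dataset folder before submitting is a quick way to confirm that every class has its own directory and that no stray folders would be picked up as labels.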
Object detection datasets are constructed using COCO JSON formatting. For a general overview, see https://cocodataset.org/#format-data. Another good resource is https://docs.aws.amazon.com/rekognition/latest/customlabels-dg/cd-transform-coco.html. Once you have the images and the bounding box annotations, build a dictionary with the following four keys:
- `images`: A list of dictionaries with the following items:
  - The image file name (without the parent directory!) in `file_name`.
  - The ID (a unique number, usually from 1 to num_images) in `id`.
  - The height and width of the image in `height` and `width`, respectively.
- `annotations`: A list of dictionaries, each representing a unique bounding box (do not stack multiple bounding boxes into a single dictionary, even if they are for the same image!), and containing:
  - The area of the bounding box in `area`.
  - The bounding box itself in `bbox`. Note: The bounding box should have four coordinates. The first two are the x, y of the top-left corner of the bounding box; the other two are its width and height, in that order, as in the standard COCO format.
  - The class label (numerical) of the bounding box in `category_id`.
  - The ID (NOT the filename) of the image it corresponds to in `image_id`.
  - The ID of the bounding box in `id`. For instance, if a unique image has six corresponding bounding boxes, then each of them would be given an `id` from 1-6.
  - `iscrowd` should be set to 0 by default, unless the dataset explicitly comes with `iscrowd` as 1.
  - `ignore` should be 0 by default.
  - `segmentation` only applies for instance segmentation datasets. If converting an instance segmentation dataset to object detection, you can leave the polygonal segmentation as is. Otherwise, put this as an empty list.
- `category`: A list of dictionaries, one per category, each containing:
  - The human-readable name of the class (e.g., "strawberry") in `name`.
  - The supercategory of the class, if there are nested classes, in `supercategory`. Otherwise, just leave this as the string `"none"`.
  - The numerical ID of the class in `id`.
- `info`: A single dictionary with metadata and information about the dataset:
  - `description`: A basic description of the dataset.
  - `url`: The URL from which the dataset was acquired.
  - `version`: The dataset version. Set to `1.0` if unknown.
  - `year`: The year in which the dataset was released.
  - `contributor`: The author(s) of the dataset.
  - `date_created`: The date when the dataset was published. Give an approximate year if unknown.
The dictionary containing this information should be written to a file called `annotations.json`, and the file structure will be:
<dataset name>
├── annotations.json
└── images
  ├── image1.png
  ├── image2.png
  └── image3.png
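The steps above can be sketched in Python as follows. This is a minimal illustration, not AgML's own conversion code: the helper names `make_coco_dict` and `write_annotations` and the tuple layouts of their inputs are hypothetical. It assumes bounding box IDs are numbered consecutively across the whole dataset and that `bbox` uses the standard COCO order `[x, y, width, height]`. The category list is stored under `category` to match the key named in this guide; the standard COCO specification calls that key `categories`, so check which one your tooling expects.

```python
import json
from pathlib import Path


def make_coco_dict(images, boxes, class_names, info):
    """Assemble a COCO-style dictionary (hypothetical helper).

    images: list of (file_name, height, width) tuples.
    boxes: list of (file_name, category_id, x, y, width, height) tuples,
        where (x, y) is the top-left corner of the box.
    class_names: dict mapping numerical class ID -> human-readable name.
    info: dict holding the dataset metadata fields.
    """
    image_ids = {}
    image_entries = []
    for image_id, (file_name, height, width) in enumerate(images, start=1):
        image_ids[file_name] = image_id
        image_entries.append(
            {"file_name": file_name, "id": image_id, "height": height, "width": width}
        )

    annotation_entries = []
    # One dictionary per bounding box, never several boxes in one entry.
    for box_id, (file_name, category_id, x, y, w, h) in enumerate(boxes, start=1):
        annotation_entries.append({
            "area": w * h,
            "bbox": [x, y, w, h],
            "category_id": category_id,
            "image_id": image_ids[file_name],  # the image ID, not its file name
            "id": box_id,
            "iscrowd": 0,
            "ignore": 0,
            "segmentation": [],  # empty for plain object detection
        })

    category_entries = [
        {"name": name, "supercategory": "none", "id": class_id}
        for class_id, name in sorted(class_names.items())
    ]
    return {
        "images": image_entries,
        "annotations": annotation_entries,
        "category": category_entries,
        "info": info,
    }


def write_annotations(dataset_dir, coco):
    """Write the dictionary to annotations.json inside the dataset folder."""
    path = Path(dataset_dir) / "annotations.json"
    path.write_text(json.dumps(coco, indent=2))
    return path
```

Keeping the dictionary construction separate from the file write makes it easy to inspect or validate the result before anything is saved to disk.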
Semantic segmentation datasets are constructed using pixel-wise annotation masks. Each image in the dataset has a corresponding annotation mask. These masks have the following properties:
- Two-dimensional, so no channel shape. Their complete shape will be `(image_height, image_width)`.
- Each of the pixels will be a numerical class label or `0` for background.
The directory tree should look as follows:
<dataset name>
├── annotations
│ ├── mask1.png
│ ├── mask2.png
│ └── mask3.png
└── images
  ├── image1.png
  ├── image2.png
  └── image3.png
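A quick sanity check of this layout might look like the sketch below. The function name `check_segmentation_layout` is hypothetical and the checks are deliberately shallow: it only confirms that both folders exist, that there are as many masks as images, and that every mask is a PNG file (a requirement stated later in this guide). It does not open the files or verify pixel values.

```python
from pathlib import Path


def check_segmentation_layout(dataset_root):
    """Return a list of problems found in a segmentation dataset folder."""
    root = Path(dataset_root)
    images_dir = root / "images"
    masks_dir = root / "annotations"

    problems = [
        f"missing directory: {folder.name}"
        for folder in (images_dir, masks_dir)
        if not folder.is_dir()
    ]
    if problems:
        return problems

    def visible_files(folder):
        # Ignore hidden files such as .DS_Store.
        return sorted(
            p for p in folder.iterdir()
            if p.is_file() and not p.name.startswith(".")
        )

    images = visible_files(images_dir)
    masks = visible_files(masks_dir)
    if len(images) != len(masks):
        problems.append(f"{len(images)} images but {len(masks)} masks")
    for mask in masks:
        if mask.suffix.lower() != ".png":
            problems.append(f"mask is not a PNG: {mask.name}")
    return problems
```

An empty list means none of these basic checks failed.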
If you've found a new dataset that isn't already being used in AgML and you want to add it, there are a few things you need to do.
Any preprocessing code used for the dataset can be kept in `agml/_internal/preprocess.py`, by adding an `elif` statement to the `preprocess()` method with the dataset name. If there is no preprocessing code, then just put a `pass` statement in the block.
- Make sure each image stores integer pixel values in the 0-255 range, as opposed to floats in the 0-1 range. This will prevent any loss of data that could adversely affect training.
- For a semantic segmentation dataset, save the masks in `png` format as opposed to `jpg` or another format.
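For the first point, a conversion along the following lines may be useful in a preprocessing step. This is only a sketch: `to_uint8` is a hypothetical helper, it relies on NumPy, and it assumes that any floating-point input is scaled to the 0-1 range. Images stored in some other float range would need their own scaling.

```python
import numpy as np


def to_uint8(image):
    """Convert an image array to integers in the 0-255 range (hypothetical helper)."""
    image = np.asarray(image)
    if image.dtype == np.uint8:
        # Already in the expected format.
        return image
    if np.issubdtype(image.dtype, np.floating):
        # Assumption: floating-point images are scaled to the 0-1 range.
        image = image * 255.0
    # Round to the nearest integer and clamp before casting to avoid wrap-around.
    return np.clip(np.rint(image), 0, 255).astype(np.uint8)
```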
After processing and standardizing the dataset, make sure that it is organized in one of the formats above, and then go to the parent directory of the dataset directory (for example, if the dataset is in `/root/my_new_dataset`, go to `/root`). Then run the following command:
zip -r my_new_dataset.zip my_new_dataset -x ".*"
If running on macOS, use the following command instead:
zip -r my_new_dataset.zip my_new_dataset -x ".*" -x "__MACOSX"
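If the `zip` command is not available, a similar archive can be produced with Python's standard `zipfile` module. The sketch below is an approximation of the commands above and not an exact replacement: `zip_dataset` is a hypothetical helper, and it skips every hidden file or folder and every `__MACOSX` folder at any depth, which may be stricter than the command-line patterns.

```python
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir):
    """Zip a dataset folder next to itself, skipping hidden and __MACOSX entries."""
    dataset_dir = Path(dataset_dir).resolve()
    archive_path = dataset_dir.parent / f"{dataset_dir.name}.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(dataset_dir.rglob("*")):
            # Paths inside the archive start with the dataset folder name.
            relative = path.relative_to(dataset_dir.parent)
            if any(
                part.startswith(".") or part == "__MACOSX"
                for part in relative.parts
            ):
                continue
            if path.is_file():
                archive.write(path, relative.as_posix())
    return archive_path
```

Calling `zip_dataset("/root/my_new_dataset")` would write `/root/my_new_dataset.zip`, with `my_new_dataset` as the top-level folder inside the archive.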
Next, you need to update the `public_datasources.json` and `source_citations.json` files. Both can be found in the `agml/_assets` folder. You will need to update the `public_datasources.json` file in the following way:
"my_new_dataset": {
    "classes": {
        "1": "class_1",
        "2": "class_2",
        "3": "class_3"
    },
    "ml_task": "See the table for the different dataset types.",
    "ag_task": "The agricultural task that is associated with the dataset.",
    "location": {
        "continent": "The continent the dataset was collected on.",
        "country": "The country the dataset was collected in."
    },
    "sensor_modality": "Usually rgb, but can include other image modalities.",
    "real_synthetic": "Are the images real or synthetically generated?",
    "platform": "handheld or ground",
    "input_data_format": "See the table for the different dataset types.",
    "annotation_format": "See the table for the different dataset types.",
    "n_images": "The total number of images in the dataset.",
    "docs_url": "Where can the user find the most clear information about the dataset?"
}
Note: If the dataset is captured in multiple countries or you don't know where it is from, then put "worldwide" for both "continent" and "country".
| Dataset Format | ml_task | annotation_format |
|---|---|---|
| Image Classification | image_classification | directory_names |
| Object Detection | object_detection | coco_json |
| Semantic Segmentation | semantic_segmentation | image |
The `source_citations.json` file should be updated this way:
"my_new_dataset": {
    "license": "The license being used by the dataset.",
    "citation": "The paper/library to cite for the dataset."
}
If the dataset has no license or has no citation, leave the corresponding lines blank.
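Editing these files by hand works fine, but the updates can also be scripted. The sketch below assumes each file holds a single top-level JSON object keyed by dataset name, as the snippets above suggest, and that four-space indentation is acceptable; `add_entry` and `ANNOTATION_FORMATS` are hypothetical names introduced for this example.

```python
import json
from pathlib import Path

# ml_task -> annotation_format, taken from the table above.
ANNOTATION_FORMATS = {
    "image_classification": "directory_names",
    "object_detection": "coco_json",
    "semantic_segmentation": "image",
}


def add_entry(json_path, name, entry):
    """Insert `entry` under `name` in a JSON file holding one top-level object."""
    path = Path(json_path)
    data = json.loads(path.read_text())
    if name in data:
        # Refuse to silently overwrite an existing dataset entry.
        raise KeyError(f"{name!r} is already present in {path.name}")
    data[name] = entry
    path.write_text(json.dumps(data, indent=4) + "\n")
```

For example, `add_entry("agml/_assets/source_citations.json", "my_new_dataset", {"license": "", "citation": ""})` would register a dataset that has neither a license nor a citation. Looking up `annotation_format` through `ANNOTATION_FORMATS[ml_task]` keeps the two fields of a `public_datasources.json` entry consistent with the table.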
Once you've readied the dataset, create a new pull request on the AgML repository. We will then review the changes and determine the next steps for adding the dataset to AgML's public data storage.
To install uv, follow the guidelines at https://docs.astral.sh/uv/getting-started/installation/. The standalone installation is recommended.
To build the associated wheels, simply run:
uv build
To sync the dependencies, simply run:
uv sync
To run a script using the project's environment:
uv run python <script>