Thank you for your interest in contributing to ChemNLP! There are many ways to contribute, including implementing datasets, improving code, and enhancing documentation.
- Create a GitHub account if you don't have one.
- Fork the ChemNLP repository.
- Clone your fork.
- Create a new branch for your contribution.
- Set up your development environment as described in the "Installation and set-up" section of README.md.
One of the most valuable contributions is implementing a dataset. Here's how to do it:
- Choose a dataset from our awesome list or add a new one there.
- Create an issue in this repository stating your intention to add the dataset.
- Make a Pull Request (PR) that adds a new folder in data with the following files:
  - meta.yaml: Describes the dataset (see structure below).
  - transform.py: Python code to transform the original dataset into a usable form.

Structure of meta.yaml:

    name: dataset_name
    description: Short description of the dataset
    targets:
      - id: target_name
        description: Description of the target
        units: Units of the target (if applicable)
        type: continuous or boolean
        names:
          - noun: target noun
          - adjective: target adjective
    benchmarks:
      - name: benchmark_name
        link: benchmark_link
        split_column: split
    identifiers:
      - id: identifier_name
        type: SMILES, InChI, etc.
        description: Description of the identifier
    license: Dataset license
    num_points: Number of datapoints
    links:
      - name: link_name
        url: link_url
        description: Link description
    bibtex: Citation in BibTeX format

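
Before opening a PR, it can help to sanity-check the metadata. The following is a minimal sketch, not part of ChemNLP: `check_meta`, `REQUIRED_KEYS`, and `ALLOWED_TARGET_TYPES` are hypothetical names, and treating every top-level key except benchmarks as required is an assumption based only on the structure shown above. It expects the file to be already parsed into a dictionary.

```python
# Hypothetical helper, not part of ChemNLP.
# Assumption: every top-level key from the structure above, except
# "benchmarks", is treated as required.
REQUIRED_KEYS = [
    "name", "description", "targets", "identifiers",
    "license", "num_points", "links", "bibtex",
]
# Assumption: only the two target types named in the structure are accepted.
ALLOWED_TARGET_TYPES = {"continuous", "boolean"}


def check_meta(meta):
    """Return a list of problems found in a parsed meta.yaml dictionary.

    An empty list means this sketch found nothing to complain about.
    """
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in meta]
    for target in meta.get("targets", []):
        if target.get("type") not in ALLOWED_TARGET_TYPES:
            problems.append(
                f"target {target.get('id')!r} has unexpected type {target.get('type')!r}"
            )
    for identifier in meta.get("identifiers", []):
        if "id" not in identifier or "type" not in identifier:
            problems.append("identifier entries need both 'id' and 'type'")
    return problems
```

With PyYAML installed, the dictionary can be obtained with `yaml.safe_load` on the opened meta.yaml file and passed to `check_meta`.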
Guidelines for transform.py:
- Download data from an official source or upload it to a repository and retrieve it from there.
- For tabular datasets: remove/merge duplicates, rename columns, and drop unused columns.
- Output should be as lean as possible, typically in a data_clean.csv file.
- Add any necessary dependencies to dev-requirements.txt or requirements.txt.
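
The cleaning steps above can be sketched as follows. This is an illustration under stated assumptions rather than a required layout: the column names in `RENAME` and the helper names `clean_rows` and `write_clean_csv` are hypothetical, the download step is left out, and only the standard library is used (pandas would work equally well).

```python
import csv
from pathlib import Path

# Hypothetical mapping: raw column name -> clean column name.
# Columns that are not listed here are dropped.
RENAME = {"smiles": "SMILES", "value": "property"}


def clean_rows(rows, rename=RENAME):
    """Rename columns, drop unlisted columns, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        slim = {new: row[old] for old, new in rename.items()}
        key = tuple(slim.values())
        if key not in seen:  # keep only the first occurrence of each row
            seen.add(key)
            cleaned.append(slim)
    return cleaned


def write_clean_csv(rows, path="data_clean.csv"):
    """Write the cleaned rows (a non-empty list of dicts) to a CSV file."""
    with Path(path).open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

A transform.py built this way would read the raw rows (for example with `csv.DictReader`), pass them through `clean_rows`, and hand the result to `write_clean_csv`.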
Text templates are used for sampling and can use data from meta.yaml, recode categorical data, and chain multiple data fields. Examples include:
- Basic template: The molecule with {SMILES__description} {SMILES#} has {property#} {property__units}.
- Multiple choice template: Task: Answer the multiple choice question. Question: Is the molecule with {SMILES__description} {SMILES#} {property__names__adjective}? Options: {%multiple_choice_enum%2%aA1} {property%} Answer: {%multiple_choice_result}
- Benchmarking template: Is the molecule with {SMILES__description} {SMILES#} {property__names__adjective}?<EOI>{property#yes&no}
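
To make the placeholder syntax of the basic template concrete, here is a rough sketch of how it might be filled. It is not ChemNLP's sampler. It assumes that `{name#}` stands for the value in the data row and that `{name__field}` looks up a field of the matching meta.yaml entry; `fill_template`, the flat `meta` dictionary keyed by id, and the example values are all hypothetical. Placeholders it does not understand, such as the multiple choice ones, are left unchanged.

```python
import re


def fill_template(template, row, meta):
    """Fill `{name#}` and `{name__field}` placeholders in a template.

    Assumptions of this sketch: `row` maps column names to values, and
    `meta` maps a target or identifier id to its meta.yaml entry.
    """

    def replace(match):
        key = match.group(1)
        if key.endswith("#"):  # assumed meaning: value from the data row
            return str(row[key[:-1]])
        if "__" in key:  # assumed meaning: (nested) metadata lookup
            name, *path = key.split("__")
            value = meta[name]
            for part in path:
                value = value[part]
            return str(value)
        return match.group(0)  # leave anything else untouched

    return re.sub(r"\{([^{}]+)\}", replace, template)
```

For example, with a row `{"SMILES": "CCO", "property": "1.0"}` and metadata giving SMILES the description "SMILES" and property the units "mg/L" (placeholder values), the basic template becomes "The molecule with SMILES CCO has 1.0 mg/L."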
- Ensure your code passes all existing tests.
- Add new tests for any new functionality you introduce.
- Run pytest to check that all tests pass.
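
A new test can be as small as a function whose name starts with `test_` and that uses plain `assert` statements; pytest collects such functions from files named `test_*.py` or `*_test.py`. The sketch below tests a hypothetical helper, `find_duplicates`, that is defined inline only to keep the example self-contained. In practice the helper would be imported from the code under test.

```python
def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for value in values:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def test_find_duplicates_reports_repeated_smiles():
    assert find_duplicates(["CCO", "CCC", "CCO"]) == ["CCO"]


def test_find_duplicates_accepts_unique_input():
    assert find_duplicates(["CCO", "CCC"]) == []
```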
- Commit your changes using conventional commits.
- Push your changes to your fork.
- Create a Pull Request to the main ChemNLP repository.
- Respond to any feedback on your PR.
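
As an illustration of the conventional commit format mentioned above, a message header has the form `type(scope): description`, with the scope optional and an optional `!` marking a breaking change. The checker below is a sketch and not a tool used by this repository: `is_conventional` is a hypothetical name, and the list of types is an assumption based on common practice, since the Conventional Commits specification itself only defines `feat` and `fix`.

```python
import re

# Assumption: a commonly used set of types; the specification itself
# only defines "feat" and "fix" and allows others.
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
         "test", "build", "ci", "chore", "revert")

_types = "|".join(TYPES)
HEADER = re.compile(
    rf"^(?P<type>{_types})"        # commit type
    r"(\((?P<scope>[^()\s]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"            # optional breaking-change marker
    r": (?P<description>\S.*)$"    # colon, space, non-empty description
)


def is_conventional(message):
    """Check whether the first line of a commit message matches the pattern."""
    first_line = message.splitlines()[0] if message else ""
    return HEADER.match(first_line) is not None
```

Under these assumptions, a header such as "feat(data): add dataset" passes, while "added dataset" does not.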
Thank you for contributing to ChemNLP! Your efforts help advance chemical natural language processing research and applications.