labels
usage: acheron build label [-h] [--dry_run] [-c CORES] [-o OUT] -d DATASET [-m {MIC}] -n NAME --columns COLUMNS --key KEY -p PATH
Labels can be created by declaring a column in a table (CSV, Excel, or pandas dataframe) or by using a prebuilt module. Current modules include {MIC}. For an example of what this table should look like, see the bottom of this wiki page.
-m, --module: Use this option to select an acheron-supplied module, which applies its own filtering to the labels. If you are using custom labels without a module, consider cleaning your labels and binning them into nearby groups. For example, if you have continuous numbers between 1 and 10, consider rounding to the nearest whole number.
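The rounding suggestion above can be sketched as follows. This is only an illustration for custom labels, not part of acheron: `round_to_bin` and the `raw_labels` values are hypothetical. It rounds halves up explicitly, because Python's built-in `round()` rounds halves to the nearest even number.

```python
import math


def round_to_bin(value: float) -> int:
    """Round to the nearest whole number, with halves rounded up.

    Hypothetical helper for illustration; not an acheron function.
    """
    return math.floor(value + 0.5)


# Made-up continuous labels keyed by sequence name.
raw_labels = {"seq_1": 1.2, "seq_2": 4.5, "seq_3": 9.7}

# Bin each label before writing it back to the label table.
binned = {name: round_to_bin(value) for name, value in raw_labels.items()}
print(binned)
```

The binned values would then replace the continuous column in the table passed with -p.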
The MIC module bins MICs into appropriate classes. This keeps the model predicting on a set of discrete values instead of treating 16 mg/L and == 16 mg/L as different classes.
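To illustrate why this kind of normalization matters, here is a minimal sketch that maps different spellings of the same MIC to one class label. It is not the acheron MIC module: the real module bins values using class ranges, and `normalize_mic` is a hypothetical helper that only unifies the string form.

```python
import re

# Optional comparison sign followed by a number, e.g. ">= 32", "<=1", "== 16".
_MIC_PATTERN = re.compile(r"^\s*(<=|>=|==|<|>|=)?\s*(\d+(?:\.\d+)?)\s*$")


def normalize_mic(raw: str) -> str:
    """Return a canonical string for an MIC value (hypothetical helper)."""
    match = _MIC_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"unrecognized MIC value: {raw!r}")
    sign, number = match.groups()
    # A bare number, "=", and "==" all mean an exact value, so drop the sign.
    if sign in (None, "=", "=="):
        sign = ""
    # ":g" drops a trailing ".0" so "16.0" and "16" agree.
    return f"{sign}{float(number):g}"


print(normalize_mic("16"))     # 16
print(normalize_mic("== 16"))  # 16
print(normalize_mic(">= 32"))  # >=32
```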
Currently supported antimicrobials and their 3-letter codes are: {AMP: Ampicillin, AMC: Amoxicillin & Clavulanic acid, FOX: Cefoxitin, CRO: Ceftriaxone, TIO: Ceftiofur, GEN: Gentamicin, FIS: Sulfisoxazole, SXT: Trimethoprim & Sulfamethoxazole, AZM: Azithromycin, CHL: Chloramphenicol, CIP: Ciprofloxacin, NAL: Nalidixic Acid, TET: Tetracycline}
If you are using antimicrobials not listed above, you need to add their class ranges into data/label_modules/mic/class_ranges.yaml
-n, --name: The name you want to refer to these labels by. When you build a model later, you will need to pass this name in. If you are predicting MIC or SIR values, including the antimicrobial's 3-letter code in the label name will add major and very major error rates, based on CLSI breakpoints, to the generated summaries.
--columns: Which columns in the sheet to include. This can be a comma-separated list on the command line, like AMP_MIC,CIP_MIC, or a path to a numpy array of names, like data/mic_list.npy
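A numpy array of column names can be produced with `numpy.save`. The sketch below assumes acheron reads the file with a plain `numpy.load`, which this page does not confirm. It writes to a temporary directory so it can run anywhere; in practice you would save to a path such as data/mic_list.npy.

```python
import tempfile
from pathlib import Path

import numpy as np

# Column names to include, stored as a 1-D array of strings.
column_names = np.array(["AMP_MIC", "CIP_MIC"])

with tempfile.TemporaryDirectory() as tmp_dir:
    # In a real project this would be e.g. data/mic_list.npy.
    npy_path = Path(tmp_dir) / "mic_list.npy"
    np.save(npy_path, column_names)

    # Read it back to confirm the round trip.
    loaded = np.load(npy_path)

print(loaded.tolist())
```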
--key: Which column contains the names of the sequences. THIS MUST MATCH THE WGS FILENAME without the extension. If your files are [seq_1.fasta, seq_2.fasta], then you need a column whose elements are seq_1 and seq_2
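Before building labels, it can help to check that every key value has a matching sequence file. The following is a sketch of such a check, not an acheron feature; `missing_keys` is a hypothetical helper. Note that `Path.stem` strips only the last extension, so a file like seq_1.fasta.gz would yield seq_1.fasta.

```python
from pathlib import Path


def missing_keys(key_values, sequence_files):
    """Return key values with no matching filename stem (hypothetical helper)."""
    stems = {Path(filename).stem for filename in sequence_files}
    return sorted(set(key_values) - stems)


files = ["seq_1.fasta", "seq_2.fasta"]

print(missing_keys(["seq_1", "seq_2"], files))  # []
print(missing_keys(["seq_1", "seq_3"], files))  # ['seq_3']
```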
Example (using module): acheron build label --module MIC --name my_mics_1 --columns AMP_MIC,CIP_MIC --key names -p data/my_dataset_1/labels/all_metadata.csv -d my_dataset_1
Example: acheron build label --name my_SIRs --columns data/dataset/sirs_list.npy --key names -p data/my_dataset_1/labels/all_metadata.csv -d my_dataset_1
Example table. It can be placed anywhere; in this example it is at data/my_dataset_1/labels/all_metadata.csv
names | AMP_MIC | CIP_MIC | AMP_SIR | CIP_SIR |
---|---|---|---|---|
seq_1 | >= 32 | 16 | S | S |
seq_2 | 32 | 16 | R | S |
seq_3 | <=1 | 1 | R | I |
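As a rough sketch, the table above could be written as CSV text with Python's standard `csv` module. This assumes acheron expects an ordinary comma-separated file with a header row, which this page implies but does not spell out.

```python
import csv
import io

# The example rows from the table above.
rows = [
    {"names": "seq_1", "AMP_MIC": ">= 32", "CIP_MIC": "16", "AMP_SIR": "S", "CIP_SIR": "S"},
    {"names": "seq_2", "AMP_MIC": "32", "CIP_MIC": "16", "AMP_SIR": "R", "CIP_SIR": "S"},
    {"names": "seq_3", "AMP_MIC": "<=1", "CIP_MIC": "1", "AMP_SIR": "R", "CIP_SIR": "I"},
]

# Write to an in-memory buffer; swap in open(path, "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```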