RylanSteinkey edited this page Jan 29, 2021 · 10 revisions

acheron build label

usage: acheron build label [-h] [--dry_run] [-c CORES] [-o OUT] -d DATASET [-m {MIC}] -n NAME --columns COLUMNS --key KEY -p PATH

Labels can be created by declaring a column in a table (CSV, Excel, or pandas DataFrame) or by using a prebuilt module. Current modules include {MIC}. For an example of what this table should look like, see the bottom of this wiki page.

-m, --module

Use this option to apply an acheron-supplied module. A module applies its own cleaning to the labels, such as the binning done by the MIC module. If you are using custom labels without a module, consider cleaning your labels and binning them into nearby groups. For example, if you have continuous numbers between 1 and 10, consider rounding to the nearest whole number.
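The rounding suggestion above can be sketched in Python as follows. This is a minimal illustration and not part of acheron; the function name bin_to_nearest_integer and the sample values are made up for the example.

```python
from decimal import Decimal, ROUND_HALF_UP


def bin_to_nearest_integer(value):
    """Round a continuous label to the nearest whole number (halves round up)."""
    # Going through Decimal avoids the built-in round(), which rounds halves
    # to the nearest even number (round(2.5) == 2).
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Hypothetical continuous labels between 1 and 10.
raw_labels = [1.2, 3.7, 9.5, 4.49]
binned = [bin_to_nearest_integer(v) for v in raw_labels]
print(binned)
```

Whether halves should round up or to the nearest even number is a choice for your own data; the point is only that every sample ends up in one of a small set of discrete classes.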

MIC module

This module bins MICs into appropriate classes. This keeps the model predicting on a set of discrete values instead of treating 16 mg/L and == 16 mg/L as different classes.

Currently supported antimicrobials and their 3-letter codes are: {AMP: Ampicillin, AMC: Amoxicillin & Clavulanic acid, FOX: Cefoxitin, CRO: Ceftriaxone, TIO: Ceftiofur, GEN: Gentamicin, FIS: Sulfisoxazole, SXT: Trimethoprim & Sulfamethoxazole, AZM: Azithromycin, CHL: Chloramphenicol, CIP: Ciprofloxacin, NAL: Nalidixic Acid, TET: Tetracycline}

If you are using antimicrobials not listed above, you need to add their class ranges to data/label_modules/mic/class_ranges.yaml
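To picture what binning MICs into classes means, here is a rough Python sketch. It is not acheron's implementation: the function bin_mic, the regular expression, and the class range of 1 to 32 mg/L are all assumptions chosen for illustration, and the real ranges live in class_ranges.yaml.

```python
import re

# Optional sign, a number, and an optional "mg/L" unit, e.g. "== 16 mg/L".
MIC_PATTERN = re.compile(r"^\s*(<=|>=|==|=)?\s*(\d+(?:\.\d+)?)\s*(?:mg/L)?\s*$")


def bin_mic(raw, lowest, highest):
    """Map a raw MIC string to a class label between lowest and highest."""
    match = MIC_PATTERN.match(str(raw))
    if match is None:
        raise ValueError(f"unrecognised MIC value: {raw!r}")
    sign = match.group(1) or ""
    value = float(match.group(2))
    # Assumption: censored or out-of-range values collapse into the edge classes.
    if sign == ">=" or value >= highest:
        return f">={highest:g}"
    if sign == "<=" or value <= lowest:
        return f"<={lowest:g}"
    # "==", "=" and unsigned values in range all map to the same class.
    return f"{value:g}"


# Hypothetical class range of 1 to 32 mg/L.
for raw in ["16 mg/L", "== 16 mg/L", ">= 32", "<=1", "64"]:
    print(raw, "->", bin_mic(raw, lowest=1, highest=32))
```

With this sketch, 16 mg/L and == 16 mg/L both land in the class 16, which is the behaviour the module description is after.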

-n, --name

The name you want to give these labels. You will need to pass this name in when you build a model later. If you are predicting MIC or SIR values, including those 3 letters (MIC or SIR) in the label name adds major and very major error rates, based on CLSI breakpoints, to the generated summaries.

--columns

Which columns in the sheet to include. This can be a comma-separated list on the command line, like AMP_MIC,CIP_MIC, or a path to a numpy array of names, like data/mic_list.npy
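One way to produce such a numpy file is sketched below. It assumes acheron accepts a plain 1-D string array saved with numpy.save; the helper save_column_names is hypothetical, and the demo writes to a temporary directory rather than to data/.

```python
import os
import tempfile

import numpy as np


def save_column_names(names, path):
    """Save a list of column names as a 1-D numpy string array."""
    np.save(path, np.array(names, dtype=str))


columns = ["AMP_MIC", "CIP_MIC"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mic_list.npy")
    save_column_names(columns, path)
    # Load the file back to confirm the names survive the round trip.
    loaded = np.load(path).tolist()
print(loaded)
```

In practice you would call save_column_names(columns, "data/mic_list.npy") and pass that path to --columns.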

--key

Which column contains the names of the sequences. THIS MUST MATCH THE WGS FILENAME without the extension. If your files are [seq_1.fasta, seq_2.fasta], then you need a column whose elements are seq_1 and seq_2.
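Before building labels, it can help to check that the key column really matches the sequence filenames. The following is a small standalone sketch, not an acheron feature; find_key_mismatches is a hypothetical helper.

```python
from pathlib import Path


def find_key_mismatches(keys, filenames):
    """Compare key-column values against filenames with the extension removed.

    Returns (keys with no matching file, files with no matching key).
    """
    # Path.stem drops only the last extension: "seq_1.fasta" -> "seq_1".
    stems = {Path(name).stem for name in filenames}
    key_set = set(keys)
    return sorted(key_set - stems), sorted(stems - key_set)


keys = ["seq_1", "seq_3"]
files = ["seq_1.fasta", "seq_2.fasta"]
missing_files, missing_keys = find_key_mismatches(keys, files)
print("keys without a file:", missing_files)
print("files without a key:", missing_keys)
```

Note that Path.stem removes only the final suffix, so a compressed file such as seq_1.fasta.gz would give seq_1.fasta and need extra handling.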

Example (using module): acheron build label --module MIC --name my_mics_1 --columns AMP_MIC,CIP_MIC --key names -p data/my_dataset_1/labels/all_metadata.csv -d my_dataset_1

Example: acheron build label --name my_SIRs --columns data/dataset/sirs_list.npy --key names -p data/my_dataset_1/labels/all_metadata.csv -d my_dataset_1

Example table. It can be placed anywhere; in this example it is saved at data/my_dataset_1/labels/all_metadata.csv

names    AMP_MIC    CIP_MIC    AMP_SIR    CIP_SIR
seq_1    >= 32      16         S          S
seq_2    32         16         R          S
seq_3    <=1        1          R          I
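Since the example path ends in .csv, the table above would be stored as comma-separated text. The sketch below renders the same rows with Python's csv module; the helper table_as_csv is made up for illustration, and the values are copied from the table as-is.

```python
import csv
import io

HEADER = ["names", "AMP_MIC", "CIP_MIC", "AMP_SIR", "CIP_SIR"]
ROWS = [
    ["seq_1", ">= 32", "16", "S", "S"],
    ["seq_2", "32", "16", "R", "S"],
    ["seq_3", "<=1", "1", "R", "I"],
]


def table_as_csv(header, rows):
    """Return the table as CSV text, one sample per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = table_as_csv(HEADER, ROWS)
print(csv_text)
```

Writing csv_text to data/my_dataset_1/labels/all_metadata.csv gives a file whose names column can be passed as --key and whose other columns can be listed in --columns.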