features
usage: acheron build feature [-h] [--dry_run] [-c CORES] [-o OUT] -d DATASET -t {kmer,genes,abricate,omnilog} [-k KMER_LENGTH] [-db {AMR,VF}]
Currently supported types are kmer, genes, abricate, and omnilog.
For k-mers of length 11 or shorter, all 4^k possible strings of length k are computed, and this master list is used to populate a pandas dataframe. For k-mers of length 13 or longer, acheron instead scans the sequences for the k-mers that are actually seen and uses that list; this takes more time and can be extremely memory expensive. Tasks requiring more than 1 TB of RAM are distributed over multiple nodes using the Slurm workload manager, for users on supercomputing clusters. Please avoid even k-mer lengths: an even-length k-mer can be its own reverse complement, and some programs will not allow such lengths because palindromic k-mers cause self loops.
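The two strategies above can be sketched as follows. This is an illustration of the idea only, not acheron's actual code: for small k, enumerating all 4^k strings is cheap; for large k, only the k-mers actually present in the data are collected.

```python
from itertools import product

def all_kmers(k):
    """Enumerate every possible k-mer (4**k strings) -- feasible for k <= 11."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

def observed_kmers(sequences, k):
    """Scan sequences and collect only the k-mers actually seen (k >= 13)."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen

# For k = 11 there are 4**11 = 4,194,304 possible k-mers, so a full
# enumeration is practical; for k = 31 there are 4**31 (~4.6e18), so
# only the observed set is tractable.
```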
11-mer Example: acheron build feature -c 16 -t kmer -k 11 -d dataset_name_1
31-mer Example: acheron build feature -c 8 -t kmer -k 31 -d dataset_name_1
There is also support for slurm (supercomputing cluster), but it has only been tested on the Public Health Agency of Canada's cluster.
11-mer Example: acheron build feature -c 144 -t kmer -k 11 -d dataset_name_1 --cluster slurm
31-mer Example: acheron build feature -c 16 -t kmer -k 31 -d dataset_name_1 --cluster slurm
Using more cores means more work happens concurrently, and therefore more RAM is used. If you get out-of-RAM errors, lower the -c (number of cores) value.
The genes type uses all genes, encoded as presence/absence.
Example: acheron build feature -t genes -d dataset_name_1
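A presence/absence encoding like the one described above can be sketched as follows. The sample and gene names here are hypothetical, and this is not acheron's internal representation:

```python
import pandas as pd

# Hypothetical input: the set of genes detected in each sample.
genes_per_sample = {
    "sample_1": {"geneA", "geneB"},
    "sample_2": {"geneB", "geneC"},
}

all_genes = sorted(set().union(*genes_per_sample.values()))

# Rows = samples, columns = genes, cells = 1 (present) or 0 (absent).
matrix = pd.DataFrame(
    [[1 if g in genes else 0 for g in all_genes]
     for genes in genes_per_sample.values()],
    index=list(genes_per_sample),
    columns=all_genes,
)
```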
The abricate type uses AMR or VF genes only; you must specify which with -db.
Example: acheron build feature -t abricate -db AMR -d dataset_name_1
Requires the data to be structured as a table (csv, excel, or pandas dataframe) with columns labeled by substrate, rows labeled by sample, and each cell containing the area under the curve. The table must be placed in data/dataset_name_1/omnilog/omnilog.{ext}.
Example: acheron build feature -t omnilog -d dataset_name_1
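A minimal sketch of a table shaped the way the omnilog type expects it, with substrate columns, sample rows, and area-under-the-curve cells. The substrate names, sample names, and values are made up for illustration:

```python
import pandas as pd

# Hypothetical OmniLog table: rows are samples, columns are substrates,
# and each cell is the area under the respiration curve.
data = pd.DataFrame(
    {"D-Glucose": [1031.5, 887.2], "Maltose": [640.0, 712.3]},
    index=["sample_1", "sample_2"],
)

# Saved to the expected location, e.g.:
# data.to_csv("data/dataset_name_1/omnilog/omnilog.csv")
```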
Prefiltering is a process where k-mers are filtered as they are read, before being placed into a matrix, to save RAM. For example, a 31-mer matrix of 6,328 sequences uses 486 GiB, which may be far too much to handle depending on the next step in the pipeline.
Prefiltering takes a peek at the labels and determines which features have low variance and will therefore be useless to the model when training time comes. This filters the matrix from 72,504,712 features down to 10 million.
This has the advantage of using ~14% of the RAM, but the disadvantage of needing to store one copy of the matrix per attribute. In other words, instead of a single 486 GiB matrix, you get 15 matrices of 67 GiB each.
Summary: Prefiltering uses 2x more disk space, but 7x less RAM.
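The variance-based filtering idea can be sketched as below. This keeps only the highest-variance feature columns; acheron's actual criterion and cutoff may differ, and the function name here is hypothetical:

```python
import numpy as np

def prefilter_top_variance(matrix, n_keep):
    """Keep the n_keep feature columns with the highest variance.

    Constant (zero-variance) features carry no signal for the model,
    so they are the first to be dropped.
    """
    variances = matrix.var(axis=0)
    keep = np.argsort(variances)[::-1][:n_keep]
    # Preserve the original column order among the kept features.
    return matrix[:, np.sort(keep)]

# Toy feature matrix: 3 samples x 3 features; column 0 is constant.
X = np.array([[0, 1, 5],
              [0, 2, 9],
              [0, 3, 1]])
Xf = prefilter_top_variance(X, 2)  # drops the constant column
```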