features
usage: acheron build feature [-h] [--dry_run] [-c CORES] [-o OUT] -d DATASET -t {kmer,genes,abricate,omnilog} [-k KMER_LENGTH] [-db {AMR,VF}]
Currently supported types are kmer, genes, abricate, and omnilog.
For k-mers of length 11 or shorter, all 4^k possible strings of length k are computed, and this master list is used to populate a pandas dataframe. For k-mers of length 13 or longer, acheron instead scans the sequences for the k-mers that are actually seen and uses that list; this takes more time and can be extremely memory expensive. Tasks requiring more than 1 TB of RAM are distributed over multiple nodes using the Slurm workload manager, for users on supercomputing clusters. Please avoid even k-mer lengths: an even-length k-mer can be its own reverse complement, and some programs will not allow such lengths because palindromic k-mers cause self loops.
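The two strategies above can be sketched as follows. This is an illustration of the idea only, not acheron's actual code: for small k, enumerating all 4^k strings is cheap; for large k, only the k-mers actually present in the data are collected.

```python
from itertools import product

def all_kmers(k):
    """Enumerate every possible k-mer (4**k strings) -- feasible for k <= 11."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

def observed_kmers(sequences, k):
    """Scan sequences and collect only the k-mers actually seen (k >= 13)."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen

# For k = 11 there are 4**11 = 4,194,304 possible k-mers, so a full
# enumeration is practical; for k = 31 there are 4**31 (~4.6e18), so
# only the observed set is tractable.
```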
11-mer Example: acheron build feature -c 16 -t kmer -k 11 -d dataset_name_1
31-mer Example: acheron build feature -c 8 -t kmer -k 31 -d dataset_name_1
There is also support for slurm (supercomputing cluster), but it has only been tested on the Public Health Agency of Canada's cluster.
11-mer Example: acheron build feature -c 144 -t kmer -k 11 -d dataset_name_1 --cluster slurm
31-mer Example: acheron build feature -c 16 -t kmer -k 31 -d dataset_name_1 --cluster slurm
Using more cores means more work happens concurrently, and therefore more RAM is used. If you get out-of-RAM errors, lower the -c (number of cores) value.
The genes type uses all genes, encoded as presence/absence.
Example: acheron build feature -t genes -d dataset_name_1
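A presence/absence encoding like the one described above can be sketched as follows. The sample and gene names here are hypothetical, and this is not acheron's internal representation:

```python
import pandas as pd

# Hypothetical input: the set of genes detected in each sample.
genes_per_sample = {
    "sample_1": {"geneA", "geneB"},
    "sample_2": {"geneB", "geneC"},
}

all_genes = sorted(set().union(*genes_per_sample.values()))

# Rows = samples, columns = genes, cells = 1 (present) or 0 (absent).
matrix = pd.DataFrame(
    [[1 if g in genes else 0 for g in all_genes]
     for genes in genes_per_sample.values()],
    index=list(genes_per_sample),
    columns=all_genes,
)
```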
The abricate type uses AMR or VF genes only; you must specify which with -db.
Example: acheron build feature -t abricate -db AMR -d dataset_name_1
Requires the data to be structured as a table (csv, excel, or pandas dataframe) with columns labeled by substrate, rows labeled by sample, and each cell containing the area under the curve. The table must be placed in data/dataset_name_1/omnilog/omnilog.{ext}.
Example: acheron build feature -t omnilog -d dataset_name_1
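A minimal sketch of a table shaped the way the omnilog type expects it, with substrate columns, sample rows, and area-under-the-curve cells. The substrate names, sample names, and values are made up for illustration:

```python
import pandas as pd

# Hypothetical OmniLog table: rows are samples, columns are substrates,
# and each cell is the area under the respiration curve.
data = pd.DataFrame(
    {"D-Glucose": [1031.5, 887.2], "Maltose": [640.0, 712.3]},
    index=["sample_1", "sample_2"],
)

# Saved to the expected location, e.g.:
# data.to_csv("data/dataset_name_1/omnilog/omnilog.csv")
```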
Prefiltering is a process where k-mers are filtered as they are read, before being placed into a matrix, to save RAM. For example, a 31-mer matrix of 6,328 sequences uses 486 GiB, which may be far too much to handle depending on the next step in the pipeline.
Prefiltering takes a peek at the labels and determines which features have low variance and will therefore be useless to the model when training time comes. This filters the matrix from 72,504,712 features down to 10 million.
This has the advantage of using ~14% of the RAM, but the disadvantage of needing to store one copy of the matrix per attribute. In other words, instead of a single 486 GiB matrix, you get 15 matrices of 67 GiB each.
Summary: Prefiltering uses 2x more disk space, but 7x less RAM.
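The variance-based filtering idea can be sketched as below. This keeps only the highest-variance feature columns; acheron's actual criterion and cutoff may differ, and the function name here is hypothetical:

```python
import numpy as np

def prefilter_top_variance(matrix, n_keep):
    """Keep the n_keep feature columns with the highest variance.

    Constant (zero-variance) features carry no signal for the model,
    so they are the first to be dropped.
    """
    variances = matrix.var(axis=0)
    keep = np.argsort(variances)[::-1][:n_keep]
    # Preserve the original column order among the kept features.
    return matrix[:, np.sort(keep)]

# Toy feature matrix: 3 samples x 3 features; column 0 is constant.
X = np.array([[0, 1, 5],
              [0, 2, 9],
              [0, 3, 1]])
Xf = prefilter_top_variance(X, 2)  # drops the constant column
```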