Handle case where no samples exceed `min_sample_reads` loci parameter #3

jackscanlan · 2024-07-16T06:12:05Z

Why this matters

Users could accidentally set min_sample_reads too high and need to know this is why the run fails
Could occur when doing extreme subsampling, as it is more likely very small numbers of samples will all fail the filter

Ideal behaviour

Would be good if PHYLOSEQ_FILTER still produces the expected output files when all samples are removed from a locus, but they're just empty files

Detail

Error currently occurs when params.subsample = 1 using test_data/dual (-profile test), as min_sample_reads = 1000:

ERROR ~ Error executing process > 'FREYR:PHYLOSEQ_FILTER (fwhF2-fwhR2nDac)'

Caused by:
  Process `FREYR:PHYLOSEQ_FILTER (fwhF2-fwhR2nDac)` terminated with an error exit status (1)

Command executed:

  #!/usr/bin/env Rscript
  
  ### defining Nextflow environment variables as R variables
  ## input channel variables
  pcr_primers =           "fwhF2-fwhR2nDac"
  ps =                    "ps_unfiltered_fwhF2-fwhR2nDac.rds"
  target_kingdom =        "Metazoa"
  target_phylum =         "Arthropoda"
  target_class =          "Insecta"
  target_order =          "Diptera"
  target_family =         "NA"
  target_genus =          "NA"
  target_species =        "NA"
  min_sample_reads =      "1000"
  min_taxa_reads =        "NA"
  min_taxa_ra =           "1e-4"
  
  ## global variables
  projectDir = "/group/pathogens/IAWS/Personal/JackS/dev/freyr"
  params_dict = "[help:null, data_folder:test_data/dual, refdir:reference, samplesheet:test_data/dual/samplesheet_read_dir.csv, loci_params:test_data/dual/loci_params.csv, extension:null, illumina:true, pacbio:false, nanopore:false, paired:true, high_sensitivity:true, primer_n_trim:false, threads:null, rdata:true, subsample:1, slurm_account:pathogens, max_memory:2.GB, max_cpus:1, max_time:10.m]"
  
  tryCatch({
  ### source functions and themes, load packages, and import Nextflow params
  ### from "bin/process_start.R"
  sys.source("/group/pathogens/IAWS/Personal/JackS/dev/freyr/bin/process_start.R", envir = .GlobalEnv)
  
  ### run module code
  sys.source(
      "/group/pathogens/IAWS/Personal/JackS/dev/freyr/bin/phyloseq_filter.R", # run script
      envir = .GlobalEnv # this allows import of existing objects like projectDir
  )
  }, finally = {
  ### save R environment for debugging
  if ("true" == "true") { save.image(file = "FREYR:PHYLOSEQ_FILTER_2.rda") } 
  })

Command exit status:
  1

Command output:
  
  2024-07-16T15:22:30 Pulling Image: docker:jackscanlan/piperline-multi:0.0.1, status: READY
  2a3f21b84278741a6d7226093a69f1463064515cf6d5dbdd3fe172039b8b6c05
  2a3f21b84278741a6d7226093a69f1463064515cf6d5dbdd3fe172039b8b6c05
  [1] "Input variable 'projectDir' = /group/pathogens/IAWS/Personal/JackS/dev/freyr"
  [1] "Input variable 'pcr_primers' = fwhF2-fwhR2nDac"
  [1] "Input variable 'ps' = ps_unfiltered_fwhF2-fwhR2nDac.rds"
  [1] "Input variable 'target_kingdom' = Metazoa"
  [1] "Input variable 'target_phylum' = Arthropoda"
  [1] "Input variable 'target_class' = Insecta"
  [1] "Input variable 'target_order' = Diptera"
  [1] "Input variable 'target_family' = NA"
  [1] "Input variable 'target_genus' = NA"
  [1] "Input variable 'target_species' = NA"
  [1] "Input variable 'min_sample_reads' = 1000"
  [1] "Input variable 'min_taxa_reads' = NA"
  [1] "Input variable 'min_taxa_ra' = 1e-4"

Command error:
  Loading required package: BiocGenerics
  
  Attaching package: ‘BiocGenerics’
  
  The following objects are masked from ‘package:stats’:
  
      IQR, mad, sd, var, xtabs
  
  The following objects are masked from ‘package:base’:
  
      Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
      as.data.frame, basename, cbind, colnames, dirname, do.call,
      duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
      lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
      pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
      tapply, union, unique, unsplit, which.max, which.min
  
  Loading required package: S4Vectors
  Loading required package: stats4
  
  Attaching package: ‘S4Vectors’
  
  The following objects are masked from ‘package:base’:
  
      I, expand.grid, unname
  
  Loading required package: IRanges
  Loading required package: XVector
  Loading required package: GenomeInfoDb
  Error in validObject(.Object) : invalid class “otu_table” object: 
   OTU abundance data must have non-zero dimensions.
  Calls: tryCatch ... .nextMethod -> callNextMethod -> .nextMethod -> validObject
  In addition: There were 23 warnings (use warnings() to see them)
  Execution halted

Work dir:
  /group/pathogens/IAWS/Personal/JackS/dev/freyr/work/31/d78d5fc68ccb75f471371649ac5a1d

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

This occurs because the following code does not produce an "empty" phyloseq object when pruning out all samples, instead returning an error:

ps1 %>%
        phyloseq::prune_samples(phyloseq::sample_sums(.)>=min_sample_reads, .)

Have added a more informative error to phyloseq_filter.R, but this doesn't really solve the issue (ie. produce empty outputs and allow the run to continue):

if(min_sample_reads > 0){
    if (all(phyloseq::sample_sums(ps1)>=min_sample_reads) == FALSE) {
        stop(paste0("ERROR: No samples contained reads above the minimum threshold of ", min_sample_reads, " -- consider lowering this value"))
        }
...
}

To-do:

Need a way to either:

produce empty phyloseq objects when pruning, which could be used to produce empty output files, or
produce empty output files directly

The text was updated successfully, but these errors were encountered:

alexpiper · 2024-07-18T04:45:13Z

There was a little bug in the error handling you added which caused it to stop when any samples were below min_sample_reads, addressed in cf917fb

alexpiper · 2024-07-18T05:43:23Z

Not sure there will be a way to generate an empty phyloseq object, with the way phyloseq::otu_table() is set up.

It may be worth just using base R to filter the matrices, rather than relying on phyloseq. The guts of the phyloseq function are doing this anyway: https://github.com/joey711/phyloseq/blob/master/R/transform_filter-methods.R

jackscanlan added the bug Something isn't working label Jul 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle case where no samples exceed `min_sample_reads` loci parameter #3

Handle case where no samples exceed `min_sample_reads` loci parameter #3

jackscanlan commented Jul 16, 2024

alexpiper commented Jul 18, 2024

alexpiper commented Jul 18, 2024

Handle case where no samples exceed min_sample_reads loci parameter #3

Handle case where no samples exceed min_sample_reads loci parameter #3

Comments

jackscanlan commented Jul 16, 2024

Why this matters

Ideal behaviour

Detail

To-do:

alexpiper commented Jul 18, 2024

alexpiper commented Jul 18, 2024

Handle case where no samples exceed `min_sample_reads` loci parameter #3

Handle case where no samples exceed `min_sample_reads` loci parameter #3