Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle case where no samples exceed min_sample_reads loci parameter #3

Open
jackscanlan opened this issue Jul 16, 2024 · 2 comments
Open
Labels
bug Something isn't working

Comments

@jackscanlan
Copy link
Collaborator

Why this matters

  • Users could accidentally set min_sample_reads too high and need to know this is why the run fails
  • Could occur when doing extreme subsampling, as it is more likely very small numbers of samples will all fail the filter

Ideal behaviour

Would be good if PHYLOSEQ_FILTER still produces the expected output files when all samples are removed from a locus, but they're just empty files

Detail

Error currently occurs when params.subsample = 1 using test_data/dual (-profile test), as min_sample_reads = 1000:

ERROR ~ Error executing process > 'FREYR:PHYLOSEQ_FILTER (fwhF2-fwhR2nDac)'

Caused by:
  Process `FREYR:PHYLOSEQ_FILTER (fwhF2-fwhR2nDac)` terminated with an error exit status (1)

Command executed:

  #!/usr/bin/env Rscript
  
  ### defining Nextflow environment variables as R variables
  ## input channel variables
  pcr_primers =           "fwhF2-fwhR2nDac"
  ps =                    "ps_unfiltered_fwhF2-fwhR2nDac.rds"
  target_kingdom =        "Metazoa"
  target_phylum =         "Arthropoda"
  target_class =          "Insecta"
  target_order =          "Diptera"
  target_family =         "NA"
  target_genus =          "NA"
  target_species =        "NA"
  min_sample_reads =      "1000"
  min_taxa_reads =        "NA"
  min_taxa_ra =           "1e-4"
  
  ## global variables
  projectDir = "/group/pathogens/IAWS/Personal/JackS/dev/freyr"
  params_dict = "[help:null, data_folder:test_data/dual, refdir:reference, samplesheet:test_data/dual/samplesheet_read_dir.csv, loci_params:test_data/dual/loci_params.csv, extension:null, illumina:true, pacbio:false, nanopore:false, paired:true, high_sensitivity:true, primer_n_trim:false, threads:null, rdata:true, subsample:1, slurm_account:pathogens, max_memory:2.GB, max_cpus:1, max_time:10.m]"
  
  tryCatch({
  ### source functions and themes, load packages, and import Nextflow params
  ### from "bin/process_start.R"
  sys.source("/group/pathogens/IAWS/Personal/JackS/dev/freyr/bin/process_start.R", envir = .GlobalEnv)
  
  ### run module code
  sys.source(
      "/group/pathogens/IAWS/Personal/JackS/dev/freyr/bin/phyloseq_filter.R", # run script
      envir = .GlobalEnv # this allows import of existing objects like projectDir
  )
  }, finally = {
  ### save R environment for debugging
  if ("true" == "true") { save.image(file = "FREYR:PHYLOSEQ_FILTER_2.rda") } 
  })

Command exit status:
  1

Command output:
  
  2024-07-16T15:22:30 Pulling Image: docker:jackscanlan/piperline-multi:0.0.1, status: READY
  2a3f21b84278741a6d7226093a69f1463064515cf6d5dbdd3fe172039b8b6c05
  2a3f21b84278741a6d7226093a69f1463064515cf6d5dbdd3fe172039b8b6c05
  [1] "Input variable 'projectDir' = /group/pathogens/IAWS/Personal/JackS/dev/freyr"
  [1] "Input variable 'pcr_primers' = fwhF2-fwhR2nDac"
  [1] "Input variable 'ps' = ps_unfiltered_fwhF2-fwhR2nDac.rds"
  [1] "Input variable 'target_kingdom' = Metazoa"
  [1] "Input variable 'target_phylum' = Arthropoda"
  [1] "Input variable 'target_class' = Insecta"
  [1] "Input variable 'target_order' = Diptera"
  [1] "Input variable 'target_family' = NA"
  [1] "Input variable 'target_genus' = NA"
  [1] "Input variable 'target_species' = NA"
  [1] "Input variable 'min_sample_reads' = 1000"
  [1] "Input variable 'min_taxa_reads' = NA"
  [1] "Input variable 'min_taxa_ra' = 1e-4"

Command error:
  Loading required package: BiocGenerics
  
  Attaching package: ‘BiocGenerics’
  
  The following objects are masked from ‘package:stats’:
  
      IQR, mad, sd, var, xtabs
  
  The following objects are masked from ‘package:base’:
  
      Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
      as.data.frame, basename, cbind, colnames, dirname, do.call,
      duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
      lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
      pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
      tapply, union, unique, unsplit, which.max, which.min
  
  Loading required package: S4Vectors
  Loading required package: stats4
  
  Attaching package: ‘S4Vectors’
  
  The following objects are masked from ‘package:base’:
  
      I, expand.grid, unname
  
  Loading required package: IRanges
  Loading required package: XVector
  Loading required package: GenomeInfoDb
  Error in validObject(.Object) : invalid class “otu_table” object: 
   OTU abundance data must have non-zero dimensions.
  Calls: tryCatch ... .nextMethod -> callNextMethod -> .nextMethod -> validObject
  In addition: There were 23 warnings (use warnings() to see them)
  Execution halted

Work dir:
  /group/pathogens/IAWS/Personal/JackS/dev/freyr/work/31/d78d5fc68ccb75f471371649ac5a1d

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

This occurs because the following code does not produce an "empty" phyloseq object when pruning out all samples, instead returning an error:

ps1 %>%
        phyloseq::prune_samples(phyloseq::sample_sums(.)>=min_sample_reads, .)

Have added a more informative error to phyloseq_filter.R, but this doesn't really solve the issue (ie. produce empty outputs and allow the run to continue):

if(min_sample_reads > 0){
    if (all(phyloseq::sample_sums(ps1)>=min_sample_reads) == FALSE) {
        stop(paste0("ERROR: No samples contained reads above the minimum threshold of ", min_sample_reads, " -- consider lowering this value"))
        }
...
}

To-do:

Need a way to either:

  1. produce empty phyloseq objects when pruning, which could be used to produce empty output files, or
  2. produce empty output files directly
@jackscanlan jackscanlan added the bug Something isn't working label Jul 16, 2024
@alexpiper
Copy link

There was a little bug in the error handling you added which caused it to stop when any samples were below min_sample_reads, addressed in cf917fb

@alexpiper
Copy link

Not sure there will be a way to generate an empty phyloseq object, with the way phyloseq::otu_table() is set up.

It may be worth just using base R to filter the matrices, rather than relying on phyloseq. The guts of the phyloseq function are doing this anyway: https://github.com/joey711/phyloseq/blob/master/R/transform_filter-methods.R

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants