instructions.Rmd

# Project description

In this project, you will work in groups to conduct gene-centric mapping of functional and phylogenetic anchors encoded in microbial genomes sourced from the Saanich Inlet water column. Saanich Inlet is a seasonally anoxic fjord on the coast of Vancouver Island British Columbia that serves as a model ecosystem for studying microbial community responses to ocean deoxygenation. 

Numerous studies from Saanich Inlet over the years have identified key microbial players mediating coupled biogeochemical cycling of carbon, nitrogen and sulfur extensible to expanding marine oxygen minimum zones throughout the global ocean. Here, you will use metagenomic and metatranscriptomic data sets generated from Cruise 72 spanning 7 depths in [Saanich Inlet](#the-saanich-inlet-data-set). In addition to fastq reads, the metagenomic data sets include assembled contigs, metagenome assembled genomes (MAGs) and single-cell amplified genomes (SAGs) spanning the genomic information hierarchy. 

The Cruise 72 metagenomic and metatranscriptomic data sets can be evaluated in relation to geochemical parameter information to better resolve both the metabolic potential and gene expression patterns of microbial communities inhabiting the Saanich Inlet water column. Each capstone project group will have the opportunity to use TreeSAPP and iToL to chart the abundance, expression and taxonomic diversity of functional and phylogenetic anchor genes represented in the current reference package collection. Groups can augment this collection with new reference packages produced during the TreeSAPP tutorial or at any time during the project phase depending on their research interests. Reference packages can also be updated using taxonomic information associated with the MAGs and SAGs described above.   

The underlying conceptual framework for this project is described in the three lectures associated with course Module 2 along with examples of data visualization approaches useful in developing a coherent scientific narrative. For a given gene encoding a biochemical transformation consider the extent to which this functionality is distributed within the community and how it is represented at different levels of biological information flow e.g. DNA versus RNA.    

## Guiding research questions

Several options are provided to help you develop your capstone project, although each group is free to come up with an original plan of action. Please consult with the teaching team if you have questions.

1.  Select your reference package(s) for analysis based on one of the following options:

    a.  Choose one or more pathways of a geochemical cycle e.g. nitrogen with reference packages already available for TreeSAPP (see table of available [TreeSAPP reference packages]). For example, you could select NapA, NirK, NirS, NorB, NorC and NosZ to investigate denitrification in the Saanich Inlet water column.

    b.  Create new reference packages to expand the TreeSAPP collection and analyze the results. You could select a pthway within a biogeocehmical cycle which does NOT have reference packages available covering the complete pathway e.g. sulfur cycle including sulfur oxidation and DMSP conversion, or a completely new pathway with no reference packages avaialble. 

    c.  Select all reference packages in the collection including those developed by the class during the treesapp create tutorial and map the results. Focus on evaluating the purity of each reference package and update as needed to reflect diversity of genes endemic to the Saanich Inlet water column.

2.  Perform a preliminary analysis using the Saanich Inlet data for all depths.

3.  Update reference packages as needed based on preliminary analysis.

4.  Look for patterns in gene abundance, expression and diversity as a function of water column geochemical parameter information.

### Questions for project with a focus on biogeochemical cycles (repeated in report structure below)

1. How does gene and transcript abundance vary across depths for each reference package for a given biogeochemical cycle?

2. How does microbial diversity differ among and between steps within a pathway e.g. denitrification? Are these trends similar for both genes and transcripts at different depths?

3. What is the taxonomic breakdown of genes and transcripts within a given biogeochemcial cycle? Can you discern evidence for horizontal gene transfer of one or more functional anchor genes?

4. Can you identify evidence for distributed metabolism within one or more pathways? Are these trends similar for both genes and transcripts at different depths?  

5. How do answers to questions 2 and 3 vary depending on the taxonomic rank used in the analysis?   

6. How does the abundance, expression and taxonomy of identified genes relate to water column geochemical parameter information (use the geochemical data in Saanich_Data.csv from our previous data science sessions)?

### Questions for project with a focus on using all reference packages

In addition to the questions provided above for biogeochemical cycles consider quality control metrics in your analysis.

1. How does the purity of each reference package impact your results? 

2. How much information is discarded due to insufficient taxonomic resolution? 

3. How much information is retained after updating reference packages with MAGs or SAGs?

4. What is the impact on diversity metrics following the use of updated reference packages?

## Resources

You will be provided with a script template for both the shell and R portion of your analysis (`treesapp_analysis.sh` and `treesapp_analysis.R`) that will guide you as you develop your code.

As stated in the project description, you will have access to metagenomic and metatranscriptomic data sets generated from Cruise 72 spanning 7 depths in [Saanich Inlet](#the-saanich-inlet-data-set), MAGs and SAGs.

## Your submission

Your final submission will consist of 3 separate files: the report itself (`docx` or `pdf`), one shell script `treesapp_analysis.sh`, and one R script `treesapp_analysis.R` (both script files must be in plain text format). The report should not contain any code, but should contain versions of software tools used and a high-level description of your workflow (i.e describe *what* was done and NOT *how*).

## Timeline

The following provides an outline as well as some specific milestones within the project.

```{r child = "child_Rmds/timeline.Rmd"}
```

## Reports

Reports should be formatted as per the [Instructions to Authors](https://jb.asm.org/sites/default/files/additional-assets/JB-ITA.pdf) for the [Journal of Bacteriology](https://jb.asm.org/).

Each group will submit **one** report with the sections below.

```{r child = "child_Rmds/report_structure.Rmd"}
```

<!-- ## Addendum -->

<!-- ### Abundance -->

<!-- The workflow introduced in the [treesapp abundance](#treesapp-abundance) tutorial was supposed to serve as the guide to generating TPM values for your capstone project. However, it has come to the attention of the teaching team that `treesapp abundance` fails to generate the simple_bar.txt files for viewing the distribution of TPM values across a reference package's phylogenetic tree. Therefore, we are providing you another workflow that will generate both the classifications file (marker_contig_map.tsv) with TPM values and all simplebar.txt files for iTOL. -->

<!-- #### Generate the JPlace with all classifications -->
<!-- With the concatenated metagenome FASTA file for SI072 (`MetaG_Assemblies/SI072_All_contigs.fa`; see [server data locations](#the-saanich-inlet-data-set#data-on-the-server)) run `treesapp assign` without FASTQ reads and abundance flags to generate a single jplace file containing all reference package homologs across all the depths. -->

<!-- #### Generate TPM values for metagenomes -->
<!-- Run `treesapp assign` seven times with abundance flags (\--rel_abund and \--metric TPM) and paths to metagenome FASTQ files to generate TPM values for each individual depth and the abundance files needed for iToL visualization. -->

<!-- This gives you a total of 8 `treesapp assign` runs with associated output files for the metagenome samples. -->

<!-- #### Generate TPM values for metatranscriptomes -->

<!-- Run `treesapp assign` with abundance flags once for each of the seven metatranscriptome samples, including their respective FASTQ file path in the command. This will give you another seven `treesapp assign` outputs for a total of 15 to complete the process for metagenome and metatranscriptome data sets. -->

<!-- #### Generate the file with all classifications and TPM values -->

<!-- Now concatenate the marker_contig_map.tsv files from the seven individual depth runs for analysis in `R`. -->

<!-- You will first need to write the header from *one* classification table to the combined table file (`SI072_combined_marker_contig_map.tsv` below), then write the classifications from all tables without the header. This is to avoid including the table's header multiple times partway through the table, which would confuse `R`. We're using `grep` to accomplish this by removing any lines that match a word in the header, 'Query'. -->

<!-- ```{bash, eval=F} -->
<!-- head -n 1 <path to treesapp assign output directory>/final_outputs/marker_contig_map.tsv >SI072_combined_marker_contig_map.tsv -->
<!-- cat <path to assign outputs>/*/final_outputs/marker_contig_map.tsv | grep -v Query >>SI072_combined_marker_contig_map.tsv -->
<!-- ``` -->

<!-- #### Viewing the placements and TPM bars in iTOL -->

<!-- For iToL visualization use the jplace file from `treesapp assign` with the concatenated metagenome FASTA to visualize the phylogenetic tree with all placements across depths, and use the <gene>_labels.txt file from the concatenated run to name the leaves. -->
<!-- To add the TPM abundance layers for each depth access the simple_bar.txt files from each individual depth run and drag and drop these files into the iToL GUI. -->