
Demultiplex Samples

Introduction

For this assignment I have multiplexed data from the Snowflake Yeast Dataset by adding 3' and 5' indexes and shuffling the reads together. Your task is to split them back into their individual sample files using seqkit tools.

Goals

Create a directory structure as follows:

A set of statistics about the multiplexed data.

solution/muxed.stats.tsv

Files demultiplexed by barcode.

solution/SRR23803536.fastq
solution/SRR23803537.fastq
solution/SRR23803538.fastq
solution/SRR23803539.fastq

Sequences after removing the barcode.

solution/SRR23803536.trimmed.fastq.gz
solution/SRR23803537.trimmed.fastq.gz
solution/SRR23803538.trimmed.fastq.gz
solution/SRR23803539.trimmed.fastq.gz

Summary of the demultiplexed data.

solution/demux.stats.tsv

Datasets

  • data/multiplexed.fq - A fastq file containing a subset of reads.
  • data/sample_sheet.csv - A csv file indicating the 3' and 5' barcodes as they appear on the read.

Walkthrough

Using the dataset provided, complete the following tasks. Create a content/wk03/solution/notes_{user}.md file and take notes, copy commands, and answer questions within it.

Mux statistics

Use seqkit stats to generate basic statistics about the multiplexed data. Save the summary to solution/muxed.stats.tsv, and note the command and the number of reads before demultiplexing in your notes document.
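For orientation, the core numbers seqkit stats reports can be computed by hand. A minimal Python sketch, using toy inline reads rather than the real data/multiplexed.fq:

```python
# Two hypothetical FASTQ records standing in for the assignment data.
fastq = """@read1
ACGTACGTACGT
+
IIIIIIIIIIII
@read2
ACGTAC
+
IIIIII
""".splitlines()

# In a well-formed FASTQ file every 4th line starting at index 1 is a sequence.
seqs = fastq[1::4]
num_reads = len(seqs)
lengths = [len(s) for s in seqs]

# These correspond to seqkit stats' num_seqs, sum_len, min_len, max_len columns.
print(num_reads, sum(lengths), min(lengths), max(lengths))  # → 2 18 6 12
```

With seqkit itself, the -T flag produces tab-separated output, which is convenient for writing the .tsv file the assignment asks for.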

Demultiplex the samples

Using the data/sample_sheet.csv file as a sample sheet, demultiplex the reads in wk03/data/multiplexed.fq into individual files. The sample sheet defines the barcodes on the 3' and 5' ends as they appear on the sequence. Save the files to solution/{sample_name}.fastq (but do not commit them).

There are a number of strategies for doing this:

  • seqkit grep with a regular expression
  • Utilizing multiple seqkit grep statements
  • something else ...

Whichever approach you choose, the end result should be the same.

You should end with the following directory structure:

solution/SRR23803536.fastq
solution/SRR23803537.fastq
solution/SRR23803538.fastq
solution/SRR23803539.fastq

In your notes document include the commands you used and a description of how you figured out the commands.
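The matching logic itself is simple; here is a pure-Python sketch of it, independent of seqkit. The barcodes and reads below are placeholders, not the real contents of data/sample_sheet.csv:

```python
# Hypothetical (5' barcode, 3' barcode) pairs; the real ones come from the sample sheet.
sample_sheet = {
    "SRR23803536": ("AAGG", "TTCC"),
    "SRR23803537": ("CCAA", "GGTT"),
}

reads = [
    ("read1", "AAGGACGTACGTTTCC"),
    ("read2", "CCAAACGTACGTGGTT"),
    ("read3", "GGGGACGTACGTAAAA"),  # matches no sample
]

buckets = {name: [] for name in sample_sheet}
for read_id, seq in reads:
    for name, (bc5, bc3) in sample_sheet.items():
        # A read belongs to a sample when BOTH end barcodes match.
        if seq.startswith(bc5) and seq.endswith(bc3):
            buckets[name].append(read_id)
            break

print(buckets)
```

With seqkit, one equivalent strategy is one seqkit grep per sample, searching by sequence with a regex anchored at both ends, e.g. something like seqkit grep -s -r -p '^AAGG.*TTCC$' data/multiplexed.fq (again, the barcodes here are placeholders).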

Trim the barcodes

Now that we know which sample each read belongs to, the barcode is extraneous information. We should remove it from the read before downstream processing, or it may lead to spurious alignments.

Explore the seqkit toolbox through the website, the help files, Google searches, Biostars posts, etc. Then use seqkit to create gzipped fastq files that contain only the trimmed sequence.

You should end with the following directory structure:

solution/SRR23803536.trimmed.fastq.gz
solution/SRR23803537.trimmed.fastq.gz
solution/SRR23803538.trimmed.fastq.gz
solution/SRR23803539.trimmed.fastq.gz

In your notes document include the commands you used and a description of how you figured out the commands.
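Conceptually, trimming means slicing the same number of bases off the sequence and its quality string. A small sketch with assumed 4-base barcodes on each end (the real lengths come from the sample sheet):

```python
def trim_barcodes(seq: str, qual: str, bc5_len: int, bc3_len: int):
    """Drop the 5' and 3' barcodes from both the sequence and quality strings."""
    return seq[bc5_len:len(seq) - bc3_len], qual[bc5_len:len(qual) - bc3_len]

# Hypothetical read: AAGG and TTCC are the 4-base barcodes to remove.
seq, qual = trim_barcodes("AAGGACGTACGTTTCC", "IIIIJJJJJJJJIIII", 4, 4)
print(seq, qual)  # → ACGTACGT JJJJJJJJ
```

On the seqkit side, seqkit subseq with a region argument accepts negative coordinates counted from the 3' end, so a region like 5:-5 should keep everything except 4 bases on each side; adjust the coordinates to your actual barcode lengths.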

Demux statistics

Use seqkit stats to generate basic statistics about the demultiplexed and trimmed data. Save this summary as a tsv formatted file at the path solution/demux.stats.tsv.

In your notes document include the command you used, a markdown formatted version of the table, and a description of the results.

Reflection Questions

Write a brief (~1 paragraph) reflection on the results. Consider the following questions:

  • Are there an equal number of reads from each sample?
  • Are the read lengths the same between each sample?

Clean up

When you are done, check the files in the content/wk03 folder. Ensure that you haven't committed any fastq files or gz files into the repo. If you have, use git rm {path} to remove the file.

Project Task

Exploration

I want you to use the seqkit toolkit on your own files. Some ideas:

  • Use seqkit head to peek at them.
  • Use seqkit stats to count and quantify them.
  • Use seqkit grep to search for sequences that contain an 'interesting' sequence.
  • Use seqkit amplicon to do a computational PCR.

In your project readme discuss what you did and what you found. Include commands and outputs and where appropriate link to the biology.

Pack up your files

.fastq files are needlessly large and, paradoxically, often slower to read than their compressed equivalents. seqkit can write gzipped output directly, making files significantly smaller. Use the terminal documentation, the seqkit website FAQ, a web search, ChatGPT, etc. to figure out how to gzip your files. Use du -h to show the original and compressed files and their relative sizes, then delete the unpacked ones. Finally, write a brief markdown file in your repo directory describing the steps you took.
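To see why compression pays off, note that FASTQ sequence and quality lines use small alphabets and compress extremely well. A self-contained Python sketch using a synthetic repetitive payload (not real reads):

```python
import gzip
import os
import tempfile

# Hypothetical FASTQ-like payload; real reads also compress well because
# sequences draw from {A, C, G, T, N} and qualities from a narrow range.
payload = b"@read\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" * 5000

with tempfile.TemporaryDirectory() as d:
    raw_path = os.path.join(d, "reads.fastq")
    gz_path = raw_path + ".gz"
    with open(raw_path, "wb") as fh:
        fh.write(payload)
    with gzip.open(gz_path, "wb") as fh:
        fh.write(payload)
    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

print(raw_size, gz_size)  # the .gz file is dramatically smaller
```

With seqkit itself, giving the output file a .gz suffix (for example seqkit seq reads.fastq -o reads.fastq.gz) should produce gzipped output directly; du -h on both files then shows the size difference.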

If you already have .fastq.gz files, then you don't need to do anything.