Skip to content

Latest commit

 

History

History
380 lines (316 loc) · 18.2 KB

lecture.md

File metadata and controls

380 lines (316 loc) · 18.2 KB

Basic Command Line Sequence Analysis

Introduction

In the previous week we learned about how to do basic terminal commands like:

  • cd, mkdir
  • head, tail, cat, less, grep
  • touch, rm, mv, cp
  • wc, du
  • history
  • nano

This week we will cover how to use them, and others, to do sequence specific analysis.

Sequence data

There are two common file formats for storing sequence data.

  • fasta
  • fastq

FASTA contains only sequence information:

$ head data/multiplexed.short.fa 
>3024
ACCTAATATATAGTCTAGCGCTTTACGGAAGACAATGTATGTATTTCGGTTCCTGGAGAA
ACTATTGCATCTATTGCATAGGTAATCTTGCACGTCGCATCCCCGGTTCATTTTCTGCGT
TTCCATCTTGCACTTCAATAGCATATCTTTGTTAATCAG
>D00468:259:HYKTMBCX2:1:1101:10173:18877
ACCTTTATTTAGATTTGCTAAAGCGCGTCTTATAGTAATCGTGCAGTAAATTTTGTAATT
CAGTTCCAGTTCACTTTATCATGATAGTGTTTTATCAGTTTCGAATTTTTGGCACACGAA
AGCTTGTTTTCCTTTATAATTCCAG
>4100
ACCTAGCATACACTAACCAGAATCGGTAAAAAGACGGAAAAGAGATCATATAACCTCGAC

Fastq files also contain quality scores.

$ head data/multiplexed.fq
@3024
ACCTAATATATAGTCTAGCGCTTTACGGAAGACAATGTATGTATTTCGGTTCCTGGAGAAACTATTGCATCTATTGCATAGGTAATCTTGCACGTCGCATCCCCGGTTCATTTTCTGCGTTTCCATCTTGCACTTCAATAGCATATCTTTGTTAATCAG
+
]]]]CCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CC;CC;CCCCCCCCC;;]]]]
@D00468:259:HYKTMBCX2:1:1101:10173:18877
ACCTTTATTTAGATTTGCTAAAGCGCGTCTTATAGTAATCGTGCAGTAAATTTTGTAATTCAGTTCCAGTTCACTTTATCATGATAGTGTTTTATCAGTTTCGAATTTTTGGCACACGAAAGCTTGTTTTCCTTTATAATTCCAG
+
]]]]GGGGGGGIIIIIIIIGIIIIIIIGGGIIIIIIIIIIGIIIIGIGIIGGGIIIGGIIGIGGGGGIGIIGIIIIGGIIIIGGGGIIIGGIIIIGIIGIGIIIIGIIGGGGGIIIIIIIIIGIIGGIIIGGGIIGIGGII]]]]
@4100
ACCTAGCATACACTAACCAGAATCGGTAAAAAGACGGAAAAGAGATCATATAACCTCGACAGGTTCTTCTTTATATGAGATGGAAATAAACTTATCTAATAGAATTTATAGGAAAAGAACAGAGGCAACGTTAAACGCATCGAAATGTAAAGTATTCAG

What do you notice about the two different filetypes?

Links

Counting reads in a file

There are plenty of ways to count the number of reads in a file. We'll practice a few of them:

  • grep fasta files for lines that start with > and then count the lines.
  • Counting the number of lines in a fastq and dividing by 4.
  • grep fastq files for lines that are only '+'

Using a purpose built tool

It is important to learn and practice using native terminal commands. However, you'll eventually run into a wall in how complex you can make something.

On the server, I've installed seqkit, a cli tool for interacting with fastx files.

seqkit -h
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 2.4.0

Author: Wei Shen <[email protected]>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962


Seqkit utlizies the pgzip (https://github.com/klauspost/pgzip) package to
read and write gzip file, and the outputted gzip file would be slighty
larger than files generated by GNU gzip.

Seqkit writes gzip files very fast, much faster than the multi-threaded pigz,
therefore there is no need to pipe the result to gzip/pigz.

Seqkit also supports reading and writing xz (.xz) and zstd (.zst) formats since v2.2.0.
Bzip2 format is supported since v2.4.0.

Compression level:
  format   range   default  comment
  gzip     1-9     5        https://github.com/klauspost/pgzip sets 5 as the default value.
  xz       NA      NA       https://github.com/ulikunitz/xz does not support.
  zstd     1-4     2        roughly equals to zstd 1, 3, 7, 11, respectively.
  bzip     1-9     6        https://github.com/dsnet/compress

Usage:
  seqkit [command] 

Available Commands:
  amplicon        extract amplicon (or specific region around it) via primer(s)
  bam             monitoring and online histograms of BAM record features
  common          find common sequences of multiple files by id/name/sequence
  concat          concatenate sequences with the same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  fa2fq           retrieve corresponding FASTQ records by a FASTA file
  faidx           create FASTA index file and extract subsequence
  fish            look for short sequences in larger sequences using local alignment
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (and length, GC content, average quality...)
  genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head            print first N FASTA/Q records
  head-genome     print sequences of the first genome with common prefixes in name
  locate          locate subsequences/motifs, mismatch allowed
  mutate          edit sequence (point mutation, insertion, deletion)
  pair            match up paired-end reads from two fastq files
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by ID/name/sequence
  sample          sample sequences by number or proportion
  sana            sanitize broken single line FASTQ files
  scat            real time recursive concatenation and streaming of fastx files
  seq             transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
  shuffle         shuffle sequences
  sliding         extract subsequences in sliding windows
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  sum             compute message digest for all sequences in FASTA/Q files
  tab2fx          convert tabular format to FASTA/Q format
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  version         print version information and check for update
  watch           monitoring and online histograms of sequence features

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)

We'll only scratch the surface of what seqkit can do.

Seqkit stats

Getting basic information about sequence files is trivial.

seqkit stats -h
simple statistics of FASTA/Q files

Columns:

  1.  file      input file, "-" for STDIN
  2.  format    FASTA or FASTQ
  3.  type      DNA, RNA, Protein or Unlimit
  4.  num_seqs  number of sequences
  5.  sum_len   number of bases or residues       , with gaps or spaces counted
  6.  min_len   minimal sequence length           , with gaps or spaces counted
  7.  avg_len   average sequence length           , with gaps or spaces counted
  8.  max_len   miximal sequence length           , with gaps or spaces counted
  9.  Q1        first quartile of sequence length , with gaps or spaces counted
  10. Q2        median of sequence length         , with gaps or spaces counted
  11. Q3        third quartile of sequence length , with gaps or spaces counted
  12. sum_gap   number of gaps
  13. N50       N50. https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50
  14. Q20(%)    percentage of bases with the quality score greater than 20
  15. Q30(%)    percentage of bases with the quality score greater than 30
  16. GC(%)     percentage of GC content
  
Attentions:
  1. Sequence length metrics (sum_len, min_len, avg_len, max_len, Q1, Q2, Q3)
     count the number of gaps or spaces. You can remove them with "seqkit seq -g":
         seqkit seq -g input.fasta | seqkit stats

Tips:
  1. For lots of small files (especially on SDD), use big value of '-j' to
     parallelize counting.
  2. Extract one metric with csvtk (https://github.com/shenwei356/csvtk):
         seqkit stats -Ta input.fastq.gz | csvtk cut -t -f "Q30(%)" | csvtk del-header

Usage:
  seqkit stats [flags] 

Aliases:
  stats, stat

Flags:
  -a, --all                  all statistics, including quartiles of seq length, sum_gap, N50
  -b, --basename             only output basename of files
  -E, --fq-encoding string   fastq quality encoding. available values: 'sanger', 'solexa',
                             'illumina-1.3+', 'illumina-1.5+', 'illumina-1.8+'. (default "sanger")
  -G, --gap-letters string   gap letters (default "- .")
  -h, --help                 help for stats
  -e, --skip-err             skip error, only show warning message
  -i, --stdin-label string   label for replacing default "-" for stdin (default "-")
  -T, --tabular              output in machine-friendly tabular format

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)

Running it on our files.

seqkit stats data/multiplexed.fq data/multiplexed.short.fa 
file                       format  type  num_seqs  sum_len  min_len  avg_len  max_len
data/multiplexed.fq        FASTQ   DNA      1,900  264,787       43    139.4      159
data/multiplexed.short.fa  FASTA   DNA         10    1,380       66      138      159

Seqkit grep

This is useful for filtering.

seqkit grep -h
search sequences by ID/name/sequence/sequence motifs, mismatch allowed

Attentions:

  0. By default, we match sequence ID with patterns, use "-n/--by-name"
     for matching full name instead of just ID.
  1. Unlike POSIX/GNU grep, we compare the pattern to the whole target
     (ID/full header) by default. Please switch "-r/--use-regexp" on
     for partly matching.
  2. When searching by sequences, its partly matching, and both positive
     and negative strands are searched.
     Mismatch is allowed using flag "-m/--max-mismatch", you can increase
     the value of "-j/--threads" to accelerate processing.
  3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
     But do not use degenerate bases/residues in regular expression, you need
     convert them to regular expression, e.g., change "N" or "X"  to ".".
  4. When providing search patterns (motifs) via flag '-p',
     please use double quotation marks for patterns containing comma, 
     e.g., -p '"A{2,}"' or -p "\"A{2,}\"". Because the command line argument
     parser accepts comma-separated-values (CSV) for multiple values (motifs).
     Patterns in file do not follow this rule.
  5. The order of sequences in result is consistent with that in original
     file, not the order of the query patterns. 
     But for FASTA file, you can use:
        seqkit faidx seqs.fasta --infile-list IDs.txt
  6. For multiple patterns, you can either set "-p" multiple times, i.e.,
     -p pattern1 -p pattern2, or give a file of patterns via "-f/--pattern-file".

You can specify the sequence region for searching with the flag -R (--region).
The definition of region is 1-based and with some custom design.

Examples:

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n
          1:12    A C G T N a c g t n
        -12:-1    A C G T N a c g t n

Usage:
  seqkit grep [flags] 

Flags:
  -n, --by-name                match by full name instead of just ID
  -s, --by-seq                 search subseq on seq, both positive and negative strand are searched, and
                               mismatch allowed using flag -m/--max-mismatch
  -c, --circular               circular genome
  -C, --count                  just print a count of matching records. with the -v/--invert-match flag,
                               count non-matching records
  -d, --degenerate             pattern/motif contains degenerate base
      --delete-matched         delete a pattern right after being matched, this keeps the firstly
                               matched data and speedups when using regular expressions
  -h, --help                   help for grep
  -i, --ignore-case            ignore case
  -I, --immediate-output       print output immediately, do not use write buffer
  -v, --invert-match           invert the sense of matching, to select non-matching records
  -m, --max-mismatch int       max mismatch when matching by seq. For large genomes like human genome,
                               using mapping/alignment tools would be faster
  -P, --only-positive-strand   only search on positive strand
  -p, --pattern strings        search pattern (multiple values supported. Attention: use double
                               quotation marks for patterns containing comma, e.g., -p '"A{2,}"')
  -f, --pattern-file string    pattern file (one record per line)
  -R, --region string          specify sequence region for searching. e.g 1:12 for first 12 bases,
                               -12:-1 for last 12 bases
  -r, --use-regexp             patterns are regular expression

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)

How could we use this to solve our demultiplexing problem?

$ seqkit grep -C -s -p 'ACCT' data/multiplexed.fq

Would that work? Why or why not?

Discuss pipes for streaming data between commands.

$ seqkit gep -s -p 'ACCT' data/multiplexed.fq | seqkit stats
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTQ   DNA      1,855  258,359       43    139.3      159