In the previous week we learned about how to do basic terminal commands like:
cd
,mkdir
head
,tail
,cat
,less
,grep
touch
,rm
,mv
,cp
wc
,du
history
nano
This week we will cover how to use them, and others, to do sequence specific analysis.
There are two common file formats for storing sequence data.
- fasta
- fastq
FASTA contains only sequence information:
$ head data/multiplexed.short.fa
>3024
ACCTAATATATAGTCTAGCGCTTTACGGAAGACAATGTATGTATTTCGGTTCCTGGAGAA
ACTATTGCATCTATTGCATAGGTAATCTTGCACGTCGCATCCCCGGTTCATTTTCTGCGT
TTCCATCTTGCACTTCAATAGCATATCTTTGTTAATCAG
>D00468:259:HYKTMBCX2:1:1101:10173:18877
ACCTTTATTTAGATTTGCTAAAGCGCGTCTTATAGTAATCGTGCAGTAAATTTTGTAATT
CAGTTCCAGTTCACTTTATCATGATAGTGTTTTATCAGTTTCGAATTTTTGGCACACGAA
AGCTTGTTTTCCTTTATAATTCCAG
>4100
ACCTAGCATACACTAACCAGAATCGGTAAAAAGACGGAAAAGAGATCATATAACCTCGAC
Fastq files also contain quality scores.
$ head data/multiplexed.fq
@3024
ACCTAATATATAGTCTAGCGCTTTACGGAAGACAATGTATGTATTTCGGTTCCTGGAGAAACTATTGCATCTATTGCATAGGTAATCTTGCACGTCGCATCCCCGGTTCATTTTCTGCGTTTCCATCTTGCACTTCAATAGCATATCTTTGTTAATCAG
+
]]]]CCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CC;CC;CCCCCCCCC;;]]]]
@D00468:259:HYKTMBCX2:1:1101:10173:18877
ACCTTTATTTAGATTTGCTAAAGCGCGTCTTATAGTAATCGTGCAGTAAATTTTGTAATTCAGTTCCAGTTCACTTTATCATGATAGTGTTTTATCAGTTTCGAATTTTTGGCACACGAAAGCTTGTTTTCCTTTATAATTCCAG
+
]]]]GGGGGGGIIIIIIIIGIIIIIIIGGGIIIIIIIIIIGIIIIGIGIIGGGIIIGGIIGIGGGGGIGIIGIIIIGGIIIIGGGGIIIGGIIIIGIIGIGIIIIGIIGGGGGIIIIIIIIIGIIGGIIIGGGIIGIGGII]]]]
@4100
ACCTAGCATACACTAACCAGAATCGGTAAAAAGACGGAAAAGAGATCATATAACCTCGACAGGTTCTTCTTTATATGAGATGGAAATAAACTTATCTAATAGAATTTATAGGAAAAGAACAGAGGCAACGTTAAACGCATCGAAATGTAAAGTATTCAG
What do you notice about the two different filetypes?
Links
There are plenty of ways to count the number of reads in a file. We'll practice a few of them:
grep
fasta files for lines that start with>
and then count the lines.- Counting the number of lines in a fastq and dividing by 4.
grep
fastq files for lines that are only '+'
It is important to learn and practice using native terminal commands. However, you'll eventually run into a wall in how complex you can make something.
On the server, I've installed seqkit
, a cli tool for interacting with fastx files.
seqkit -h
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Version: 2.4.0
Author: Wei Shen <[email protected]>
Documents : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962
Seqkit utlizies the pgzip (https://github.com/klauspost/pgzip) package to
read and write gzip file, and the outputted gzip file would be slighty
larger than files generated by GNU gzip.
Seqkit writes gzip files very fast, much faster than the multi-threaded pigz,
therefore there is no need to pipe the result to gzip/pigz.
Seqkit also supports reading and writing xz (.xz) and zstd (.zst) formats since v2.2.0.
Bzip2 format is supported since v2.4.0.
Compression level:
format range default comment
gzip 1-9 5 https://github.com/klauspost/pgzip sets 5 as the default value.
xz NA NA https://github.com/ulikunitz/xz does not support.
zstd 1-4 2 roughly equals to zstd 1, 3, 7, 11, respectively.
bzip 1-9 6 https://github.com/dsnet/compress
Usage:
seqkit [command]
Available Commands:
amplicon extract amplicon (or specific region around it) via primer(s)
bam monitoring and online histograms of BAM record features
common find common sequences of multiple files by id/name/sequence
concat concatenate sequences with the same ID from multiple files
convert convert FASTQ quality encoding between Sanger, Solexa and Illumina
duplicate duplicate sequences N times
fa2fq retrieve corresponding FASTQ records by a FASTA file
faidx create FASTA index file and extract subsequence
fish look for short sequences in larger sequences using local alignment
fq2fa convert FASTQ to FASTA
fx2tab convert FASTA/Q to tabular format (and length, GC content, average quality...)
genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed
head print first N FASTA/Q records
head-genome print sequences of the first genome with common prefixes in name
locate locate subsequences/motifs, mismatch allowed
mutate edit sequence (point mutation, insertion, deletion)
pair match up paired-end reads from two fastq files
range print FASTA/Q records in a range (start:end)
rename rename duplicated IDs
replace replace name/sequence by regular expression
restart reset start position for circular genome
rmdup remove duplicated sequences by ID/name/sequence
sample sample sequences by number or proportion
sana sanitize broken single line FASTQ files
scat real time recursive concatenation and streaming of fastx files
seq transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
shuffle shuffle sequences
sliding extract subsequences in sliding windows
sort sort sequences by id/name/sequence/length
split split sequences into files by id/seq region/size/parts (mainly for FASTA)
split2 split sequences into files by size/parts (FASTA, PE/SE FASTQ)
stats simple statistics of FASTA/Q files
subseq get subsequences by region/gtf/bed, including flanking sequences
sum compute message digest for all sequences in FASTA/Q files
tab2fx convert tabular format to FASTA/Q format
translate translate DNA/RNA to protein sequence (supporting ambiguous bases)
version print version information and check for update
watch monitoring and online histograms of sequence features
Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on
which seqkit guesses the sequence type (0 for whole seq)
(default 10000)
--compress-level int compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
for the range and default value for each format (default -1)
-h, --help help for seqkit
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it
automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable
SEQKIT_THREADS) (default 4)
We'll only scratch the surface of what seqkit
can do.
Getting basic information about sequence files is trivial.
seqkit stats -h
simple statistics of FASTA/Q files
Columns:
1. file input file, "-" for STDIN
2. format FASTA or FASTQ
3. type DNA, RNA, Protein or Unlimit
4. num_seqs number of sequences
5. sum_len number of bases or residues , with gaps or spaces counted
6. min_len minimal sequence length , with gaps or spaces counted
7. avg_len average sequence length , with gaps or spaces counted
8. max_len miximal sequence length , with gaps or spaces counted
9. Q1 first quartile of sequence length , with gaps or spaces counted
10. Q2 median of sequence length , with gaps or spaces counted
11. Q3 third quartile of sequence length , with gaps or spaces counted
12. sum_gap number of gaps
13. N50 N50. https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50
14. Q20(%) percentage of bases with the quality score greater than 20
15. Q30(%) percentage of bases with the quality score greater than 30
16. GC(%) percentage of GC content
Attentions:
1. Sequence length metrics (sum_len, min_len, avg_len, max_len, Q1, Q2, Q3)
count the number of gaps or spaces. You can remove them with "seqkit seq -g":
seqkit seq -g input.fasta | seqkit stats
Tips:
1. For lots of small files (especially on SDD), use big value of '-j' to
parallelize counting.
2. Extract one metric with csvtk (https://github.com/shenwei356/csvtk):
seqkit stats -Ta input.fastq.gz | csvtk cut -t -f "Q30(%)" | csvtk del-header
Usage:
seqkit stats [flags]
Aliases:
stats, stat
Flags:
-a, --all all statistics, including quartiles of seq length, sum_gap, N50
-b, --basename only output basename of files
-E, --fq-encoding string fastq quality encoding. available values: 'sanger', 'solexa',
'illumina-1.3+', 'illumina-1.5+', 'illumina-1.8+'. (default "sanger")
-G, --gap-letters string gap letters (default "- .")
-h, --help help for stats
-e, --skip-err skip error, only show warning message
-i, --stdin-label string label for replacing default "-" for stdin (default "-")
-T, --tabular output in machine-friendly tabular format
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on
which seqkit guesses the sequence type (0 for whole seq)
(default 10000)
--compress-level int compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
for the range and default value for each format (default -1)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it
automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable
SEQKIT_THREADS) (default 4)
Running it on our files.
seqkit stats data/multiplexed.fq data/multiplexed.short.fa
file format type num_seqs sum_len min_len avg_len max_len
data/multiplexed.fq FASTQ DNA 1,900 264,787 43 139.4 159
data/multiplexed.short.fa FASTA DNA 10 1,380 66 138 159
This is useful for filtering.
seqkit grep -h
search sequences by ID/name/sequence/sequence motifs, mismatch allowed
Attentions:
0. By default, we match sequence ID with patterns, use "-n/--by-name"
for matching full name instead of just ID.
1. Unlike POSIX/GNU grep, we compare the pattern to the whole target
(ID/full header) by default. Please switch "-r/--use-regexp" on
for partly matching.
2. When searching by sequences, its partly matching, and both positive
and negative strands are searched.
Mismatch is allowed using flag "-m/--max-mismatch", you can increase
the value of "-j/--threads" to accelerate processing.
3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
But do not use degenerate bases/residues in regular expression, you need
convert them to regular expression, e.g., change "N" or "X" to ".".
4. When providing search patterns (motifs) via flag '-p',
please use double quotation marks for patterns containing comma,
e.g., -p '"A{2,}"' or -p "\"A{2,}\"". Because the command line argument
parser accepts comma-separated-values (CSV) for multiple values (motifs).
Patterns in file do not follow this rule.
5. The order of sequences in result is consistent with that in original
file, not the order of the query patterns.
But for FASTA file, you can use:
seqkit faidx seqs.fasta --infile-list IDs.txt
6. For multiple patterns, you can either set "-p" multiple times, i.e.,
-p pattern1 -p pattern2, or give a file of patterns via "-f/--pattern-file".
You can specify the sequence region for searching with the flag -R (--region).
The definition of region is 1-based and with some custom design.
Examples:
1-based index 1 2 3 4 5 6 7 8 9 10
negative index 0-9-8-7-6-5-4-3-2-1
seq A C G T N a c g t n
1:1 A
2:4 C G T
-4:-2 c g t
-4:-1 c g t n
-1:-1 n
2:-2 C G T N a c g t
1:-1 A C G T N a c g t n
1:12 A C G T N a c g t n
-12:-1 A C G T N a c g t n
Usage:
seqkit grep [flags]
Flags:
-n, --by-name match by full name instead of just ID
-s, --by-seq search subseq on seq, both positive and negative strand are searched, and
mismatch allowed using flag -m/--max-mismatch
-c, --circular circular genome
-C, --count just print a count of matching records. with the -v/--invert-match flag,
count non-matching records
-d, --degenerate pattern/motif contains degenerate base
--delete-matched delete a pattern right after being matched, this keeps the firstly
matched data and speedups when using regular expressions
-h, --help help for grep
-i, --ignore-case ignore case
-I, --immediate-output print output immediately, do not use write buffer
-v, --invert-match invert the sense of matching, to select non-matching records
-m, --max-mismatch int max mismatch when matching by seq. For large genomes like human genome,
using mapping/alignment tools would be faster
-P, --only-positive-strand only search on positive strand
-p, --pattern strings search pattern (multiple values supported. Attention: use double
quotation marks for patterns containing comma, e.g., -p '"A{2,}"')
-f, --pattern-file string pattern file (one record per line)
-R, --region string specify sequence region for searching. e.g 1:12 for first 12 bases,
-12:-1 for last 12 bases
-r, --use-regexp patterns are regular expression
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on
which seqkit guesses the sequence type (0 for whole seq)
(default 10000)
--compress-level int compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
for the range and default value for each format (default -1)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it
automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable
SEQKIT_THREADS) (default 4)
How could we use this to solve our demultiplexing problem?
$ seqkit grep -C -s -p 'ACCT' data/multiplexed.fq
Would that work? Why or why not?
Discuss pipes for streaming data between commands.
$ seqkit gep -s -p 'ACCT' data/multiplexed.fq | seqkit stats
file format type num_seqs sum_len min_len avg_len max_len
- FASTQ DNA 1,855 258,359 43 139.3 159