We will be working with the Drosophila melanogaster genome. You can start at flybase.org. Go to the most current download genomes section and download the gzipped fasta file for all chromosomes.
In order to download the data from the given website, we can use wget or curl.
$ wget -r -np -N ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/fasta/md5sum.txt & wget -r -np -N ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/fasta/dmel-all-chromosome-r6.24.fasta.gz
--2018-10-22 13:06:53-- ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/fasta/dmel-all-chromosome-r6.24.fasta.gz
=> "ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/fasta/dmel-all-chromosome-r6.24.fasta.gz"
==> CWD not required.
==> PASV ... unlink: No such file or directory
--2018-10-22 13:06:53-- ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/fasta/md5sum.txt
=> "ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/fasta/md5sum.txt"
==> CWD not required.
==> PASV ... done. ==> RETR dmel-all-chromosome-r6.24.fasta.gz ... done. ==> RETR md5sum.txt ... done.
Length: 42465980 (40M)
0% [ ] 0 --.-K/s done.
Length: 1489 (1.5K)
100%[==================================>] 1,489 --.-K/s in 0s
2018-10-22 13:06:53 (241 MB/s) - "ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/fasta/md5sum.txt" saved [1489]
Verify the file integrity of the gzipped fasta file using a checksum
$ md5sum -c md5sum.txt
md5sum: dmel-all-aligned-r6.24.fasta.gz: No such file or directory
dmel-all-aligned-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-CDS-r6.24.fasta.gz: No such file or directory
dmel-all-CDS-r6.24.fasta.gz: FAILED open or read
dmel-all-chromosome-r6.24.fasta.gz: OK
md5sum: dmel-all-clones-r6.24.fasta.gz: No such file or directory
dmel-all-clones-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-exon-r6.24.fasta.gz: No such file or directory
dmel-all-exon-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-five_prime_UTR-r6.24.fasta.gz: No such file or directory
dmel-all-five_prime_UTR-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-gene_extended2000-r6.24.fasta.gz: No such file or directory
dmel-all-gene_extended2000-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-gene-r6.24.fasta.gz: No such file or directory
dmel-all-gene-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-intergenic-r6.24.fasta.gz: No such file or directory
dmel-all-intergenic-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-intron-r6.24.fasta.gz: No such file or directory
dmel-all-intron-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-miRNA-r6.24.fasta.gz: No such file or directory
dmel-all-miRNA-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-miscRNA-r6.24.fasta.gz: No such file or directory
dmel-all-miscRNA-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-ncRNA-r6.24.fasta.gz: No such file or directory
dmel-all-ncRNA-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-predicted-r6.24.fasta.gz: No such file or directory
dmel-all-predicted-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-pseudogene-r6.24.fasta.gz: No such file or directory
dmel-all-pseudogene-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-sequence_features-r6.24.fasta.gz: No such file or directory
dmel-all-sequence_features-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-synteny-r6.24.fasta.gz: No such file or directory
dmel-all-synteny-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-three_prime_UTR-r6.24.fasta.gz: No such file or directory
dmel-all-three_prime_UTR-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-transcript-r6.24.fasta.gz: No such file or directory
dmel-all-transcript-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-translation-r6.24.fasta.gz: No such file or directory
dmel-all-translation-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-transposon-r6.24.fasta.gz: No such file or directory
dmel-all-transposon-r6.24.fasta.gz: FAILED open or read
md5sum: dmel-all-tRNA-r6.24.fasta.gz: No such file or directory
dmel-all-tRNA-r6.24.fasta.gz: FAILED open or read
md5sum: WARNING: 21 of 22 listed files could not be read
- Total number of nucleotides
$ zgrep -v "^>" *.fasta*| tr -d -C 'A\T\G\C\N' | wc -m
143726002
- Total number of Ns
$ zgrep -v "^>" *.fasta* | tr -d -C 'N' | wc -m
1152978
- Total number of Sequences
$zgrep -c "^>" *.fasta*
1870
Go to the most current download genomes section at flybase.org and download the gzipped gtf annotation file for D. melanogaster.
In order to download the data from the given website, we can use wget or curl.
$ wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/gtf/*
--2018-11-02 13:09:55-- ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/gtf/*
=> ".listing"
Resolving ftp.flybase.net... 52.23.126.124
Connecting to ftp.flybase.net|52.23.126.124|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/gtf ... done.
==> PASV ... done. ==> LIST ... done.
[ <=> ] 266 --.-K/s in 0s
2018-11-02 13:09:56 (45.4 MB/s) - ".listing" saved [266]
Removed ".listing".
--2018-11-02 13:09:56-- ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/gtf/dmel-all-r6.24.gtf.gz
=> "dmel-all-r6.24.gtf.gz"
==> CWD not required.
==> PASV ... done. ==> RETR dmel-all-r6.24.gtf.gz ... done.
Length: 3897419 (3.7M)
100%[======================================>] 3,897,419 2.66M/s in 1.4s
2018-11-02 13:09:58 (2.66 MB/s) - "dmel-all-r6.24.gtf.gz" saved [3897419]
--2018-11-02 13:09:58-- ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.24_FB2018_05/gtf/md5sum.txt
=> "md5sum.txt.1"
==> CWD not required.
==> PASV ... done. ==> RETR md5sum.txt ... done.
Length: 56
100%[======================================>] 56 --.-K/s in 0s
2018-11-02 13:09:58 (8.34 MB/s) - "md5sum.txt.1" saved [56]
Verify the file integrity of the gzipped gtf annotation using a checksum
$ md5sum -c md5sum.txt.1
dmel-all-r6.24.gtf.gz: OK
- Total number of features of each type, sorted from the most common to the least common
$ zcat *.gtf* | cut -f3 | sort | uniq -c | sort -nr | nl
1 187315 exon
2 161014 CDS
3 46339 5UTR
4 33358 3UTR
5 30591 start_codon
6 30533 stop_codon
7 30507 mRNA
8 17772 gene
9 2961 ncRNA
10 485 miRNA
11 334 pseudogene
12 312 tRNA
13 299 snoRNA
14 262 pre_miRNA
15 115 rRNA
16 32 snRNA
- Total number of genes per chromosome arm (X, Y, 2L, 2R, 3L, 3R, 4)
$ zcat *.gtf* | grep -P '^\S+\s\S+\s+gene\b'| cut -f1 | grep -Eo 'X|Y|2L|2R|3L|3R|4' | sort |uniq -c | sort -nr | nl
1 4202 3R
2 3628 2R
3 3501 2L
4 3464 3L
5 2676 X
6 122 4
7 113 Y