404
+ +Page not found
+ + +diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/404.html b/404.html new file mode 100644 index 0000000..a91bbfe --- /dev/null +++ b/404.html @@ -0,0 +1,120 @@ + + +
+ + + + +Page not found
+ + +A Phred score is a measure of the probability that a base call in a DNA sequencing read is incorrect. It is a logarithmic scale, meaning that a small change in the Phred score represents a large change in the probability of an error.
+$$Q = -10 \cdot \log_{10}(P)$$
+Where:
+Q is the PHRED score.
+P is the probability that the base was called incorrectly.
+For example:
+Q = 20: This corresponds to a 1 in 100 probability of an incorrect base call, or an accuracy of 99%.
+Q = 30: This corresponds to a 1 in 1000 probability of an incorrect base call, or an accuracy of 99.9%.
+Q = 40: This corresponds to a 1 in 10,000 probability of an incorrect base call, or an accuracy of 99.99%.
+# Print the header
+cat(sprintf("%-5s\t\t%-10s\n", "Phred", "Prob of"))
+cat(sprintf("%-5s\t\t%-10s\n", "score", "Incorrect call"))
+
+# Loop through Phred scores from 0 to 41
+for (phred in 0:41) {
+ cat(sprintf("%-5d\t\t%0.5f\n", phred, 10^(phred / -10)))
+}
+
+ASCII (American Standard Code for Information Interchange) is used to represent characters in computers. We can represent Phred scores using ASCII characters. The advantage is that the quality information can be easily stored in a text-based FASTQ file.
+Not all ASCII characters are printable. The first visible printable ASCII character is !
and the decimal code for !
is 33.
# Store output in a vector to fit on a slide
+output <- c(sprintf("%-8s %-8s", "Character", "ASCII #"))
+
+# Loop through ASCII values from 33 to 89
+for (i in 33:89) {
+ output <- c(output, sprintf("%-8s %-8d", intToUtf8(i), i))
+}
+
+# Print the output in a single block (e.g., to fit on a slide)
+cat(paste(output, collapse = "\n"))
+
+In a FASTQ file, Phred scores are represented as ASCII characters. These characters are converted back to numeric values (PHRED scores) based on the encoding scheme used:
+PHRED+33 Encoding (Sanger/Illumina 1.8+):
+The ASCII character for a quality score Q is calculated as:
+ASCII character=chr(Q+33)
+For example:
+chr(30 + 33) = chr(63)
, which corresponds to the ASCII character ?.
+PHRED+64 Encoding (Illumina 1.3-1.7):
+The ASCII character for a quality score Q is calculated as:
+ASCII character=chr(Q+64)
+For example:
+chr(30 + 64) = chr(94)
, which corresponds to the ASCII character ^.
# Print the header
+cat(sprintf("%-5s\t\t%-10s\t%-6s\t\t%-10s\n", "Phred", "Prob. of", "ASCII", "ASCII"))
+cat(sprintf("%-5s\t\t%-10s\t%-6s\t%-10s\n", "score", "Error", "Phred+33", "Phred+64"))
+
+# Loop through Phred scores from 0 to 41
+for (phred in 0:41) {
+ # Calculate the probability of error
+ prob_error <- 10^(phred / -10)
+
+ # Convert Phred scores to ASCII characters
+ ascii_phred33 <- intToUtf8(phred + 33)
+ ascii_phred64 <- intToUtf8(phred + 64)
+
+ # Print the results in a formatted table
+ cat(sprintf("%-5d\t\t%0.5f\t\t%-6s\t\t%-10s\n",
+ phred, prob_error,
+ ascii_phred33, ascii_phred64))
+}
+
+
+ A Phred score is a measure of the probability that a base call in a DNA sequencing read is incorrect. It is a logarithmic scale, meaning that a small change in the Phred score represents a large change in the probability of an error.
+$$Q = -10 \cdot \log_{10}(P)$$
+Where:
+Q is the PHRED score.
+P is the probability that the base was called incorrectly.
+For example:
+Q = 20: This corresponds to a 1 in 100 probability of an incorrect base call, or an accuracy of 99%.
+Q = 30: This corresponds to a 1 in 1000 probability of an incorrect base call, or an accuracy of 99.9%.
+Q = 40: This corresponds to a 1 in 10,000 probability of an incorrect base call, or an accuracy of 99.99%.
+# Print the header
+cat(sprintf("%-5s\t\t%-10s\n", "Phred", "Prob of"))
+cat(sprintf("%-5s\t\t%-10s\n", "score", "Incorrect call"))
+
+# Loop through Phred scores from 0 to 41
+for (phred in 0:41) {
+ cat(sprintf("%-5d\t\t%0.5f\n", phred, 10^(phred / -10)))
+}
+
+Phred Prob of
+score Incorrect call
+0 1.00000
+1 0.79433
+2 0.63096
+3 0.50119
+4 0.39811
+5 0.31623
+6 0.25119
+7 0.19953
+8 0.15849
+9 0.12589
+10 0.10000
+11 0.07943
+12 0.06310
+13 0.05012
+14 0.03981
+15 0.03162
+16 0.02512
+17 0.01995
+18 0.01585
+19 0.01259
+20 0.01000
+21 0.00794
+22 0.00631
+23 0.00501
+24 0.00398
+25 0.00316
+26 0.00251
+27 0.00200
+28 0.00158
+29 0.00126
+30 0.00100
+31 0.00079
+32 0.00063
+33 0.00050
+34 0.00040
+35 0.00032
+36 0.00025
+37 0.00020
+38 0.00016
+39 0.00013
+40 0.00010
+41 0.00008
+
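The formula can also be evaluated in both directions from the shell. The following is a minimal sketch, assuming `awk` is available; the function names `phred_to_prob` and `prob_to_phred` are made up for this illustration and are not part of any tool.

```shell
# Hypothetical helper: error probability for a Phred score, P = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.5f\n", 10 ^ (-q / 10) }'
}

# Hypothetical helper: Phred score for an error probability, Q = -10 * log10(P)
# awk's log() is the natural logarithm, so divide by log(10)
prob_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

phred_to_prob 30     # error probability for Q = 30
prob_to_phred 0.01   # Phred score for P = 0.01
```

The first call should agree with the row for Q = 30 in the table above.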
+ASCII (American Standard Code for Information Interchange) is used to represent characters in computers. We can represent Phred scores using ASCII characters. The advantage is that the quality information can be easily stored in a text-based FASTQ file.
+Not all ASCII characters are printable. The first visible printable ASCII character is !
and the decimal code for !
is 33.
# Store output in a vector to fit on a slide
+output <- c(sprintf("%-8s %-8s", "Character", "ASCII #"))
+
+# Loop through ASCII values from 33 to 89
+for (i in 33:89) {
+ output <- c(output, sprintf("%-8s %-8d", intToUtf8(i), i))
+}
+
+# Print the output in a single block (e.g., to fit on a slide)
+cat(paste(output, collapse = "\n"))
+
+Character ASCII #
+! 33
+" 34
+# 35
+$ 36
+% 37
+& 38
+' 39
+( 40
+) 41
+* 42
++ 43
+, 44
+- 45
+. 46
+/ 47
+0 48
+1 49
+2 50
+3 51
+4 52
+5 53
+6 54
+7 55
+8 56
+9 57
+: 58
+; 59
+< 60
+= 61
+> 62
+? 63
+@ 64
+A 65
+B 66
+C 67
+D 68
+E 69
+F 70
+G 71
+H 72
+I 73
+J 74
+K 75
+L 76
+M 77
+N 78
+O 79
+P 80
+Q 81
+R 82
+S 83
+T 84
+U 85
+V 86
+W 87
+X 88
+Y 89
+
+In a FASTQ file, Phred scores are represented as ASCII characters. These characters are converted back to numeric values (PHRED scores) based on the encoding scheme used:
+PHRED+33 Encoding (Sanger/Illumina 1.8+):
+The ASCII character for a quality score Q is calculated as:
+ASCII character=chr(Q+33)
+For example:
+chr(30 + 33) = chr(63)
, which corresponds to the ASCII character ?.
+PHRED+64 Encoding (Illumina 1.3-1.7):
+The ASCII character for a quality score Q is calculated as:
+ASCII character=chr(Q+64)
+For example:
+chr(30 + 64) = chr(94)
, which corresponds to the ASCII character ^.
# Print the header
+cat(sprintf("%-5s\t\t%-10s\t%-6s\t\t%-10s\n", "Phred", "Prob. of", "ASCII", "ASCII"))
+cat(sprintf("%-5s\t\t%-10s\t%-6s\t%-10s\n", "score", "Error", "Phred+33", "Phred+64"))
+
+# Loop through Phred scores from 0 to 41
+for (phred in 0:41) {
+ # Calculate the probability of error
+ prob_error <- 10^(phred / -10)
+
+ # Convert Phred scores to ASCII characters
+ ascii_phred33 <- intToUtf8(phred + 33)
+ ascii_phred64 <- intToUtf8(phred + 64)
+
+ # Print the results in a formatted table
+ cat(sprintf("%-5d\t\t%0.5f\t\t%-6s\t\t%-10s\n",
+ phred, prob_error,
+ ascii_phred33, ascii_phred64))
+}
+
+Phred Prob. of ASCII ASCII
+score Error Phred+33 Phred+64
+0 1.00000 ! @
+1 0.79433 " A
+2 0.63096 # B
+3 0.50119 $ C
+4 0.39811 % D
+5 0.31623 & E
+6 0.25119 ' F
+7 0.19953 ( G
+8 0.15849 ) H
+9 0.12589 * I
+10 0.10000 + J
+11 0.07943 , K
+12 0.06310 - L
+13 0.05012 . M
+14 0.03981 / N
+15 0.03162 0 O
+16 0.02512 1 P
+17 0.01995 2 Q
+18 0.01585 3 R
+19 0.01259 4 S
+20 0.01000 5 T
+21 0.00794 6 U
+22 0.00631 7 V
+23 0.00501 8 W
+24 0.00398 9 X
+25 0.00316 : Y
+26 0.00251 ; Z
+27 0.00200 < [
+28 0.00158 = \
+29 0.00126 > ]
+30 0.00100 ? ^
+31 0.00079 @ _
+32 0.00063 A `
+33 0.00050 B a
+34 0.00040 C b
+35 0.00032 D c
+36 0.00025 E d
+37 0.00020 F e
+38 0.00016 G f
+39 0.00013 H g
+40 0.00010 I h
+41 0.00008 J i
+
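Going the other way, a quality string from a FASTQ record can be decoded back into Phred scores by subtracting the offset from each character code. The sketch below assumes Phred+33 encoding; the function name `decode_phred33` and the sample string are hypothetical.

```shell
# Hypothetical helper: decode a Phred+33 quality string into numeric scores.
# od prints each byte as an unsigned decimal; awk subtracts the offset of 33.
decode_phred33() {
    printf '%s' "$1" | od -An -tu1 |
        awk '{ for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - 33; sep = " " } }
             END { print "" }'
}

# Four quality characters that can be looked up in the table above
decode_phred33 'I5?!'
```

For Phred+64 data the same idea applies with 64 in place of 33.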
+
+ You are in your home directory after you log into the system and are directed to the shell command prompt. This section will show you how to explore the Linux file system using shell commands.
+To understand the Linux file system, you can imagine it as a tree structure.
+ +In Linux, a path is a unique location of a file or a directory in the file system.
+For convenience, the Linux file system is usually thought of as a tree structure. On a standard Linux system you will find the layout generally follows the scheme presented below.
+The tree of the file system starts at the trunk or slash, indicated by a forward slash (/
). This directory, containing all underlying directories and files, is also called the root directory or “the root” of the file system.
%%bash
+## In your account, you will see a folder
+## with your account ID as the name
+cd ~
+echo $HOME
+
+/home/xie186
+
+An absolute path is defined as the location of a file or directory from the root directory (/). An absolute path starts from the root
of the tree (/
).
Here are some examples:
+/home/xie186
+/home/xie186/.bashrc
+
+A relative path is a path given relative to the present working directory, for example:
+data/sample1/
and ../doc/
.
If you want to get the absolute path from a relative path, you can use readlink
with the option -f
:
pwd
+readlink -f ../
+
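The difference between the two kinds of path can be tried out in a throwaway directory. This is only a sketch: the directory names are illustrative, and it assumes a `readlink` that supports `-f` (as GNU coreutils does).

```shell
# Build a small throwaway tree (directory names are illustrative)
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data/sample1" "$workdir/project/doc"

cd "$workdir/project/data"

# A relative path is interpreted from the current directory
rel_path=../doc
# readlink -f turns the relative path into an absolute one
abs_path=$(readlink -f "$rel_path")

echo "relative: $rel_path"
echo "absolute: $abs_path"
```

The absolute path starts with `/` and stays valid after changing directory, while `../doc` only makes sense from inside `data`.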
+Once we enter a Linux file system, we need to know 1) where we are; 2) how to get where we want to go; and 3) what files or directories exist in a particular path.
+pwd
In order to know where we are, we need to use the pwd
command. The command pwd
is short for “print name of current/working directory”. It will return the full path of the current directory.
Command pwd is almost always used by itself. This means you only need to type pwd
and press ENTER.
%%bash
+pwd
+
+ls
After you know where you are, you will want to know what you have in that
+directory. You can use the command ls
to list directory contents.
Its syntax is:
+ls [option]... [file]...
+
+
+ls
with no option will list files and directories in bare format. Bare format means the detailed information (type, size, modified date and time, permissions, links, etc.) won’t be shown. When you use ls
by itself, it will list files and directories in the current directory.
ls ~/
+ls -a
+ls -ld
+
+Linux command options can be combined without a space between them and with a single - (dash).
+The following command is a faster way to use the l and t options and gives the same output as ls -l -t ~/.bashrc.
+ls -lt ~/.bashrc
+
+-rw-r--r--. 1 xie186 zt-bioi611 1067 Aug 22 22:27 /home/xie186/.bashrc
+
+
+
+
+
+cd
Unlike pwd
, when you use cd
you usually need to provide the path (either absolute or relative) that you want to enter.
If you do not provide any path, you will change to your home directory by default.
+Path | Shortcuts | Description |
---|---|---|
Single dot | . | The current folder |
Double dots | .. | The folder above the current folder |
Tilde character | ~ | Home directory (normally the directory /home/my_login_name) |
Dash | - | Your last working directory |
Here are some examples:
+cd ~
+pwd
+ls
+ls ../
+##
+pwd
+cd ../
+pwd
+cd ./
+pwd
+
+Each directory has two entries in it at the start, with names .
(a link to itself) and ..
(a link to its parent directory). The exception, of course, is the root directory, where the ..
directory also refers to the root directory.
Sometimes you go to a new directory and do something, then you remember that you need to go back to the previous working directory. To get back instantly, use a dash.
+%%bash
+
+# This is our current directory
+pwd
+
+# Let us go to our home directory
+cd ~
+
+# Check where we are
+pwd
+
+# Let us go to your previous working directory
+cd -
+# Check where we are now
+pwd
+
+/home/xie186/BIOI611_lab/docs
+/home/xie186
+/home/xie186/BIOI611_lab/docs
+/home/xie186/BIOI611_lab/docs
+
+In Linux, manipulating files and directories is among the most frequent tasks. In this section, you will learn how to copy, rename, remove, and create files and directories.
+cp
In Linux, command cp
can help you copy files and directories into a target directory.
mv
Move files/folders and rename file/folders using mv
:
# move a file from one location to another
+mv file1 target_directory/
+# rename a file
+mv file1 file2
+# move several files into a directory
+mv file1 file2 file3 target_directory/
+
+
+mkdir
The syntax is shown as below:
+mkdir [OPTION ...] DIRECTORY ...
+
+Multiple directories can be specified when calling mkdir
mkdir directory1 directory2
+mkdir -p foo/bar/baz
+
+How to define a complex directory tree with one command:
+mkdir -p project/{software,results,doc/{html,info,pdf},scripts}
+
+
+Then you can view the directory using tree
.
rm
You can use rm to remove both files and directories.
+## You can remove one file.
+rm file1
+## `rm` can remove multiple files simultaneously
+rm file2 file3
+
+You can also use rm to remove a folder. To remove a folder together with its contents, whether it is empty or not, use rm with -r
.
rm -r FOLDER
+
+If a folder contains write-protected files, rm asks for confirmation before deleting them; to skip the prompts, use rm with -r
and -f
.
mkdir test_folder
+rm -r test_folder
+
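The behaviour of directory removal can be checked safely in a temporary directory. The sketch below also uses `rmdir`, a standard command not covered above that only removes empty directories; the folder names are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"

mkdir empty_folder full_folder
touch full_folder/file1

# rmdir only removes empty directories
rmdir empty_folder

# rmdir refuses to remove a directory that still contains files
if rmdir full_folder 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi
echo "rmdir on a non-empty folder: $rmdir_result"

# rm -r removes the directory together with its contents
rm -r full_folder
```

Because `rm -r` deletes everything below the given path without a recycle bin, it is worth double-checking the path before pressing ENTER.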
+cat
, more
and less
The command cat is short for concatenate files and print on the standard output.
+The syntax is shown as below:
+cat [OPTION]... [FILE]...
+
+For small text files, cat can be used to view the files on the standard output.
+The command more is an old utility. When the text passed to it is too large to fit on one screen, it pages it. You can scroll down but not up.
+The syntax of more
is shown below:
more [options] file [...]
+
+The command less was written by a man who was fed up with more’s inability to scroll backwards through a file. He turned less into an open source project and over time, various individuals added new features to it. less is massive now. That’s why some small embedded systems have more but not less. For comparison, less’s source is over 27000 lines long. more implementations are generally only a little over 2000 lines long.
+The syntax of less is shown below:
+less [options] file [...]
+
+head
and tail
The command head
is used to output the first part of files. By default, it outputs the first 10 lines of the file.
head [OPTION]... [FILE]...
+
+Here is an example of printing the first 5 lines of a file:
+head -n 5 code_perl/variable_assign.pl
+
+In fact, the letter n does not even need to be used at all. Just the hyphen and the integer (with no intervening space) are sufficient to tell head how many lines to return. Thus, head -5 is shorthand for head -n 5, as in the following command:
+head -5 target_file.txt
+
+The command tail
is used to output the last part of files. By default, it prints the last 10 lines of the file to standard output.
The syntax is shown below:
+tail [OPTION]... [FILE]...
+
+Here is an example of printing the last 5 lines of a file:
+tail -5 target_file.txt
+
+To view lines from a specific point in a file, you can use -n +NUMBER
with the tail
command. For example, the following shows the file starting from the 2nd line.
tail -n +2 target_file.txt
+
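Combining the two commands makes it possible to extract a range of lines from the middle of a file. The following sketch builds its own sample file; the file content and variable names are made up.

```shell
target_file=$(mktemp)
# Ten numbered lines as sample input
printf 'line%s\n' 1 2 3 4 5 6 7 8 9 10 > "$target_file"

# Start at line 3 with tail, then keep 3 lines with head: lines 3 to 5
middle=$(tail -n +3 "$target_file" | head -n 3)
echo "$middle"
```

The same pattern is handy for skipping a header line before further processing, for example `tail -n +2 file | head`.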
+
+
+In most shell environments, the programmable completion feature will also improve your typing speed. It lets you type a partial command, file, or directory name and then press the TAB
key to auto-complete it. If there is more than one possible completion, then TAB will list all of them.
Type one or more letters, press the Tab key twice, and a list of commands starting with these letters appears. For example: type so
, press the Tab
key twice, and then you get the list as:
soelim sort sotruss soundstretch source
+
+Demonstration of programmable completion feature.
+In Linux, file permissions are a vital aspect of system security and resource management. This is particularly important in bioinformatics, where large datasets and scripts are often shared across teams. Permissions determine who can read, write, or execute a file, ensuring that critical data is not accidentally modified or deleted.
+Three Permission Categories: user (u), group (g), and others (o).
+Permission Types: read (r), write (w), and execute (x).
+%%bash
+groups $USER animako eunal gstewar1 mjames17 mjeakle nmilza rahooper
+
+xie186 : zt-bioi611 zt-bioi611_mgr
+animako : zt-bioi611
+eunal : zt-bioi611
+gstewar1 : zt-bioi611
+mjames17 : zt-bioi611
+mjeakle : zt-bioi611
+nmilza : zt-bioi611
+rahooper : zt-bioi611
+
+%%bash
+mkdir -p ~/test_permission/
+touch ~/test_permission/test.txt
+ls -l ~/test_permission/
+rm -rf ~/test_permission/
+
+total 0
+-rw-r--r--. 1 xie186 zt-bioi611 0 Sep 8 22:52 test.txt
+
+Here, the first character represents the type of file (e.g., -
for a regular file or d
for a directory), followed by three groups of three characters, each representing the permissions for the user
, group
, and others
, respectively.
Examples:
+-rwxr-xr--
: The owner
has read
, write
, and execute
permissions. The group has read
and execute
permissions, while others can only read the file.
+drwxr-x---
: A directory where the owner can read, write, and access (execute). The group can only read and access, while others have no permissions.
Modify file permissions using the chmod
command. Permissions can be set in two ways:
Symbolic Mode:
+In symbolic mode, you modify permissions by referencing the categories (user, group, other) and specifying whether you're adding (+), removing (-), or setting (=) permissions.
+# Add execute permission for the user:
+chmod u+x filename
+# Remove write permission for the group:
+chmod g-w filename
+# Set read-only permission for others:
+chmod o=r filename
+
+Symbolic mode is intuitive and flexible, especially when you want to make precise adjustments to permissions without affecting other categories. This is useful for common file-sharing tasks in bioinformatics where you need to tweak access for specific collaborators.
+Numeric Mode (Octal representation):
+In numeric mode, file permissions are set using a three-digit number. Each digit represents the permissions for user, group, and other, respectively. Each digit is calculated by adding the values of the read (4), write (2), and execute (1) permissions:
Example Permission Breakdown:
+Read (r), Write (w), and Execute (x) for user = 7
+Read (r) and Execute (x) for group = 5
+Read (r) only for others = 4
+chmod 754 filename
+
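The arithmetic behind `754` can be verified by applying it to a scratch file and reading the result back. This is a small sketch using a temporary file; the paths are illustrative.

```shell
workdir=$(mktemp -d)
touch "$workdir/filename"

# 7 = 4 (r) + 2 (w) + 1 (x), 5 = 4 (r) + 1 (x), 4 = r only
chmod 754 "$workdir/filename"

# The first 10 characters of the long listing are the file type and permission bits
mode=$(ls -l "$workdir/filename" | cut -c1-10)
echo "$mode"
```

The listing should show `rwx` for the user, `r-x` for the group, and `r--` for others.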
+An example to help you understand executable
:
%%bash
+printf '#!/user/bin/python\nprint("Hello, Welcome to Course BIOI611!")' > ~/test.py
+
+%%bash
+ls -l ~/test.py
+python ~/test.py
+
+-rw-r--r--. 1 xie186 zt-bioi611 61 Sep 8 23:06 /home/xie186/test.py
+Hello, Welcome to Course BIOI611!
+
+The error message below is shown if you try to run ~/test.py
as a program:
bash: line 1: /home/xie186/test.py: No such file or directory
+
+%%bash
+chmod u+x ~/test.py
+ls -l ~/test.py
+python ~/test.py
+rm ~/test.py
+
+-rwxr--r--. 1 xie186 zt-bioi611 61 Sep 8 23:06 /home/xie186/test.py
+Hello, Welcome to Course BIOI611!
+
+The Linux du
(short for Disk Usage) is a standard Unix/Linux command used to check the disk usage of files and directories on a machine. The du command has many options that control the format of its output. The du
command also reports file and directory sizes recursively.
%%bash
+du -h ~/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
+
+2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
+
+%%bash
+du -ah ~/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
+
+2.9M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbList.fromGTF.out.tab
+7.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/Log.out
+936M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/SA
+1.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/SAindex
+3.0M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/transcriptInfo.tab
+2.3M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbList.out.tab
+1.5M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/geneInfo.tab
+1.0K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/genomeParameters.txt
+512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrLength.txt
+512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrNameLength.txt
+512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrStart.txt
+7.6M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/exonGeTrInfo.tab
+3.1M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/exonInfo.tab
+2.8M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbInfo.txt
+512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrName.txt
+119M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/Genome
+2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
+
+%%bash
+du -csh /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/*
+
+19G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/raw_data
+0 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/raw_data_smart_seq
+1.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_data.sub
+575K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq-7478223-xie186.err
+0 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq-7478223-xie186.out
+8.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq.sub
+2.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s2_star.sub
+34G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_align
+2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref
+512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/test.sub
+512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/test.txt
+55G total
+
+A symbolic link, similar to a shortcut, points to another file or folder.
+ln -s <path_to_files/folder_to_be_linked> <symlink_to_be_created>
+ls -l <symlink>
+unlink <symlink>
+
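A concrete run of the three commands above may help. The sketch below works in a temporary directory; the file names and the file content are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "reference genome placeholder" > original.txt

# Create a symbolic link pointing to the file
ln -s original.txt link_to_original.txt

# readlink (without -f) prints the target stored in the link
link_target=$(readlink link_to_original.txt)

# Reading through the link returns the content of the original file
link_content=$(cat link_to_original.txt)
echo "$link_target: $link_content"

# unlink removes the link only; the original file is untouched
unlink link_to_original.txt
```

Links like this are a common way to make a large shared dataset appear in a project folder without copying it.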
+
+Compress one file:
+%%bash
+perl -e 'for($i=0; $i<10000; ++$i){ print "test\n";}' > test.txt
+du -h test.txt
+gzip test.txt
+du -h test.txt.gz
+gunzip test.txt
+ls test.txt
+rm test.txt
+
+52K test.txt
+4.0K test.txt.gz
+test.txt
+
+Compress multiple files:
+
+%%bash
+perl -e 'for($i=0; $i<10000; ++$i){ print "test\n";}' > test1.txt
+perl -e 'for($i=0; $i<10000; ++$i){ print "test\n";}' > test2.txt
+du -h test1.txt test2.txt
+tar zcvf test.tar.gz test1.txt test2.txt
+du -sh test.tar.gz
+ls test1.txt test2.txt
+
+52K test1.txt
+52K test2.txt
+test1.txt
+test2.txt
+4.0K test.tar.gz
+test1.txt
+test2.txt
+
+z
: This option tells tar to compress the archive using gzip. By convention, the archive name ends in .gz (here test.tar.gz) to indicate that it has been compressed with the gzip utility.
c
: This option stands for create. It instructs tar to create a new archive.
v
: This stands for verbose. When used, tar will display detailed information about the files being added to the archive, such as their names.
f
: This stands for file. It tells tar that the next argument (test.tar.gz) is the name of the archive file to create.
%%bash
+tar tvf test.tar.gz
+rm test.tar.gz test1.txt test2.txt
+
+-rw-r--r-- xie186/zt-bioi611 50000 2024-08-25 21:52 test1.txt
+-rw-r--r-- xie186/zt-bioi611 50000 2024-08-25 21:52 test2.txt
+
+t
: List the contents of the archive (here test.tar.gz).
v
: Display additional details about each file (like file permissions, size, and modification date).
f
: Specifies that the next argument (test.tar.gz) is the archive file to operate on.
To uncompress a tar.gz
file, use tar zxvf
:
tar zxvf test.tar.gz
+
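The create, list, and extract steps can be combined into one round trip. This sketch runs in a temporary directory and uses `-C`, a tar option not shown above, to extract into a separate folder; the file names are illustrative.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf 'test\n' > test1.txt
printf 'test\n' > test2.txt

# Create a gzip-compressed archive (no v, so nothing is printed)
tar zcf test.tar.gz test1.txt test2.txt

# List the archive members without extracting them
members=$(tar ztf test.tar.gz)
echo "$members"

# Extract into a separate directory with -C so existing files are not overwritten
mkdir extracted
tar zxf test.tar.gz -C extracted
```

Extracting into a fresh directory is a cautious habit, since tar overwrites files of the same name by default.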
+Basic Syntax of scp
:
scp [options] source destination
+
+
+Copy a Local File to a Remote Server
+
+scp file.txt username@remote_host:/path/to/destination/
+
+An alternative command is rsync
.
find
The find
command is designed for comprehensive file and directory searches.
find [path] [options] [expression]
+
+%%bash
+find /home/xie186/scratch/bioi611/bulk_RNAseq -name "*.fastq.gz"
+
+/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day7_rep3.fastq.gz
+/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day1_rep3.fastq.gz
+/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day1_rep1.fastq.gz
+/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day7_rep1.fastq.gz
+/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day1_rep2.fastq.gz
+/home/xie186/scratch/bioi611/bulk_RNAseq/raw_data/N2_day7_rep2.fastq.gz
+
+wc
%%bash
+find /home/xie186/scratch/bioi611/bulk_RNAseq -name "*.fastq.gz" |wc -l
+
+6
+
+|
In Linux and Unix-based systems, the pipe (|
) is used in the command line to redirect the output of one command as the input to another command. This allows you to chain commands together and perform more complex tasks in a single line.
%%bash
+grep '>' ~/scratch/bioi611/reference/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa |wc -l
+
+7
+
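The same counting pattern can be tried on a tiny file without access to the reference genome. The FASTA content below is made up for illustration; `grep -c` is shown as an alternative that counts matching lines without a pipe.

```shell
fasta=$(mktemp)
# A tiny made-up FASTA file with three records
printf '>chrI\nACGT\n>chrII\nGGCC\n>MtDNA\nTTAA\n' > "$fasta"

# Pipe: grep selects the header lines, wc -l counts them
n_pipe=$(grep '>' "$fasta" | wc -l | tr -d ' ')

# grep -c counts matching lines itself, so no pipe is needed
n_count=$(grep -c '>' "$fasta")

echo "$n_pipe $n_count"
```

Both approaches should report the same number of sequences; the pipe version generalises more easily to longer chains of commands.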
+Command cut
can be used to print selected parts of lines from each FILE to standard output.
%%bash
+wget -O GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz "https://ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE102537&format=file&file=GSE102537_raw_counts_GRCh38.p13_NCBI.tsv.gz"
+
+--2024-08-25 21:08:03-- https://ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE102537&format=file&file=GSE102537_raw_counts_GRCh38.p13_NCBI.tsv.gz
+Resolving ncbi.nlm.nih.gov (ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
+Connecting to ncbi.nlm.nih.gov (ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
+HTTP request sent, awaiting response... 200 OK
+Length: 349584 (341K) [application/octet-stream]
+Saving to: ‘GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz’
+
+ 0K .......... .......... .......... .......... .......... 14% 6.66M 0s
+ 50K .......... .......... .......... .......... .......... 29% 16.9M 0s
+ 100K .......... .......... .......... .......... .......... 43% 27.5M 0s
+ 150K .......... .......... .......... .......... .......... 58% 10.1M 0s
+ 200K .......... .......... .......... .......... .......... 73% 17.2M 0s
+ 250K .......... .......... .......... .......... .......... 87% 37.6M 0s
+ 300K .......... .......... .......... .......... . 100% 10.5M=0.02s
+
+2024-08-25 21:08:04 (13.4 MB/s) - ‘GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz’ saved [349584/349584]
+
+%%bash
+zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |head
+
+GeneID GSM2740270 GSM2740272 GSM2740273 GSM2740274 GSM2740275
+100287102 9 17 14 14 19
+653635 336 470 467 310 370
+102466751 8 56 46 31 31
+107985730 0 2 2 3 3
+100302278 0 1 0 0 2
+645520 0 3 8 4 7
+79501 0 2 2 1 4
+100996442 16 25 34 20 28
+729737 19 39 33 22 26
+
+%%bash
+zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |cut -f1,2,3 |head
+
+GeneID GSM2740270 GSM2740272
+100287102 9 17
+653635 336 470
+102466751 8 56
+107985730 0 2
+100302278 0 1
+645520 0 3
+79501 0 2
+100996442 16 25
+729737 19 39
+
+%%bash
+grep '>' ~/scratch/bioi611/reference/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa
+
+>I dna:chromosome chromosome:WBcel235:I:1:15072434:1 REF
+>II dna:chromosome chromosome:WBcel235:II:1:15279421:1 REF
+>III dna:chromosome chromosome:WBcel235:III:1:13783801:1 REF
+>IV dna:chromosome chromosome:WBcel235:IV:1:17493829:1 REF
+>V dna:chromosome chromosome:WBcel235:V:1:20924180:1 REF
+>X dna:chromosome chromosome:WBcel235:X:1:17718942:1 REF
+>MtDNA dna:chromosome chromosome:WBcel235:MtDNA:1:13794:1 REF
+
+%%bash
+zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |wc -l
+zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |awk '$2>500' |wc -l
+zcat GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz |awk '$2>500 && $3>500' |wc -l
+
+39377
+8773
+3820
+
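One detail worth checking in filters like these is the header line: its sample-name column is text, which awk compares as a string rather than a number, so the header can slip through a numeric-looking test. The sketch below, on a small made-up table, skips the header explicitly with `NR > 1`; the file content and variable names are hypothetical.

```shell
counts=$(mktemp)
# A tiny made-up count table: header plus three genes
printf 'GeneID\tS1\tS2\n'  > "$counts"
printf 'g1\t900\t700\n'   >> "$counts"
printf 'g2\t800\t10\n'    >> "$counts"
printf 'g3\t5\t600\n'     >> "$counts"

# NR > 1 skips the header; -F '\t' splits on tabs only
n_one=$(awk -F '\t' 'NR > 1 && $2 > 500' "$counts" | wc -l | tr -d ' ')
n_both=$(awk -F '\t' 'NR > 1 && $2 > 500 && $3 > 500' "$counts" | wc -l | tr -d ' ')

echo "$n_one $n_both"
```

Whether the header matters depends on the question being asked, but making the choice explicit avoids off-by-one surprises in gene counts.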
+%%bash
+grep '>' ~/scratch/bioi611/reference/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa |sed 's/>//' |sed 's/ .*//'
+
+I
+II
+III
+IV
+V
+X
+MtDNA
+
+Regular expressions are sequences of characters that define search patterns. They are commonly used for string matching, searching, and text processing.
+Regex is used in text editors, programming languages, command-line tools (like grep
and sed
), and many bioinformatics tools to search, replace, or extract data from text.
.
(dot): Matches any single character except a newline.
+Example: A.G
matches "AAG", "ATG", "ACG", etc.
^
: Matches the start of a line.
+Example: ^A
matches any line starting with "A".
$
: Matches the end of a line.
+Example: end$
matches any line ending with "end".
*
: Matches 0 or more occurrences of the preceding character.
+Example: ca*t
matches "ct", "cat", "caat", "caaat", etc.
+
: Matches 1 or more occurrences of the preceding character.
+Example: ca+t
matches "cat", "caat", "caaat", etc.
?
: Matches 0 or 1 occurrence of the preceding character.
+Example: colou?r
matches both "color" and "colour".
[]
: Matches any one of the characters inside the brackets.
+Example: [aeiou]
matches any vowel.
|
: Alternation (OR) operator.
+Example: cat|dog
matches either "cat" or "dog".
\d
: Matches any digit (equivalent to [0-9]).
\w
: Matches any word character (alphanumeric or underscore).
\s
: Matches any whitespace character (spaces, tabs, etc.).
\D
: Matches any non-digit character.
\W
: Matches any non-word character.
\S
: Matches any non-whitespace character.
{n}
: Matches exactly n occurrences.
+Example: A{3} matches "AAA".
{n,}
: Matches n or more occurrences.
+Example: T{2,} matches "TT", "TTT", "TTTT", etc.
{n,m}
: Matches between n and m occurrences.
+Example: G{1,3} matches "G", "GG", or "GGG".
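A few of the patterns above can be tested directly with grep. This is a sketch on made-up input; it assumes a grep that supports `-E` (extended regular expressions) and `-o` (print only the match), as GNU and BSD grep do.

```shell
# grep -E enables extended regular expressions (+, ?, |, {n,m}); -c counts matching lines
n_colour=$(printf 'color\ncolour\ncolr\n' | grep -Ec 'colou?r')

# ^ anchors the match to the start of the line, . matches any single character
n_start=$(printf 'ATG\nGAT\nAAG\n' | grep -Ec '^A.G')

# -o prints only the matched text: runs of two or more T
runs=$(printf 'ATTTGCTTA\n' | grep -Eo 'T{2,}' | tr '\n' ' ')

echo "$n_colour $n_start $runs"
```

Shorthand classes such as \d are Perl-style; with GNU grep they need the -P option, while a bracket expression like [0-9] works in every mode.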
%%bash
+grep -v '#' ~/scratch/bioi611/reference/Caenorhabditis_elegans.WBcel235.111.gtf \
+ |awk '$3=="gene"' \
+ |sed 's/.*gene_biotype "//' \
+ |sed 's/";//'|sort |uniq -c \
+ | sort -k1,1n
+
+ 22 rRNA
+ 100 antisense_RNA
+ 129 snRNA
+ 194 lincRNA
+ 261 miRNA
+ 346 snoRNA
+ 634 tRNA
+ 2128 pseudogene
+ 7764 ncRNA
+ 15363 piRNA
+ 19985 protein_coding
+
+Environment variables are dynamic values that affect the behavior of processes and programs in Linux. They are commonly used to store configuration data and are essential in bioinformatics workflows for defining paths to software, libraries, and datasets.
+PATH
:The PATH
variable specifies directories where the system looks for executable files when a command is run.
%%bash
+echo $PATH
+
+/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/texlive/bin/x86_64-linux:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/imagemagick/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/graphviz/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/ghostscript/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/ffmpeg/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/mpi-nocuda/linux-rhel8-zen2/gcc/11.3.0/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/nompi-nocuda/linux-rhel8-zen2/gcc/11.3.0/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/compiler/linux-rhel8-zen2/gcc/11.3.0/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/linux-rhel8-x86_64/gcc-rh8-8.5.0/gcc-11.3.0-oedkmii7vhd6rbnqm6xufmg7d3jx4w6l/bin:/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/linux-rhel8-zen2/gcc-11.3.0/py-jupyter-1.0.0-trwwgzwljql55mhmaygcuxb3nvaevjsu/bin:/software/acigs-utilities/bin:/home/xie186/miniforge3/bin:/home/xie186/miniforge3/condabin:/home/xie186/SHELL.bioi611/software/STAR_2.7.11b/Linux_x86_64_static:/home/xie186/.local/bin:/home/xie186/bin:/software/acigs-utilities/bin:/usr/share/Modules/bin:/usr/lib/heimdal/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/symas/bin:/opt/dell/srvadmin/bin
+
+HOME
:The HOME
variable stores the path to the user’s home directory.
%%bash
+echo $HOME
+
+/home/xie186
+
+%%bash
+echo $SHELL
+
+/bin/bash
+
+Temporarily setting a variable (valid only for the current shell session):
+export PATH=value:$PATH
+
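To see the effect of prepending a directory to PATH, the sketch below creates a throwaway directory with one script and asks the shell where it finds the command. The directory and the command name `my_tool` are hypothetical.

```shell
# Hypothetical tool directory containing one executable script
tool_dir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from my_tool"\n' > "$tool_dir/my_tool"
chmod u+x "$tool_dir/my_tool"

# Prepend the directory; $PATH expands to the existing value
export PATH="$tool_dir:$PATH"

# command -v reports where the shell now finds the command
found=$(command -v my_tool)
echo "$found"
```

Note the `$` in `$PATH`: without it, the literal word PATH would replace the existing search path and most commands would no longer be found.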
+Permanently setting a variable:
+To make the environment variable persistent across sessions,
+it needs to be added to configuration files like .bashrc
or .bash_profile
.
+Example: Add the following line to .bashrc
:
Conda is a popular package management system, especially in bioinformatics, due to its ability to create isolated environments. This is crucial when working with tools that have conflicting dependencies.
+conda/miniforge
Go to: https://github.com/conda-forge/miniforge/releases
+Download the corresponding installation file
+%%bash
+uname -m
+
+x86_64
+
+wget https://github.com/conda-forge/miniforge/releases/download/24.7.1-0/Mambaforge-24.7.1-0-Linux-x86_64.sh
+
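The link between the `uname -m` output and the installer name can be sketched as follows. This only assembles and prints a URL in the same pattern as the `wget` command above; that an installer exists for every operating system and architecture combination is an assumption, so check the releases page before downloading.

```shell
# Version string taken from the wget command above.
version="24.7.1-0"

# uname -s prints the kernel name (e.g. Linux); uname -m prints the architecture (e.g. x86_64).
os=$(uname -s)
arch=$(uname -m)

installer="Mambaforge-${version}-${os}-${arch}.sh"
url="https://github.com/conda-forge/miniforge/releases/download/${version}/${installer}"

# Print only; nothing is downloaded here.
echo "$url"
```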
+conda create -n bioi611
+conda activate bioi611
+conda install bioconda::fastqc==0.11.8
+
+git clone https://github.com/lh3/bwa.git
+cd bwa; make
+./bwa index ref.fa
+
+https://hub.docker.com/r/biocontainers/bwa/
+module load singularity
+singularity build bwa_v0.7.17_cv1.sif docker://biocontainers/bwa:v0.7.17_cv1
+
+In Linux, we sometimes need to create or edit a text file, for example when writing a new Perl script, so we need a text editor.
+Newcomers often prefer a basic, GUI-based text editor with menus and traditional CUA key bindings. Here we recommend Sublime, ATOM and Notepad++.
+But a GUI-based text editor is not always available on a Linux system.
+A powerful screen text editor vi
(pronounced “vee-eye”) is available on nearly all Linux systems. We highly recommend vi
as a text editor, because sometimes we’ll have to edit a text file on a system without a friendlier text editor. Once we get familiar with vi
, we’ll find that it’s very fast and powerful.
But remember, it’s OK if you think this part is too difficult at the beginning. You can use either Sublime
, ATOM
or Notepad++
. If you are connecting to a Linux system without Sublime
, ATOM
or Notepad++
, you can write the file on a local computer and then upload it to the Linux system.
vi
skills
+As vi
uses many keystroke combinations, it may not be easy for newcomers to remember them all at once. For this reason, we’ll first introduce the basic skills needed to use vi
. We first need to understand how the three modes of vi
work and then remember a few basic vi
commands. Then we can use these skills to write Perl or R scripts in the following chapters for Perl and R (Figure \@ref(fig:workingModeVi)).
Three modes of vi
:
vi
mkdir test_vi ## generate a new folder
+cd test_vi ## go into the new folder
+echo "Using \`ls\` we don't expect files in this folder."
+ls
+echo "No file displayed!"
+
+Using the code above, we made a new directory named test_vi
. It does not contain any files yet.
If we type vi test.py
, a new empty file is opened on screen, into which you may enter text, because the file does not exist yet (Figure \@ref(fig:ViNewFile)).
vi test.py
+
+A screenshot of the vi test.py
.
Now you are in vi mode
. To go to Input mode
, you can type i
, 'a' or 'o' (Figure \@ref(fig:ViInpuMode)).
A screenshot of the vi test.py
.
+Now you can type the content (code or other information) (Figure \@ref(fig:ViInpuType)).
+Once you are done typing, you need to go to Command mode
(Figure \@ref(fig:workingModeVi)) if you want to save and exit the file. To do this, press the ESC
button on the keyboard.
We have now written a Python script, which we can run.
+
+python test.py
+
+HPC resources enable bioinformatics analyses that require significant computational power and memory.
+An example of a job file (s1_star.sh
):
#!/bin/bash
+#SBATCH --partition=standard
+#SBATCH -t 40:00:00
+#SBATCH -n 1
+#SBATCH -c 20
+#SBATCH --job-name=s1_star_aln
+#SBATCH --mail-type=FAIL,BEGIN,END
+#SBATCH --error=%x-%J-%u.err
+#SBATCH --output=%x-%J-%u.out
+conda activate bioi611
+mkdir -p STAR_align/
+STAR --genomeDir STAR_ref \
+ --outSAMtype BAM SortedByCoordinate \
+ --twopassMode Basic \
+ --quantMode TranscriptomeSAM GeneCounts \
+ --readFilesCommand zcat \
+ --outFileNamePrefix STAR_align/N2_day1_rep1. \
+ --runThreadN 20 \
+ --readFilesIn raw_data/N2_day1_rep1.fastq.gz
+
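In the job script above the thread count appears twice, once as `-c 20` and once as `--runThreadN 20`. One way to keep the two in step, shown here as a sketch rather than as the course's own script, is to read the environment variable `SLURM_CPUS_PER_TASK`, which Slurm sets when `-c`/`--cpus-per-task` is given. The fallback value of 1 is an assumption for running outside a job.

```shell
# SLURM_CPUS_PER_TASK is set by Slurm when -c/--cpus-per-task is specified;
# fall back to 1 when the script runs outside a Slurm job.
threads="${SLURM_CPUS_PER_TASK:-1}"

# In the real script this value would be passed as: --runThreadN "$threads"
echo "STAR would be started with --runThreadN ${threads}"
```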
+To submit this job, run:
+sbatch s1_star.sh
+
+%%bash
+scratch_quota
+# shell_quota
+
+# Group quotas
+ Group name Space used Space quota % quota used
+ zt-bioi611 285.811 MB 4.000 TB 0.01%
+ zt-bioi611_mgr 98.163 GB unlimited 0
+ total 98.449 GB unlimited 0
+# User quotas
+ User name Space used Space quota % quota used % of GrpTotal
+ xie186 98.449 GB unlimited 0 100.00%
+
+%%bash
+sinfo
+
+PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
+debug up 15:00 1 maint compute-b8-60
+debug up 15:00 1 drng compute-b8-57
+debug up 15:00 1 mix compute-b8-59
+debug up 15:00 1 alloc compute-b8-58
+scavenger up 14-00:00:0 1 inval compute-b8-48
+scavenger up 14-00:00:0 4 drain$ compute-b8-[53-56]
+scavenger up 14-00:00:0 84 maint compute-a7-[5,9,14-16,28,49],compute-a8-[2-4,8-9,15,18,22,24,29,37,44,51],compute-b5-[4,16,26,29-30,33,44,51-52],compute-b6-[7,12,21,28-29,32,34,43-46,50-51,59],compute-b7-[12-13,19-22,25,27,29,31,35,37,39,42,45-46,49-50,54,56-59],compute-b8-[16,19,21,23-24,29,32,35-37,39-45,60]
+scavenger up 14-00:00:0 2 drain* compute-a7-[13,43]
+scavenger up 14-00:00:0 13 drng compute-a8-[7,14],compute-b7-[14-15,18,38,43-44],compute-b8-[2,20,51,57],gpu-b9-5
+scavenger up 14-00:00:0 2 drain compute-a7-8,gpu-b10-5
+scavenger up 14-00:00:0 182 mix bigmem-a9-[1-2,4-5],compute-a5-[3-11],compute-a7-[2-3,6-7,10,12,17-19,21-22,30,38-40,45-46,48,54-56,60],compute-a8-[5-6,10-12,16-17,19-21,25,28,31-35,39,41,45,47,50,52,54,57-59],compute-b5-[1-3,5-8,11,13-15,17-25,27-28,31-32,34-43,45-50,53-55,57-58],compute-b6-[1-5,14-15,17-20,22-24,35-36,48-49,52,54],compute-b7-[1,7-8,16-17,23-24,26,28,30,32-34,36,40-41,47-48,51-52,55,60],compute-b8-[1,15,17-18,22,25-27,30-31,33,46-47,49-50,59],gpu-b9-[1-4,6-7],gpu-b10-[1-3,6-7],gpu-b11-[1-6]
+scavenger up 14-00:00:0 93 alloc bigmem-a9-[3,6],compute-a7-[1,4,11,20,23-27,29,31-37,41-42,44,47,50-53,57-59],compute-a8-[1,13,23,26-27,30,36,38,40,42-43,46,48-49,53,55-56,60],compute-b5-[9-10,12,56,59-60],compute-b6-[6,8-11,13,16,27,30-31,58,60],compute-b7-[2-6,9-11,53],compute-b8-[3-14,28,34,38,52,58],gpu-b10-4
+scavenger up 14-00:00:0 14 idle compute-b6-[25-26,33,37-42,47,53,55-57]
+standard* up 7-00:00:00 1 inval compute-b8-48
+standard* up 7-00:00:00 4 drain$ compute-b8-[53-56]
+standard* up 7-00:00:00 82 maint compute-a7-[5,9,14-16,28,49],compute-a8-[2-4,8-9,15,18,22,24,29,37,44,51],compute-b5-[4,16,26,29-30,33,44,51-52],compute-b6-[7,12,21,28-29,32,34,43-46,50-51],compute-b7-[12-13,19-22,25,27,29,31,35,37,39,42,45-46,49-50,54,56-59],compute-b8-[16,19,21,23-24,29,32,35-37,39-45]
+standard* up 7-00:00:00 2 drain* compute-a7-[13,43]
+standard* up 7-00:00:00 11 drng compute-a8-[7,14],compute-b7-[14-15,18,38,43-44],compute-b8-[2,20,51]
+standard* up 7-00:00:00 1 drain compute-a7-8
+standard* up 7-00:00:00 159 mix compute-a5-[3-11],compute-a7-[2-3,6-7,10,12,17-19,21-22,30,38-40,45-46,48,54-56,60],compute-a8-[5-6,10-12,16-17,19-21,25,28,31-35,39,41,45,47,50,52,54,57-59],compute-b5-[1-3,5-8,11,13-15,17-25,27-28,31-32,34-43,45-50,53-55,57-58],compute-b6-[1-5,14-15,17-20,22-24,35-36,48-49,52],compute-b7-[1,7-8,16-17,23-24,26,28,30,32-34,36,40-41,47-48,51-52,55,60],compute-b8-[1,15,17-18,22,25-27,30-31,33,46-47,49-50]
+standard* up 7-00:00:00 87 alloc compute-a7-[1,4,11,20,23-27,29,31-37,41-42,44,47,50-53,57-59],compute-a8-[1,13,23,26-27,30,36,38,40,42-43,46,48-49,53,55-56,60],compute-b5-[9-10,12,56,59-60],compute-b6-[6,8-11,13,16,27,30-31],compute-b7-[2-6,9-11,53],compute-b8-[3-14,28,34,38,52]
+standard* up 7-00:00:00 10 idle compute-b6-[25-26,33,37-42,47]
+serial up 14-00:00:0 1 maint compute-b6-59
+serial up 14-00:00:0 1 mix compute-b6-54
+serial up 14-00:00:0 2 alloc compute-b6-[58,60]
+serial up 14-00:00:0 4 idle compute-b6-[53,55-57]
+gpu up 7-00:00:00 1 down$ gpu-a6-3
+gpu up 7-00:00:00 1 drng gpu-b9-5
+gpu up 7-00:00:00 1 drain gpu-b10-5
+gpu up 7-00:00:00 19 mix gpu-a6-[6,8],gpu-b9-[1-4,6-7],gpu-b10-[1-3,6-7],gpu-b11-[1-6]
+gpu up 7-00:00:00 1 alloc gpu-b10-4
+gpu up 7-00:00:00 6 idle gpu-a5-1,gpu-a6-[2,4-5,7,9]
+bigmem up 7-00:00:00 4 mix bigmem-a9-[1-2,4-5]
+bigmem up 7-00:00:00 2 alloc bigmem-a9-[3,6]
+
+%%bash
+scontrol show partition standard
+
+PartitionName=standard
+ AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
+ AllocNodes=ALL Default=YES QoS=N/A
+ DefaultTime=00:15:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
+ MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
+ Nodes=compute-a5-[3-11],compute-a7-[1-60],compute-a8-[1-60],compute-b5-[1-60],compute-b6-[1-52],compute-b7-[1-60],compute-b8-[1-56]
+ PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
+ OverTimeLimit=NONE PreemptMode=REQUEUE
+ State=UP TotalCPUs=45696 TotalNodes=357 SelectTypeParameters=NONE
+ JobDefaults=(null)
+ DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
+ TRES=cpu=45696,mem=178500G,node=357,billing=45696
+ TRESBillingWeights=CPU=1.0,Mem=0.25G
+
+%%bash
+scontrol show node compute-a5-3
+
+NodeName=compute-a5-3 Arch=x86_64 CoresPerSocket=64
+ CPUAlloc=71 CPUEfctv=128 CPUTot=128 CPULoad=68.89
+ AvailableFeatures=rhel8,amd,epyc_7702,ib
+ ActiveFeatures=rhel8,amd,epyc_7702,ib
+ Gres=(null)
+ NodeAddr=compute-a5-3 NodeHostName=compute-a5-3 Version=23.11.9
+ OS=Linux 4.18.0-553.5.1.el8_10.x86_64 #1 SMP Tue May 21 03:13:04 EDT 2024
+ RealMemory=512000 AllocMem=296960 FreeMem=326630 Sockets=2 Boards=1
+ State=MIXED ThreadsPerCore=1 TmpDisk=300000 Weight=1 Owner=N/A MCS_label=N/A
+ Partitions=scavenger,standard
+ BootTime=2024-08-08T18:32:48 SlurmdStartTime=2024-08-12T17:43:23
+ LastBusyTime=2024-08-12T17:43:19 ResumeAfterTime=None
+ CfgTRES=cpu=128,mem=500G,billing=128
+ AllocTRES=cpu=71,mem=290G
+ CapWatts=n/a
+ CurrentWatts=630 AveWatts=294
+ ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
+
+CPU Details:
+* Total CPUs: 128
+* Allocated CPUs: 71
+Memory:
+* Total Memory: 500 GB
+* Allocated Memory: 290 GB
+* Free Memory: ~319 GB
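The memory figures in this summary follow from the `scontrol` fields, which Slurm reports in megabytes. A small sketch of the arithmetic, with the values copied from the output above and assuming the summary divides by 1024:

```shell
real_mb=512000    # RealMemory
alloc_mb=296960   # AllocMem
free_mb=326630    # FreeMem

# Convert MB to GB (dividing by 1024) and round to the nearest integer.
to_gb() { awk -v mb="$1" 'BEGIN { printf "%.0f", mb / 1024 }'; }

total_gb=$(to_gb "$real_mb")
alloc_gb=$(to_gb "$alloc_mb")
free_gb=$(to_gb "$free_mb")
echo "Total: ${total_gb} GB, allocated: ${alloc_gb} GB, free: ~${free_gb} GB"
```

The total and allocated values agree with `CfgTRES` (`mem=500G`) and `AllocTRES` (`mem=290G`) in the same output.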
+%%bash
+squeue -u $USER
+
+ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
+ 7563417 standard sys/dash xie186 R 48:15 1 compute-a5-5
+
+%%bash
+scancel <JOBID>
+
+
+ # @hidden_cell
+import os
+os.chdir('/')
+
+To download the reference for this lab, we use the ENSEMBL database.
+In the ENSEMBL database, each species may have several releases of its genome build. We use release-111
in this project.
The genome sequences can be obtained from the link below:
+https://ftp.ensembl.org/pub/release-111/fasta/caenorhabditis_elegans/dna/
+The genome annotation file in GTF format can be obtained here:
+https://ftp.ensembl.org/pub/release-111/gtf/caenorhabditis_elegans/
+%%bash
+wget -O Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-111/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
+gunzip Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
+
+%%bash
+## A *fai file will be generated
+samtools faidx ref/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa
+
+%%bash
+wget -O Caenorhabditis_elegans.WBcel235.111.gtf.gz -nv https://ftp.ensembl.org/pub/release-111/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.111.gtf.gz
+gunzip Caenorhabditis_elegans.WBcel235.111.gtf.gz
+
+In this course, the reference files have been downloaded and stored in the shared folder for BIOI611:
+/scratch/zt1/project/bioi611/shared/reference/
+As you have already learned, you can create a symbolic link to it in your scratch folder:
+%%bash
+cd /scratch/zt1/project/bioi611/user/$USER
+ln -s /scratch/zt1/project/bioi611/shared/reference/ .
+
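What the `ln -s` command leaves behind can be tried safely in a throwaway directory. The sketch below mimics the shared/user layout with made-up paths under a temporary directory; the file name `genome.fa` is a placeholder, not one of the course files.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
mkdir -p "$workdir/shared/reference" "$workdir/user"
echo ">I" > "$workdir/shared/reference/genome.fa"   # placeholder file

cd "$workdir/user"
ln -s "$workdir/shared/reference/" .

# The link takes the name of the last path component of the target ("reference").
readlink reference
ls reference
```

The link occupies almost no space; the files stay in the shared folder and are only reached through the new name.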
+%%bash
+cd /scratch/zt1/project/bioi611/user/$USER
+grep '>' reference/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa
+
+>I dna:chromosome chromosome:WBcel235:I:1:15072434:1 REF
+>II dna:chromosome chromosome:WBcel235:II:1:15279421:1 REF
+>III dna:chromosome chromosome:WBcel235:III:1:13783801:1 REF
+>IV dna:chromosome chromosome:WBcel235:IV:1:17493829:1 REF
+>V dna:chromosome chromosome:WBcel235:V:1:20924180:1 REF
+>X dna:chromosome chromosome:WBcel235:X:1:17718942:1 REF
+>MtDNA dna:chromosome chromosome:WBcel235:MtDNA:1:13794:1 REF
+
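Each header line above also carries the sequence length, as the end coordinate in the third space-separated field. As a hedged sketch, the snippet below pulls the name and length out of two header lines copied from that output; it assumes the Ensembl header layout shown here and would need adjusting for other FASTA sources.

```shell
# Two header lines copied from the grep output, used as a small stand-in.
headers='>I dna:chromosome chromosome:WBcel235:I:1:15072434:1 REF
>MtDNA dna:chromosome chromosome:WBcel235:MtDNA:1:13794:1 REF'

lengths=$(printf '%s\n' "$headers" | awk '/^>/ {
    split($3, loc, ":")                 # loc[3] = sequence name, loc[5] = end coordinate
    printf "%s\t%s\n", loc[3], loc[5]
}')
printf '%s\n' "$lengths"
```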
+%%bash
+cd /scratch/zt1/project/bioi611/user/$USER
+
+grep -v '#' reference/Caenorhabditis_elegans.WBcel235.111.gtf \
+ |awk '$3=="gene"' \
+ |sed 's/.*gene_biotype "//' \
+ |sed 's/";//'|sort |uniq -c \
+ | sort -k1,1n
+
+ 22 rRNA
+ 100 antisense_RNA
+ 129 snRNA
+ 194 lincRNA
+ 261 miRNA
+ 346 snoRNA
+ 634 tRNA
+ 2128 pseudogene
+ 7764 ncRNA
+ 15363 piRNA
+ 19985 protein_coding
+
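To see what each stage of that pipeline contributes, it can be run on a few toy records. The lines below are made up for illustration (the gene IDs are not real annotation); only the pipeline itself is taken from the command above.

```shell
# Toy GTF: one comment line, three gene records and one exon record.
toy_gtf=$(
  printf '#!genome-build WBcel235\n'
  printf 'I\tWormBase\tgene\t1\t100\t.\t+\t.\tgene_id "g1"; gene_biotype "protein_coding";\n'
  printf 'I\tWormBase\texon\t1\t50\t.\t+\t.\tgene_id "g1"; gene_biotype "protein_coding";\n'
  printf 'I\tWormBase\tgene\t200\t300\t.\t-\t.\tgene_id "g2"; gene_biotype "protein_coding";\n'
  printf 'II\tWormBase\tgene\t10\t40\t.\t+\t.\tgene_id "g3"; gene_biotype "piRNA";\n'
)

counts=$(printf '%s\n' "$toy_gtf" \
  | grep -v '#' \
  | awk '$3=="gene"' \
  | sed 's/.*gene_biotype "//' \
  | sed 's/";//' \
  | sort | uniq -c \
  | sort -k1,1n)
# grep drops comment lines, awk keeps gene records, the two sed calls
# strip everything around the biotype, and sort | uniq -c counts each one.
printf '%s\n' "$counts"
```

The exon record is filtered out by the `awk` step, so each gene is counted once.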
++Source: https://useast.ensembl.org/Help/Faq?id=468eudogene
+%%bash
+mkdir -p raw_data/
+curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR156/002/SRR15694102/SRR15694102.fastq.gz -o raw_data/N2_day7_rep1.fastq.gz
+curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR156/001/SRR15694101/SRR15694101.fastq.gz -o raw_data/N2_day7_rep2.fastq.gz
+curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR156/000/SRR15694100/SRR15694100.fastq.gz -o raw_data/N2_day7_rep3.fastq.gz
+curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR156/099/SRR15694099/SRR15694099.fastq.gz -o raw_data/N2_day1_rep1.fastq.gz
+curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR156/098/SRR15694098/SRR15694098.fastq.gz -o raw_data/N2_day1_rep2.fastq.gz
+curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR156/097/SRR15694097/SRR15694097.fastq.gz -o raw_data/N2_day1_rep3.fastq.gz
+
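After a download like this it is worth checking that each file is a complete gzip archive and seeing how many reads it holds. The sketch below does both on a tiny two-read FASTQ built in a temporary directory, so it runs without the real data; it assumes the usual four-lines-per-record FASTQ layout.

```shell
# Build a two-read FASTQ as a stand-in for a downloaded file.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!!!\n' | gzip > "$tmp/toy.fastq.gz"

# gzip -t exits non-zero if the archive is truncated or corrupt.
gzip -t "$tmp/toy.fastq.gz" && echo "archive OK"

# Each FASTQ record spans four lines, so reads = lines / 4.
n_lines=$(gzip -dc "$tmp/toy.fastq.gz" | wc -l)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"
```

For the real files, the same two commands can be applied to each path under `raw_data/`.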
+%%bash
+cd /scratch/zt1/project/bioi611/user/$USER
+sbatch ../../shared/scripts/bulkRNA_s1_fastqc.sub
+
+trim galore
to remove adaptors, low-quality bases and low-quality reads.
%%bash
+cd /scratch/zt1/project/bioi611/user/$USER
+sbatch ../../shared/scripts/bulkRNA_s2_trim_galore.sub
+
+
+
+
+
+
+
+ Welcome to BIOI 611! I’m excited to have you in this course, where we will delve into the fascinating world of transcriptomics and explore the intricacies of gene and transcript-level expression analysis.
+This course focuses on the analysis of transcriptomics data, and specifically on the analysis of gene and transcript-level expression. Material covered includes transcript and gene expression estimation from RNA-seq data (short and long-read), basic experimental design and statistical methods for differential expression analysis, discovery of novel transcripts via reference-guided and de novo assembly, and the analysis of single-cell gene expression data (e.g., single-cell expression quantification, dimensionality reduction, clustering, pseudotime analysis). Prerequisite: BIOI 604. Core.
+ +
+
+
+
+
+
+ ' + escapeHtml(summary) +'
' + noResultsText + '
'); + } +} + +function doSearch () { + var query = document.getElementById('mkdocs-search-query').value; + if (query.length > min_search_length) { + if (!window.Worker) { + displayResults(search(query)); + } else { + searchWorker.postMessage({query: query}); + } + } else { + // Clear results for short queries + displayResults([]); + } +} + +function initSearch () { + var search_input = document.getElementById('mkdocs-search-query'); + if (search_input) { + search_input.addEventListener("keyup", doSearch); + } + var term = getSearchTermFromLocation(); + if (term) { + search_input.value = term; + doSearch(); + } +} + +function onWorkerMessage (e) { + if (e.data.allowSearch) { + initSearch(); + } else if (e.data.results) { + var results = e.data.results; + displayResults(results); + } else if (e.data.config) { + min_search_length = e.data.config.min_search_length-1; + } +} + +if (!window.Worker) { + console.log('Web Worker API not supported'); + // load index in main thread + $.getScript(joinUrl(base_url, "search/worker.js")).done(function () { + console.log('Loaded worker'); + init(); + window.postMessage = function (msg) { + onWorkerMessage({data: msg}); + }; + }).fail(function (jqxhr, settings, exception) { + console.error('Could not load worker.js'); + }); +} else { + // Wrap search in a web worker + var searchWorker = new Worker(joinUrl(base_url, "search/worker.js")); + searchWorker.postMessage({init: true}); + searchWorker.onmessage = onWorkerMessage; +} diff --git a/search/search_index.json b/search/search_index.json new file mode 100644 index 0000000..f719dbc --- /dev/null +++ b/search/search_index.json @@ -0,0 +1 @@ +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"BIOI611 lab Welcome to BIOI 611! 
I\u2019m excited to have you in this course, where we will delve into the fascinating world of transcriptomics and explore the intricacies of gene and transcript-level expression analysis. This course focuses on the analysis of transcriptomics data, and specifically on the analysis of gene and transcript-level expression. Material covered includes transcript and gene expression estimation from RNA-seq data (short and long-read), basic experimental design and statistical methods for differential expression analysis, discovery of novel transcripts via reference-guided and de novo assembly, and the analysis of single-cell gene expression data (e.g., single-cell expression quantification, dimensionality reduction, clustering, pseudotime analysis). Prerequisite: BIOI 604. Core.","title":"BIOI611 lab"},{"location":"#bioi611-lab","text":"Welcome to BIOI 611! I\u2019m excited to have you in this course, where we will delve into the fascinating world of transcriptomics and explore the intricacies of gene and transcript-level expression analysis. This course focuses on the analysis of transcriptomics data, and specifically on the analysis of gene and transcript-level expression. Material covered includes transcript and gene expression estimation from RNA-seq data (short and long-read), basic experimental design and statistical methods for differential expression analysis, discovery of novel transcripts via reference-guided and de novo assembly, and the analysis of single-cell gene expression data (e.g., single-cell expression quantification, dimensionality reduction, clustering, pseudotime analysis). Prerequisite: BIOI 604. Core.","title":"BIOI611 lab"},{"location":"FASTQ_PHRED/","text":"What is PHRED Scores A Phred score is a measure of the probability that a base call in a DNA sequencing read is incorrect. It is a logarithmic scale, meaning that a small change in the Phred score represents a large change in the probability of an error. 
$$Q = -10 \\cdot \\log_{10}(P)$$ Where: Q is the PHRED score. P is the probability that the base was called incorrectly. For example: Q = 20 : This corresponds to a 1 in 100 probability of an incorrect base call, or an accuracy of 99%. Q = 30 : This corresponds to a 1 in 1000 probability of an incorrect base call, or an accuracy of 99.9%. Q = 40 : This corresponds to a 1 in 10,000 probability of an incorrect base call, or an accuracy of 99.99%. # Print the header cat(sprintf(\"%-5s\\t\\t%-10s\\n\", \"Phred\", \"Prob of\")) cat(sprintf(\"%-5s\\t\\t%-10s\\n\", \"score\", \"Incorrect call\")) # Loop through Phred scores from 0 to 41 for (phred in 0:41) { cat(sprintf(\"%-5d\\t\\t%0.5f\\n\", phred, 10^(phred / -10))) } What is ASCII ASCII (American Standard Code for Information Interchange) is used to represent characters in computers. We can represent Phred scores using ASCII characters. The advantage is that the quality information can be esisly stored in text based FASTQ file. Not all ASCII characters are printable. The first printable ASCII character is ! and the decimal code for the character for ! is 33. # Store output in a vector to fit on a slide output <- c(sprintf(\"%-8s %-8s\", \"Character\", \"ASCII #\")) # Loop through ASCII values from 33 to 89 for (i in 33:89) { output <- c(output, sprintf(\"%-8s %-8d\", intToUtf8(i), i)) } # Print the output in a single block (e.g., to fit on a slide) cat(paste(output, collapse = \"\\n\")) Phred scores in FASTQ file In a FASTQ file, Phred scores are represented as ASCII characters. These characters are converted back to numeric values (PHRED scores) based on the encoding scheme used: PHRED+33 Encoding (Sanger/Illumina 1.8+) : The ASCII character for a quality score Q is calculated as: ASCII character=chr(Q+33) For example: A PHRED score of 30 is encoded as chr(30 + 33) = chr(63) , which corresponds to the ASCII character ? . 
PHRED+64 Encoding (Illumina 1.3-1.7) : The ASCII character for a quality score QQQ is calculated as: ASCII character=chr(Q+64) For example: A PHRED score of 30 is encoded as chr(30 + 64) = chr(94) , which corresponds to the ASCII character ^ . # Print the header cat(sprintf(\"%-5s\\t\\t%-10s\\t%-6s\\t\\t%-10s\\n\", \"Phred\", \"Prob. of\", \"ASCII\", \"ASCII\")) cat(sprintf(\"%-5s\\t\\t%-10s\\t%-6s\\t%-10s\\n\", \"score\", \"Error\", \"Phred+33\", \"Phred+64\")) # Loop through Phred scores from 0 to 41 for (phred in 0:41) { # Calculate the probability of error prob_error <- 10^(phred / -10) # Convert Phred scores to ASCII characters ascii_phred33 <- intToUtf8(phred + 33) ascii_phred64 <- intToUtf8(phred + 64) # Print the results in a formatted table cat(sprintf(\"%-5d\\t\\t%0.5f\\t\\t%-6s\\t\\t%-10s\\n\", phred, prob_error, ascii_phred33, ascii_phred64)) }","title":"PRED Score in Bioinformatics"},{"location":"FASTQ_PHRED/#what-is-phred-scores","text":"A Phred score is a measure of the probability that a base call in a DNA sequencing read is incorrect. It is a logarithmic scale, meaning that a small change in the Phred score represents a large change in the probability of an error. $$Q = -10 \\cdot \\log_{10}(P)$$ Where: Q is the PHRED score. P is the probability that the base was called incorrectly. For example: Q = 20 : This corresponds to a 1 in 100 probability of an incorrect base call, or an accuracy of 99%. Q = 30 : This corresponds to a 1 in 1000 probability of an incorrect base call, or an accuracy of 99.9%. Q = 40 : This corresponds to a 1 in 10,000 probability of an incorrect base call, or an accuracy of 99.99%. 
# Print the header cat(sprintf(\"%-5s\\t\\t%-10s\\n\", \"Phred\", \"Prob of\")) cat(sprintf(\"%-5s\\t\\t%-10s\\n\", \"score\", \"Incorrect call\")) # Loop through Phred scores from 0 to 41 for (phred in 0:41) { cat(sprintf(\"%-5d\\t\\t%0.5f\\n\", phred, 10^(phred / -10))) }","title":"What is PHRED Scores"},{"location":"FASTQ_PHRED/#what-is-ascii","text":"ASCII (American Standard Code for Information Interchange) is used to represent characters in computers. We can represent Phred scores using ASCII characters. The advantage is that the quality information can be esisly stored in text based FASTQ file. Not all ASCII characters are printable. The first printable ASCII character is ! and the decimal code for the character for ! is 33. # Store output in a vector to fit on a slide output <- c(sprintf(\"%-8s %-8s\", \"Character\", \"ASCII #\")) # Loop through ASCII values from 33 to 89 for (i in 33:89) { output <- c(output, sprintf(\"%-8s %-8d\", intToUtf8(i), i)) } # Print the output in a single block (e.g., to fit on a slide) cat(paste(output, collapse = \"\\n\"))","title":"What is ASCII"},{"location":"FASTQ_PHRED/#phred-scores-in-fastq-file","text":"In a FASTQ file, Phred scores are represented as ASCII characters. These characters are converted back to numeric values (PHRED scores) based on the encoding scheme used: PHRED+33 Encoding (Sanger/Illumina 1.8+) : The ASCII character for a quality score Q is calculated as: ASCII character=chr(Q+33) For example: A PHRED score of 30 is encoded as chr(30 + 33) = chr(63) , which corresponds to the ASCII character ? . PHRED+64 Encoding (Illumina 1.3-1.7) : The ASCII character for a quality score QQQ is calculated as: ASCII character=chr(Q+64) For example: A PHRED score of 30 is encoded as chr(30 + 64) = chr(94) , which corresponds to the ASCII character ^ . # Print the header cat(sprintf(\"%-5s\\t\\t%-10s\\t%-6s\\t\\t%-10s\\n\", \"Phred\", \"Prob. 
of\", \"ASCII\", \"ASCII\")) cat(sprintf(\"%-5s\\t\\t%-10s\\t%-6s\\t%-10s\\n\", \"score\", \"Error\", \"Phred+33\", \"Phred+64\")) # Loop through Phred scores from 0 to 41 for (phred in 0:41) { # Calculate the probability of error prob_error <- 10^(phred / -10) # Convert Phred scores to ASCII characters ascii_phred33 <- intToUtf8(phred + 33) ascii_phred64 <- intToUtf8(phred + 64) # Print the results in a formatted table cat(sprintf(\"%-5d\\t\\t%0.5f\\t\\t%-6s\\t\\t%-10s\\n\", phred, prob_error, ascii_phred33, ascii_phred64)) }","title":"Phred scores in FASTQ file"},{"location":"Phred_FQ/","text":"What is PHRED Scores A Phred score is a measure of the probability that a base call in a DNA sequencing read is incorrect. It is a logarithmic scale, meaning that a small change in the Phred score represents a large change in the probability of an error. $$Q = -10 \\cdot \\log_{10}(P)$$ Where: Q is the PHRED score. P is the probability that the base was called incorrectly. For example: Q = 20 : This corresponds to a 1 in 100 probability of an incorrect base call, or an accuracy of 99%. Q = 30 : This corresponds to a 1 in 1000 probability of an incorrect base call, or an accuracy of 99.9%. Q = 40 : This corresponds to a 1 in 10,000 probability of an incorrect base call, or an accuracy of 99.99%. 
# Print the header cat(sprintf(\"%-5s\\t\\t%-10s\\n\", \"Phred\", \"Prob of\")) cat(sprintf(\"%-5s\\t\\t%-10s\\n\", \"score\", \"Incorrect call\")) # Loop through Phred scores from 0 to 41 for (phred in 0:41) { cat(sprintf(\"%-5d\\t\\t%0.5f\\n\", phred, 10^(phred / -10))) } Phred Prob of score Incorrect call 0 1.00000 1 0.79433 2 0.63096 3 0.50119 4 0.39811 5 0.31623 6 0.25119 7 0.19953 8 0.15849 9 0.12589 10 0.10000 11 0.07943 12 0.06310 13 0.05012 14 0.03981 15 0.03162 16 0.02512 17 0.01995 18 0.01585 19 0.01259 20 0.01000 21 0.00794 22 0.00631 23 0.00501 24 0.00398 25 0.00316 26 0.00251 27 0.00200 28 0.00158 29 0.00126 30 0.00100 31 0.00079 32 0.00063 33 0.00050 34 0.00040 35 0.00032 36 0.00025 37 0.00020 38 0.00016 39 0.00013 40 0.00010 41 0.00008 What is ASCII ASCII (American Standard Code for Information Interchange) is used to represent characters in computers. We can represent Phred scores using ASCII characters. The advantage is that the quality information can be esisly stored in text based FASTQ file. Not all ASCII characters are printable. The first printable ASCII character is ! and the decimal code for the character for ! is 33. # Store output in a vector to fit on a slide output <- c(sprintf(\"%-8s %-8s\", \"Character\", \"ASCII #\")) # Loop through ASCII values from 33 to 89 for (i in 33:89) { output <- c(output, sprintf(\"%-8s %-8d\", intToUtf8(i), i)) } # Print the output in a single block (e.g., to fit on a slide) cat(paste(output, collapse = \"\\n\")) Character ASCII # ! 33 \" 34 # 35 $ 36 % 37 & 38 ' 39 ( 40 ) 41 * 42 + 43 , 44 - 45 . 46 / 47 0 48 1 49 2 50 3 51 4 52 5 53 6 54 7 55 8 56 9 57 : 58 ; 59 < 60 = 61 > 62 ? 63 @ 64 A 65 B 66 C 67 D 68 E 69 F 70 G 71 H 72 I 73 J 74 K 75 L 76 M 77 N 78 O 79 P 80 Q 81 R 82 S 83 T 84 U 85 V 86 W 87 X 88 Y 89 Phred scores in FASTQ file In a FASTQ file, Phred scores are represented as ASCII characters. 
These characters are converted back to numeric values (PHRED scores) based on the encoding scheme used: PHRED+33 Encoding (Sanger/Illumina 1.8+) : The ASCII character for a quality score Q is calculated as: ASCII character=chr(Q+33) For example: A PHRED score of 30 is encoded as chr(30 + 33) = chr(63) , which corresponds to the ASCII character ? . PHRED+64 Encoding (Illumina 1.3-1.7) : The ASCII character for a quality score QQQ is calculated as: ASCII character=chr(Q+64) For example: A PHRED score of 30 is encoded as chr(30 + 64) = chr(94) , which corresponds to the ASCII character ^ . # Print the header cat(sprintf(\"%-5s\\t\\t%-10s\\t%-6s\\t\\t%-10s\\n\", \"Phred\", \"Prob. of\", \"ASCII\", \"ASCII\")) cat(sprintf(\"%-5s\\t\\t%-10s\\t%-6s\\t%-10s\\n\", \"score\", \"Error\", \"Phred+33\", \"Phred+64\")) # Loop through Phred scores from 0 to 41 for (phred in 0:41) { # Calculate the probability of error prob_error <- 10^(phred / -10) # Convert Phred scores to ASCII characters ascii_phred33 <- intToUtf8(phred + 33) ascii_phred64 <- intToUtf8(phred + 64) # Print the results in a formatted table cat(sprintf(\"%-5d\\t\\t%0.5f\\t\\t%-6s\\t\\t%-10s\\n\", phred, prob_error, ascii_phred33, ascii_phred64)) } Phred Prob. of ASCII ASCII score Error Phred+33 Phred+64 0 1.00000 ! @ 1 0.79433 \" A 2 0.63096 # B 3 0.50119 $ C 4 0.39811 % D 5 0.31623 & E 6 0.25119 ' F 7 0.19953 ( G 8 0.15849 ) H 9 0.12589 * I 10 0.10000 + J 11 0.07943 , K 12 0.06310 - L 13 0.05012 . M 14 0.03981 / N 15 0.03162 0 O 16 0.02512 1 P 17 0.01995 2 Q 18 0.01585 3 R 19 0.01259 4 S 20 0.01000 5 T 21 0.00794 6 U 22 0.00631 7 V 23 0.00501 8 W 24 0.00398 9 X 25 0.00316 : Y 26 0.00251 ; Z 27 0.00200 < [ 28 0.00158 = \\ 29 0.00126 > ] 30 0.00100 ? 
^ 31 0.00079 @ _ 32 0.00063 A ` 33 0.00050 B a 34 0.00040 C b 35 0.00032 D c 36 0.00025 E d 37 0.00020 F e 38 0.00016 G f 39 0.00013 H g 40 0.00010 I h 41 0.00008 J i","title":"Phred FQ"},{"location":"Phred_FQ/#what-is-phred-scores","text":"A Phred score is a measure of the probability that a base call in a DNA sequencing read is incorrect. It is a logarithmic scale, meaning that a small change in the Phred score represents a large change in the probability of an error. $$Q = -10 \\cdot \\log_{10}(P)$$ Where: Q is the PHRED score. P is the probability that the base was called incorrectly. For example: Q = 20 : This corresponds to a 1 in 100 probability of an incorrect base call, or an accuracy of 99%. Q = 30 : This corresponds to a 1 in 1000 probability of an incorrect base call, or an accuracy of 99.9%. Q = 40 : This corresponds to a 1 in 10,000 probability of an incorrect base call, or an accuracy of 99.99%. # Print the header cat(sprintf(\"%-5s\\t\\t%-10s\\n\", \"Phred\", \"Prob of\")) cat(sprintf(\"%-5s\\t\\t%-10s\\n\", \"score\", \"Incorrect call\")) # Loop through Phred scores from 0 to 41 for (phred in 0:41) { cat(sprintf(\"%-5d\\t\\t%0.5f\\n\", phred, 10^(phred / -10))) } Phred Prob of score Incorrect call 0 1.00000 1 0.79433 2 0.63096 3 0.50119 4 0.39811 5 0.31623 6 0.25119 7 0.19953 8 0.15849 9 0.12589 10 0.10000 11 0.07943 12 0.06310 13 0.05012 14 0.03981 15 0.03162 16 0.02512 17 0.01995 18 0.01585 19 0.01259 20 0.01000 21 0.00794 22 0.00631 23 0.00501 24 0.00398 25 0.00316 26 0.00251 27 0.00200 28 0.00158 29 0.00126 30 0.00100 31 0.00079 32 0.00063 33 0.00050 34 0.00040 35 0.00032 36 0.00025 37 0.00020 38 0.00016 39 0.00013 40 0.00010 41 0.00008","title":"What is PHRED Scores"},{"location":"Phred_FQ/#what-is-ascii","text":"ASCII (American Standard Code for Information Interchange) is used to represent characters in computers. We can represent Phred scores using ASCII characters. 
The advantage is that the quality information can be esisly stored in text based FASTQ file. Not all ASCII characters are printable. The first printable ASCII character is ! and the decimal code for the character for ! is 33. # Store output in a vector to fit on a slide output <- c(sprintf(\"%-8s %-8s\", \"Character\", \"ASCII #\")) # Loop through ASCII values from 33 to 89 for (i in 33:89) { output <- c(output, sprintf(\"%-8s %-8d\", intToUtf8(i), i)) } # Print the output in a single block (e.g., to fit on a slide) cat(paste(output, collapse = \"\\n\")) Character ASCII # ! 33 \" 34 # 35 $ 36 % 37 & 38 ' 39 ( 40 ) 41 * 42 + 43 , 44 - 45 . 46 / 47 0 48 1 49 2 50 3 51 4 52 5 53 6 54 7 55 8 56 9 57 : 58 ; 59 < 60 = 61 > 62 ? 63 @ 64 A 65 B 66 C 67 D 68 E 69 F 70 G 71 H 72 I 73 J 74 K 75 L 76 M 77 N 78 O 79 P 80 Q 81 R 82 S 83 T 84 U 85 V 86 W 87 X 88 Y 89","title":"What is ASCII"},{"location":"Phred_FQ/#phred-scores-in-fastq-file","text":"In a FASTQ file, Phred scores are represented as ASCII characters. These characters are converted back to numeric values (PHRED scores) based on the encoding scheme used: PHRED+33 Encoding (Sanger/Illumina 1.8+) : The ASCII character for a quality score Q is calculated as: ASCII character=chr(Q+33) For example: A PHRED score of 30 is encoded as chr(30 + 33) = chr(63) , which corresponds to the ASCII character ? . PHRED+64 Encoding (Illumina 1.3-1.7) : The ASCII character for a quality score QQQ is calculated as: ASCII character=chr(Q+64) For example: A PHRED score of 30 is encoded as chr(30 + 64) = chr(94) , which corresponds to the ASCII character ^ . # Print the header cat(sprintf(\"%-5s\\t\\t%-10s\\t%-6s\\t\\t%-10s\\n\", \"Phred\", \"Prob. 
of\", \"ASCII\", \"ASCII\")) cat(sprintf(\"%-5s\\t\\t%-10s\\t%-6s\\t%-10s\\n\", \"score\", \"Error\", \"Phred+33\", \"Phred+64\")) # Loop through Phred scores from 0 to 41 for (phred in 0:41) { # Calculate the probability of error prob_error <- 10^(phred / -10) # Convert Phred scores to ASCII characters ascii_phred33 <- intToUtf8(phred + 33) ascii_phred64 <- intToUtf8(phred + 64) # Print the results in a formatted table cat(sprintf(\"%-5d\\t\\t%0.5f\\t\\t%-6s\\t\\t%-10s\\n\", phred, prob_error, ascii_phred33, ascii_phred64)) } Phred Prob. of ASCII ASCII score Error Phred+33 Phred+64 0 1.00000 ! @ 1 0.79433 \" A 2 0.63096 # B 3 0.50119 $ C 4 0.39811 % D 5 0.31623 & E 6 0.25119 ' F 7 0.19953 ( G 8 0.15849 ) H 9 0.12589 * I 10 0.10000 + J 11 0.07943 , K 12 0.06310 - L 13 0.05012 . M 14 0.03981 / N 15 0.03162 0 O 16 0.02512 1 P 17 0.01995 2 Q 18 0.01585 3 R 19 0.01259 4 S 20 0.01000 5 T 21 0.00794 6 U 22 0.00631 7 V 23 0.00501 8 W 24 0.00398 9 X 25 0.00316 : Y 26 0.00251 ; Z 27 0.00200 < [ 28 0.00158 = \\ 29 0.00126 > ] 30 0.00100 ? ^ 31 0.00079 @ _ 32 0.00063 A ` 33 0.00050 B a 34 0.00040 C b 35 0.00032 D c 36 0.00025 E d 37 0.00020 F e 38 0.00016 G f 39 0.00013 H g 40 0.00010 I h 41 0.00008 J i","title":"Phred scores in FASTQ file"},{"location":"basic_linux/","text":"Linux for Bioinformatics Navigating in Linux file system You are in your home directory after you log into the system and are directed to the shell command prompt. This section will show you hot to explore Linux file system using shell commands. Path To understand Linux file system, you can image it as a tree structure. In Linux, a path is a unique location of a file or a directory in the file system. For convenience, Linux file system is usually thought of in a tree structure. On a standard Linux system you will find the layout generally follows the scheme presented below. The tree of the file system starts at the trunk or slash, indicated by a forward slash ( / ). 
This directory, containing all underlying directories and files, is also called the root directory or \u201cthe root\u201d of the file system. %%bash ## In your account, you will see a folder ## with your account ID as the name cd ~ echo $HOME /home/xie186 Relative and absolute path Absolute path An absolute path is the location of a file or directory starting from the root directory ( / ). Here are some examples: /home/xie186 /home/xie186/.bashrc Relative path A relative path is a path relative to the present working directory, for example data/sample1/ and ../doc/ . If you want to get the absolute path from a relative path , you can use readlink with the parameter -f : pwd readlink -f ../ Once we enter a Linux file system, we need to know 1) where we are; 2) how to get where we want to go; 3) what files or directories we have in a particular path. Check where you are using command pwd To find out where we are, we use the pwd command. The command pwd is short for \u201cprint name of current/working directory\u201d. It returns the full path of the current directory. The pwd command is almost always used by itself: you only need to type pwd and press ENTER. %%bash pwd Listing the contents using command ls Once you know where you are, you will want to know what is in that directory. Use the command ls to list directory contents. Its syntax is: ls [option]... [file]... ls with no option lists files and directories in bare format. Bare format means that detailed information (type, size, modified date and time, permissions, links, etc.) is not shown. When you use ls by itself, it lists the files and directories in the current directory. ls ~/ ls -a ls -ld Linux command options can be combined without a space between them and with a single - (dash). For example, the following command combines the l and t options:
ls -lt ~/.bashrc -rw-r--r--. 1 xie186 zt-bioi611 1067 Aug 22 22:27 /home/xie186/.bashrc Change directory using command cd Unlike pwd , cd usually needs the path (either absolute or relative) of the directory you want to enter. If you do not provide a path, you will change to your home directory by default. Path Shortcuts Description Single dot . The current folder Double dots .. The folder above the current folder Tilde character ~ Home directory (normally /home/my_login_name) Dash - Your last working directory Here are some examples: cd ~ pwd ls ls ../ ## pwd cd ../ pwd cd ./ pwd Each directory has two entries in it at the start, with names . (a link to itself) and .. (a link to its parent directory). The exception, of course, is the root directory, where the .. directory also refers to the root directory. Sometimes you go to a new directory and do something, then you remember that you need to go back to the previous working directory. To get back instantly, use a dash. %%bash # This is our current directory pwd # Let us go to our home directory cd ~ # Check where we are pwd # Let us go to the previous working directory cd - # Check where we are now pwd /home/xie186/BIOI611_lab/docs /home/xie186 /home/xie186/BIOI611_lab/docs /home/xie186/BIOI611_lab/docs Manipulations of files and directories In Linux, manipulating files and directories is among the most frequent tasks. In this section, you will learn how to copy, rename, remove, and create files and directories. Command line cp In Linux, the command cp copies files and directories into a target directory. Command line mv Move files/folders and rename file/folders using mv : # move file from one location to another mv file1 target_directory/ # rename a file mv file1 file2 # move several files into a directory mv file1 file2 file3 target_directory/ Command mkdir The syntax is shown below: mkdir [OPTION ...] DIRECTORY ...
Multiple directories can be specified when calling mkdir : mkdir directory1 directory2 With -p , any missing parent directories are created as well: mkdir -p foo/bar/baz To define a complex directory tree with one command: mkdir -p project/{software,results,doc/{html,info,pdf},scripts} Then you can view the directory using tree . Command rm You can use rm to remove both files and directories. ## You can remove one file. rm file1 ## `rm` can remove multiple files simultaneously rm file2 file3 You can also use rm to remove a folder. To remove a folder together with everything inside it, use rm with -r . rm -r FOLDER For example: mkdir test_folder rm -r test_folder Adding -f to -r also suppresses confirmation prompts, for example for write-protected files. View text files in Linux Commands cat , more and less The command cat is short for concatenate files and print on the standard output. The syntax is shown below: cat [OPTION]... [FILE]... For a small text file, cat can be used to print the file on standard output. The command more is an older utility. When the text passed to it is too large to fit on one screen, it pages it. You can scroll down but not up. The syntax of more is shown below: more [options] file [...] The command less was written by a man who was fed up with more\u2019s inability to scroll backwards through a file. He turned less into an open source project and over time, various individuals added new features to it. less is massive now. That\u2019s why some small embedded systems have more but not less. For comparison, less\u2019s source is over 27000 lines long. more implementations are generally only a little over 2000 lines long. The syntax of less is shown below: less [options] file [...] Command head and tail The command head is used to output the first part of files. By default, it outputs the first 10 lines of the file. head [OPTION]... [FILE]... Here is an example of printing the first 5 lines of a file: head -n 5 code_perl/variable_assign.pl In fact, the letter n does not even need to be used at all.
Just the hyphen and the integer (with no intervening space) are sufficient to tell head how many lines to return. Thus, the following also prints the first 5 lines of a file: head -5 target_file.txt The command tail is used to output the last part of files. By default, it prints the last 10 lines of the file to standard output. The syntax is shown below: tail [OPTION]... [FILE]... Here is an example of printing the last 5 lines of a file: tail -5 target_file.txt To view lines from a specific point in a file, you can use -n +NUMBER with the tail command. For example, the following prints the file starting from its 2nd line: tail -n +2 target_file.txt Auto-completion In most shell environments, the programmable completion feature will also improve your typing speed. It permits typing a partial name of a command or a partial file (or directory) name, then pressing the TAB key to auto-complete it. If there is more than one possible completion, TAB will list all of them: type one or more letters, press the Tab key twice, and a list of commands starting with these letters appears. For example: type so , press the Tab key twice, and then you get the list as: soelim sort sotruss soundstretch source Demonstration of programmable completion feature. File permissions In Linux, file permissions are a vital aspect of system security and resource management. This is particularly important in bioinformatics, where large datasets and scripts are often shared across teams. Permissions determine who can read, write, or execute a file, ensuring that critical data is not accidentally modified or deleted. Three Permission Categories : User (u): The owner of the file. Group (g): A group of users who share access to the file. Other (o): All other users on the system. Permission Types : Read (r): Ability to view the contents of a file. Write (w): Ability to modify the contents of the file (deleting a file is governed by write permission on its directory).
Execute (x): Ability to run the file as a program (for scripts or executables). %%bash groups $USER animako eunal gstewar1 mjames17 mjeakle nmilza rahooper xie186 : zt-bioi611 zt-bioi611_mgr animako : zt-bioi611 eunal : zt-bioi611 gstewar1 : zt-bioi611 mjames17 : zt-bioi611 mjeakle : zt-bioi611 nmilza : zt-bioi611 rahooper : zt-bioi611 %%bash mkdir -p ~/test_permission/ touch ~/test_permission/test.txt ls -l ~/test_permission/ rm -rf ~/test_permission/ total 0 -rw-r--r--. 1 xie186 zt-bioi611 0 Sep 8 22:52 test.txt Here, the first character represents the type of file (e.g., - for a regular file or d for a directory), followed by three groups of three characters, each representing the permissions for the user , group , and others , respectively. Examples: -rwxr-xr-- : The owner has read , write , and execute permissions. The group has read and execute permissions, while others can only read the file. drwxr-x--- : A directory where the owner can read, write, and access (execute). The group can only read and access, while others have no permissions. Modify file permissions using the chmod command. Permissions can be set in two ways: Symbolic Mode: In symbolic mode, you modify permissions by referencing the categories (user, group, other) and specifying whether you're adding (+), removing (-), or setting (=) permissions. # Add execute permission for the user: chmod u+x filename # Remove write permission for the group: chmod g-w filename # Set read-only permission for others: chmod o=r filename Symbolic mode is intuitive and flexible, especially when you want to make precise adjustments to permissions without affecting other categories. This is useful for common file-sharing tasks in bioinformatics where you need to tweak access for specific collaborators. Numeric Mode (Octal representation): In numeric mode, file permissions are set using a three-digit number. Each digit represents the permissions for user , group , and other , respectively. 
The digits are calculated by adding the values of the read , write , and execute permissions: Read (r) = 4 Write (w) = 2 Execute (x) = 1 Example Permission Breakdown: Read (r), Write (w), and Execute (x) for user = 4 + 2 + 1 = 7 Read (r) and Execute (x) for group = 4 + 1 = 5 Read (r) only for others = 4 chmod 754 filename An example to help you understand executable : %%bash printf '#!/user/bin/python\\nprint(\"Hello, Welcome to Course BIOI611!\")' > ~/test.py %%bash ls -l ~/test.py python ~/test.py -rw-r--r--. 1 xie186 zt-bioi611 61 Sep 8 23:06 /home/xie186/test.py Hello, Welcome to Course BIOI611! The error message below is reported if you try to run ~/test.py directly as a program: bash: line 1: /home/xie186/test.py: No such file or directory %%bash chmod u+x ~/test.py ls -l ~/test.py python ~/test.py rm ~/test.py -rwxr--r--. 1 xie186 zt-bioi611 61 Sep 8 23:06 /home/xie186/test.py Hello, Welcome to Course BIOI611! Disk Usage of Files and Directories The du command (short for Disk Usage) is a standard Unix/Linux command used to check the disk usage of files and directories on a machine. It has many options that control the output format, and it reports the sizes of files and directories recursively.
%%bash du -h ~/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref 2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref %%bash du -ah ~/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref 2.9M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbList.fromGTF.out.tab 7.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/Log.out 936M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/SA 1.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/SAindex 3.0M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/transcriptInfo.tab 2.3M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbList.out.tab 1.5M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/geneInfo.tab 1.0K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/genomeParameters.txt 512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrLength.txt 512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrNameLength.txt 512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrStart.txt 7.6M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/exonGeTrInfo.tab 3.1M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/exonInfo.tab 2.8M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/sjdbInfo.txt 512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/chrName.txt 119M /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref/Genome 2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref %%bash du -csh /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/* 19G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/raw_data 0 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/raw_data_smart_seq 1.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_data.sub 575K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq-7478223-xie186.err 0 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq-7478223-xie186.out 8.5K 
/home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s1_download_smart_seq.sub 2.5K /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/s2_star.sub 34G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_align 2.5G /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/STAR_ref 512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/test.sub 512 /home/xie186/scratch.bioi611/Analysis/bulk_RNAseq/test.txt 55G total Symbolic link A symbolic link, similar to a shortcut, points to another file or folder. ln -s
+
+
+