In the Linux and Unix operating systems, everything is treated as a file. Whenever possible, those files are stored as human- and machine-readable text files. As a result, Linux contains a set of commands that are specialized for working with text. Here we will explore the most commonly used text-manipulation commands in Linux.
sed
is a powerful text stream editor. It can perform insertion, deletion, and search and replace (substitution). sed
also supports regular expressions, which allow it to perform complex pattern matching.
Syntax:
sed OPTION [SCRIPT/PATTERN MATCHING] [INPUTFILE]
Figure 1. sed command syntax
The sed command is most often used to replace text in a file. Consider the text file below as input:
$ cat geekfile.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unix linux which one you choose.
unix is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.
Now we will replace the word "unix" with "linux" in geekfile.txt. For that we need to use the substitution command s
to tell sed we want to perform a substitution on the matched pattern. See the command below:
$sed 's/unix/linux/' geekfile.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linux linux which one you choose.
linux is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.
Use the /1
, /2
, etc. flags to replace the first, second, etc. occurrence of a pattern in a line. The command below replaces the second occurrence of the word "unix" with "linux" in each line:
$ sed 's/unix/linux/2' geekfile.txt
unix is great os. linux is opensource. unix is free os.
learn operating system.
unix linux which one you choose.
unix is easy to learn.linux is a multiuser os.Learn unix .unix is a powerful.
The substitute flag /g (global replacement) tells sed to replace all occurrences of the string in each line.
$ sed 's/unix/linux/g' geekfile.txt
linux is great os. linux is opensource. linux is free os.
learn operating system.
linux linux which one you choose.
linux is easy to learn.linux is a multiuser os.Learn linux .linux is a powerful.
You can easily restrict the sed command to replace the string on a specific line number. An example that searches for the pattern only on the 3rd line:
$ sed '3 s/unix/linux/g' geekfile.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
linux linux which one you choose.
unix is easy to learn.linux is a multiuser os.Learn linux .linux is a powerful.
The /p
print flag prints the replaced line twice on the terminal. If a line does not contain the search pattern and is not replaced, then /p
prints that line only once:
$ sed 's/unix/linux/p' geekfile.txt
linux is great os. unix is opensource. unix is free os.
linux is great os. unix is opensource. unix is free os.
learn operating system.
linux linux which one you choose.
linux linux which one you choose.
linux is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.
linux is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.
Use the -n
option along with the /p
print flag to display only the replaced lines. Here the -n
option suppresses the duplicate rows generated by the /p
flag and prints the replaced lines only once:
$ sed -n 's/unix/linux/p' geekfile.txt
linux is great os. unix is opensource. unix is free os.
linux linux which one you choose.
linux is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.
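Note that sed only writes to standard output and leaves the input file untouched. With GNU sed you can save the changes back to the file using the -i (in-place) option. A minimal sketch, using a hypothetical sample.txt (on BSD/macOS sed, -i requires an argument, e.g. -i ''):

```shell
# Create a small demo file (hypothetical name; any text file works)
printf 'unix is great os.\nlearn unix.\n' > sample.txt

# GNU sed: -i rewrites the file in place instead of printing to stdout
sed -i 's/unix/linux/g' sample.txt

cat sample.txt
# linux is great os.
# learn linux.
```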
Using sed
you can also delete a specific line of a file. You only need to append the deletion command d
to the number of the line you want to remove. For example, to delete the 3rd line of a file:
$ sed '3d' geekfile.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unix is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.
In addition, you can remove lines that contain a match by using the d
command with a matching pattern:
$ sed '/learn/d' geekfile.txt
unix is great os. unix is opensource. unix is free os.
unix linux which one you choose.
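The d command also accepts address ranges. As a quick sketch with a hypothetical demo file, '2,3d' deletes lines 2 through 3:

```shell
# Build a small demo file (hypothetical name)
printf 'line1\nline2\nline3\nline4\n' > demo.txt

# An address range ADDR1,ADDR2 before d deletes every line in that range
sed '2,3d' demo.txt
# line1
# line4
```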
grep
is an acronym that stands for Global Regular Expression Print.
grep is a Linux / Unix command line tool used to search for a string of characters in a
specified file. The text search pattern is called a regular expression. When it finds a match,
it prints the line with the result. The grep command is handy when searching through
massive log files.
To search for the 'atg' character pattern in a file you can use grep as shown below:
$grep atg fasta.fa
atgtcgatcgatagtcgatagct
atgcgcgcayacygaycacactagatcgatc
atgctagaatagctcgcgcctagagatagctcgatac
Searching in multiple files:
$grep atg fasta.fa fasta2.fa
fasta.fa:atgtcgatcgatagtcgatagct
fasta.fa:atgcgcgcayacygaycacactagatcgatc
fasta.fa:atgctagaatagctcgcgcctagagatagctcgatac
fasta2.fa:atgcgcgcayacygaycacactagatcgatc
fasta2.fa:atgctagaatagctcgcgcctagagatagctcgatac
See how grep also indicates which file each matching line belongs to. You can also search in all the files stored in a folder; for that you can use the *
wildcard:
$grep atg *
fasta.fa:atgtcgatcgatagtcgatagct
fasta.fa:atgcgcgcayacygaycacactagatcgatc
fasta.fa:atgctagaatagctcgcgcctagagatagctcgatac
fasta2.fa:atgcgcgcayacygaycacactagatcgatc
fasta2.fa:atgctagaatagctcgcgcctagagatagctcgatac
To print only the filenames that match your search, use the -l
(list) operator:
$grep -l atg *
fasta.fa
fasta2.fa
To print only the lines that contain the complete word, use the -w
(word) operator. The expression is searched for as a whole word:
$ grep -w atg fasta.fa
## (no output: "atg" never appears as a separate word)
But if we write the whole word...
$ grep -w atgtcgatcgatagtcgatagct fasta.fa
atgtcgatcgatagtcgatagct
The -i
(ignore-case) operator performs case-insensitive matching. By default, grep is case sensitive.
$ grep ATG fasta.fa
## (no output)
$ grep -i ATG fasta.fa
atgtcgatcgatagtcgatagct
atgcgcgcayacygaycacactagatcgatc
atgctagaatagctcgcgcctagagatagctcgatac
Using -v
operator, grep will print all the lines that do not match the specified pattern of characters:
$ grep -v atg fasta2.fa
>seq4
>seq5
>seq6
ttgatcgctattataggcttcgatagac
This operator is very useful for removing the header of bioinformatics files when used in conjunction with the ^
metacharacter. Bioinformatics files usually contain a header in their first lines that explains the content of the file or the names of its columns. This header always starts with a #
character. Have a look at the fasta2.fa file to see what it looks like:
head -n 4 fasta2.fa
#Header
>seq4
atgcgcgcayacygaycacactagatcgatc
The header information could affect downstream analysis, so it should be removed. For this we can take advantage of grep's -v
operator. I also recommend wrapping the pattern in quotes (" ") whenever you search for more complex regexes containing metacharacters:
grep -v "^#" fasta2.fa
>seq4
atgcgcgcayacygaycacactagatcgatc
>seq5
atgctagaatagctcgcgcctagagatagctcgatac
>seq6
ttgatcgctattataggcttcgatagac
Append the -n
operator to any grep command to show the line numbers:
$ grep -n atg *
fasta.fa:2:atgtcgatcgatagtcgatagct
fasta.fa:12:atgcgcgcayacygaycacactagatcgatc
fasta.fa:16:atgctagaatagctcgcgcctagagatagctcgatac
fasta2.fa:2:atgcgcgcayacygaycacactagatcgatc
fasta2.fa:6:atgctagaatagctcgcgcctagagatagctcgatac
Append the -o
operator to any grep command to show only the strings that match your regular expression (RE):
$ grep -o at fasta.fa
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
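Because -o prints each match on its own line, piping it into wc -l is a handy way to count the total number of matches (not just matching lines). A minimal sketch using a hypothetical two-line demo file:

```shell
# Hypothetical demo file with two short sequences
printf 'atgtcgat\nggccaatt\n' > demo.fa

# Each match becomes one output line, so wc -l counts the matches
grep -o at demo.fa | wc -l
# 3
```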
The cut
command in UNIX is used for cutting out sections from each line of a file and writing the result to standard output. It can be used to cut parts of a line by byte position, character, or field. Basically, the cut command slices a line and extracts the text.
cut OPTIONS [FILE]
In this course we are going to see how to use cut only to select by field, with the -f
option. cut
uses tab as the default field delimiter but can also work with other delimiters by using the -d
option. Important: cut does not treat a space as a delimiter by default.
Syntax:
$ cut -d "delimiter" -f (field number) file.txt
Here is an example: we will extract the 3rd column of the Mus_musculus.GRCm38.75_chr1.bed file, which contains the end position of the genes on chromosome 1 of the Mus musculus genome:
$ cut -f3 Mus_musculus.GRCm38.75_chr1.bed
3054733
3054733
3054733
3102125
...
Everything works fine; .bed files are tab-separated files. Now we will see what happens with a comma-separated values (.csv) file:
$ cut -f3 Mus_musculus.GRCm38.75_chr1_bed.csv
1,3054233,3054733
1,3054233,3054733
1,3054233,3054733
1,3102016,3102125
Oops! It did not work as we expected. Remember that cut only recognizes tabs as separators by default, and in this case we are working with a comma-separated file. We can tell cut
which delimiter it should use to separate the fields by using the -d
option:
$ cut -d "," -f3 Mus_musculus.GRCm38.75_chr1_bed.csv
3054733
3054733
3054733
3102125
...
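The -f option also accepts comma-separated lists and ranges, so you can pull several columns at once. A small sketch using a hypothetical demo.csv:

```shell
# Hypothetical comma-separated demo file
printf '1,3054233,3054733\n1,3102016,3102125\n' > demo.csv

# -f2,3 selects fields 2 and 3; a range like -f1-2 works too
cut -d "," -f2,3 demo.csv
# 3054233,3054733
# 3102016,3102125
```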
sort
command is used to sort a file, arranging the records in a particular order. By default, the sort command sorts a file assuming its contents are ASCII. Using options, the sort command can also sort numerically.
- sort sorts the contents of a text file, line by line, printing the lines of its input (or the concatenation of all files listed in its argument list) in sorted order.
- It supports sorting alphabetically, in reverse order, by number, by month, and can also remove duplicates.
- It can also sort by items not at the beginning of the line, ignore case, and report whether a file is sorted. Sorting is done based on one or more sort keys extracted from each line of input.
- Lines starting with a number will appear before lines starting with a letter.
- Lines starting with a letter that appears earlier in the alphabet will appear before lines starting with a letter that appears later in the alphabet.
- Lines starting with a uppercase letter will appear before lines starting with the same letter in lowercase.
Example of the resulting order: 1, then B, then a, then b.
sort mix.txt
1242
Abc
BALL
abc
apple
bat
Note: this command does not actually change the input file, i.e. mix.txt. To save the result you can use output redirection or the built-in -o
option, which allows you to specify an output file.
$ sort mix.txt > output.txt
$ sort -o output.txt mix.txt
You can perform a reverse-order sort using the -r
flag, which sorts the input file in reverse order.
$ sort -r mix.txt
bat
apple
abc
BALL
Abc
1242
To sort a file numerically by arithmetic value, use the -n option.
cat numbers.txt
70
29
18
98
200
1400
First we will sort the file using the default parameters:
sort numbers.txt
1400
18
200
29
70
98
This is clearly not numerically sorted. Now we are going to add the -n option:
sort -n numbers.txt
18
29
70
98
200
1400
Much better! And what happens if you want to sort your numeric data in reverse order, from greatest to least? You can use a combination of the two options, -r and -n:
sort -rn numbers.txt
1400
200
98
70
29
18
sort
also lets you sort a table by any column number, using the -k
option. For example, use -k 2
to sort on the second column:
cat Mus_musculus.GRCm38.75_chr1.bed | sort -nk2 | head
1 3054233 3054733
1 3054233 3054733
1 3054233 3054733
1 3102016 3102125
1 3102016 3102125
...
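The -k option accepts a start,end pair, so -k3,3 restricts the sort key to column 3 only; combined with -rn you get a descending numeric sort on that column. A quick sketch with a hypothetical tab-separated demo file:

```shell
# Hypothetical three-column demo file (tab-separated)
printf '1\t10\t300\n1\t20\t100\n1\t5\t200\n' > demo.bed

# -k3,3 keys on column 3 only; -rn sorts it numerically, largest first
sort -k3,3 -rn demo.bed
# prints the rows in the order 300, 200, 100
```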
To sort and remove duplicates, pass the -u option to sort. This will write a sorted list to standard output with the duplicates removed, printing only unique lines. This option is helpful because removing the duplicates gives us a non-redundant file:
cat Mus_musculus.GRCm38.75_chr1.bed | sort -unk2 | head
1 3054233 3054733
1 3102016 3102125
1 3205901 3207317
1 3205901 3216344
1 3205901 3671498
1 3213609 3216344
...
In simple words, uniq
is a tool that detects adjacent duplicate lines and deletes them. uniq filters out the adjacent matching lines from the input file (required as an argument) and prints or writes the filtered data to stdout or an output file.
Syntax of uniq Command :
$uniq [OPTION] [INPUT[OUTPUT]]
Here, INPUT refers to the input file whose repeated lines need to be filtered out, and OUTPUT refers to the file in which you can store the filtered output generated by the uniq command. If OUTPUT isn't specified, uniq writes to standard output.
uniq
isn’t able to detect duplicate lines unless they are adjacent. The content of the file must therefore be sorted before using uniq, or you can simply use sort -u instead of uniq.
$ head Mus_musculus.GRCm38.75_chr1.bed | cut -f2 | sort | uniq
3054233
3102016
3205901
3213609
-c (count)
: tells how many times a line was repeated by displaying the count as a prefix to the line.
$ head Mus_musculus.GRCm38.75_chr1.bed | cut -f2 | sort | uniq -c
3 3054233
3 3102016
3 3205901
1 3213609
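A classic idiom builds on this: sort the input, count adjacent duplicates with uniq -c, then sort the counts numerically in reverse to get a frequency table, most common item first. A minimal sketch with hypothetical data:

```shell
# Hypothetical list with repeated items
printf 'b\na\nb\nc\nb\na\n' > items.txt

# sort groups the duplicates, uniq -c counts them, sort -rn ranks by count
sort items.txt | uniq -c | sort -rn
# (prints b with count 3, then a with count 2, then c with count 1)
```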
The tr
command in UNIX is a command line utility for translating or deleting characters. It supports a range of transformations including uppercase to lowercase, squeezing repeating characters, deleting specific characters and basic find and replace. It can be used with UNIX pipes to support more complex translation. tr
stands for translate.
$ tr [OPTION] SET1 [SET2]
SET1 contains the character(s) to be replaced and SET2 the replacement
character(s).
To convert from lower case to upper case, the predefined sets in tr can be used:
$cat geekfile
WELCOME TO
GeeksforGeeks
$cat geekfile | tr "[:lower:]" "[:upper:]"
WELCOME TO
GEEKSFORGEEKS
or you can use metacharacters:
$cat geekfile | tr "[a-z]" "[A-Z]"
WELCOME TO
GEEKSFORGEEKS
The following command will translate all whitespace to tabs:
echo "Welcome To GeeksforGeeks" | tr '[:space:]' '\t'
Welcome To GeeksforGeeks
Question: what happens if you try to do the same but use a blank space " " instead of [:space:]?
To squeeze repeated instances of a character in a set, use the -s
option. In other words, you can convert multiple consecutive spaces into a single space:
$ echo "Welcome To GeeksforGeeks" | tr -s '[:space:]' ' '
Output:
Welcome To GeeksforGeeks
You can also translate from and to a file. In this example we will replace the braces in a file with parentheses.
$cat geekfile2
{WELCOME TO}
GeeksforGeeks
cat geekfile2 | tr '{}' '()'
(WELCOME TO)
GeeksforGeeks
To delete specific characters use the -d
option. This option deletes the characters in the first set specified.
$ echo "Welcome To GeeksforGeeks" | tr -d 'W'
Output:
elcome To GeeksforGeeks
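The -d option also accepts ranges and character classes, which is handy for stripping whole categories of characters, for example deleting every digit. A small sketch:

```shell
# Delete all digits from the input; '0-9' is a range ('[:digit:]' also works)
echo "chr1:3054233-3054733" | tr -d '0-9'
# chr:-
```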
Exercise: transform and apply the FASTA format to the DNA sequence stored in the tr_excercise file.
The join command in UNIX is a command line utility for joining lines of two files on a common field.
Suppose you have two files and you need to combine them in a way that the output makes even more sense. For example, one file could contain names and the other IDs, and the requirement is to combine both files so that the names and the corresponding IDs appear on the same line. The join command is the tool for that: it joins two files based on a key field present in both. The input files can be separated by whitespace or any delimiter. Syntax:
$join [OPTION] FILE1 FILE2
Example : Let us assume there are two files file1.txt and file2.txt and we want to combine the contents of these two files.
// displaying the contents of first file //
$cat file1.txt
1 AAYUSH
2 APAAR
3 HEMANT
4 KARTIK
// displaying contents of second file //
$cat file2.txt
1 101
2 102
3 103
4 104
Now, in order to combine two files, the files must have some common field. In this case, we have the numbering 1, 2, ... as the common field in both files.
NOTE : When using join command, both the input files should be sorted on the KEY on which we are going to join the files.
$join file1.txt file2.txt
1 AAYUSH 101
2 APAAR 102
3 HEMANT 103
4 KARTIK 104
// by default join command takes the
first column as the key to join as
in the above case //
So, the output contains the key followed by all the matching columns from the first file file1.txt, followed by all the columns of second file file2.txt.
Sometimes one of the files may contain extra lines. In that case join, by default, prints only the pairable lines. For example, even if file1.txt contains an extra line with no match in file2.txt, the output produced by join is the same:
//displaying the contents of file1.txt//
$cat file1.txt
1 AAYUSH
2 APAAR
3 HEMANT
4 KARTIK
5 DEEPAK
//displaying contents of file2.txt//
$cat file2.txt
1 101
2 102
3 103
4 104
//using join command//
$join file1.txt file2.txt
1 AAYUSH 101
2 APAAR 102
3 HEMANT 103
4 KARTIK 104
// although file1.txt has an extra line the
output is not affected because line 5 of
file1.txt was unpairable with any line in file2.txt//
What if such unpairable lines are important and must be visible after joining the files? In such cases we can use the -a
option with the join command, which displays such unpairable lines. This option requires a file number so that the tool knows which file you are referring to.
//using join with -a option//
//1 is used with -a to display the contents of
first file passed//
$join file1.txt file2.txt -a 1
1 AAYUSH 101
2 APAAR 102
3 HEMANT 103
4 KARTIK 104
5 DEEPAK
//line 5 of the first file is
also displayed with the help of the -a option
although it is unpairable//
As we already know, join combines the lines of files on a common field, which is the first field by default. However, the common key need not always be the first column; join provides options for when the common key is elsewhere. If you want to use the second field of either file (or both files) as the common field for the join, you can do so with the -1
and -2
command line options. Here -1
and -2
represent the first and second file, and each option takes a numeric argument that refers to the joining field for the corresponding file:
Syntax:
join -1[field] -2[field] file1 file2
This will be easily understandable with the example below:
//displaying contents of first file//
$cat file1.txt
AAYUSH 1
APAAR 2
HEMANT 3
KARTIK 4
//displaying contents of second file//
$cat file2.txt
101 1
102 2
103 3
104 4
//now using join command //
$join -1 2 -2 2 file1.txt file2.txt
1 AAYUSH 101
2 APAAR 102
3 HEMANT 103
4 KARTIK 104
//here -1 2 means use column 2 of the
first file as the common field and -2 2
means use column 2 of the second
file as the common field for joining//
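join assumes whitespace-separated fields by default; for delimited files such as CSVs you can set the separator with the -t option. A small sketch with hypothetical files:

```shell
# Hypothetical comma-separated files sharing the first field as the key
printf '1,AAYUSH\n2,APAAR\n' > names.csv
printf '1,101\n2,102\n' > ids.csv

# -t sets the field separator for both input and output
join -t "," names.csv ids.csv
# 1,AAYUSH,101
# 2,APAAR,102
```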
The column
command formats its input into multiple columns, which is useful for pretty-printing displays.
Usage:
$ column -s "Separator" -t file
-t
Determine the number of columns the input contains and create a table.
-s
Columns are delimited with whitespace, by default, or with the characters
supplied using the -s
option
//displaying contents of table file//
cat table
column1 column2
1 2
3 45
43 565
234234 5454532
4223 43252345
2343214 54545
2 454325245
32542 452524
//now using the column command with -t//
column -t table
column1 column2
1 2
3 45
43 565
234234 5454532
4223 43252345
2343214 54545
2 454325245
32542 452524
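Putting -s and -t together lets you pretty-print a delimited file, for example the CSV files used earlier. A small sketch with a hypothetical fruit.csv (the exact spacing of the output depends on the column widths):

```shell
# Hypothetical comma-separated file
printf 'name,count\napple,12\nbanana,7\n' > fruit.csv

# -s "," splits on commas; -t aligns the fields into a table
column -s "," -t fruit.csv
```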
Using the file forest_coverage_percent.csv stored in the Resources folder, obtain the following data:
- Number of countries
- The country column sorted alphabetically
- Number of years analyzed
- All the forest coverage data for Germany
- What coverage percentage did Germany obtain in 1999?
- Which country had the highest coverage percentage in 1998 and in the year 2000?