UNIX Lesson 5: Text manipulating commands


In the Linux and Unix operating systems, everything is treated as a file. Whenever possible, those files are stored as human- and machine-readable text files. As a result, Linux contains a set of commands that are specialized for working with text. Here we will explore the most commonly used text-manipulating commands in Linux.



sed

sed is a powerful text stream editor. It can perform insertion, deletion, and search and replace (substitution). sed also supports regular expressions, which allow it to perform complex pattern matching.

Syntax:

sed OPTION [SCRIPT/PATTERN MATCHING] [INPUTFILE]


Figure1. cat search syntax



working with sed


1. Replacing or substituting string

The sed command is mostly used to replace text in a file. Consider the following text file as input:

$ cat geekfile.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unix linux which one you choose.
unix is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.

Now we will replace the word "unix" with "linux" in geekfile.txt. To do that we use the substitute command s, which tells sed to replace the text that matches the search pattern. By default only the first occurrence on each line is replaced. See the code below:

$ sed 's/unix/linux/' geekfile.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linux linux which one you choose.
linux is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.



2. Replacing the nth occurrence of a pattern in a line

Use a numeric flag (1, 2, etc.) after the final / to replace the first, second, etc. occurrence of a pattern in a line. The command below replaces the second occurrence of the word "unix" with "linux" in each line:

$ sed 's/unix/linux/2' geekfile.txt
unix is great os. linux is opensource. unix is free os.
learn operating system.
unix linux which one you choose.
unix is easy to learn.linux is a multiuser os.Learn unix .unix is a powerful.


3. Replacing all the occurrences of a pattern in a line

The g flag (global replacement) tells sed to replace all occurrences of the string in the line.

$ sed 's/unix/linux/g' geekfile.txt
linux is great os. linux is opensource. linux is free os.
learn operating system.
linux linux which one you choose.
linux is easy to learn.linux is a multiuser os.Learn linux .linux is a powerful.
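The three substitution variants seen so far can be compared side by side on a single line. This is a minimal sketch: the variable names are only illustrative, and the input is the first line of geekfile.txt.

```shell
# Sample line taken from geekfile.txt above.
line='unix is great os. unix is opensource. unix is free os.'

# No flag: only the first match on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/unix/linux/')
# Numeric flag 2: only the second match is replaced.
second=$(printf '%s\n' "$line" | sed 's/unix/linux/2')
# g flag: every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/unix/linux/g')

printf '%s\n' "$first" "$second" "$all"
```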


4. Replacing a string on a specific line number

You can easily restrict sed to replace the string only on a specific line number. For example, to apply the substitution only on the 3rd line:

$ sed '3 s/unix/linux/g' geekfile.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
linux linux which one you choose.
unix is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.


5. Duplicating the replaced line

The p (print) flag prints each replaced line twice on the terminal. If a line does not contain the search pattern and is not replaced, it is printed only once:

$ sed 's/unix/linux/p' geekfile.txt
linux is great os. unix is opensource. unix is free os.
linux is great os. unix is opensource. unix is free os.
learn operating system.
linux linux which one you choose.
linux linux which one you choose.
linux is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.
linux is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.


6. Printing only the replaced lines

Use the -n option along with the p flag to display only the replaced lines. The -n option suppresses sed's default output, so only the lines printed by the p flag are shown, once each:

$ sed -n 's/unix/linux/p' geekfile.txt
linux is great os. unix is opensource. unix is free os.
linux linux which one you choose.
linux is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.
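The interaction between the p flag and the -n option can be checked on a tiny input. This is only a sketch; the three sample lines are made up for illustration.

```shell
# Three sample lines; the second one does not contain "unix".
text=$(printf '%s\n' 'unix is free' 'learn operating system' 'unix or linux')

# p flag alone: replaced lines appear twice, untouched lines once.
with_p=$(printf '%s\n' "$text" | sed 's/unix/linux/p')
# -n plus p: only the replaced lines appear, once each.
only_changed=$(printf '%s\n' "$text" | sed -n 's/unix/linux/p')

printf '%s\n' "$with_p"
printf '%s\n' "$only_changed"
```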



7. Deleting lines or patterns from a particular file

Using sed you can delete a specific line of a file. You only need to add the delete command d after the number of the line you want to remove. For example, to delete the 3rd line of a file:

$ sed '3d' geekfile.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unix is easy to learn.unix is a multiuser os.Learn unix .unix is a powerful.

In addition, you can remove the lines that match a pattern by using the d command after the pattern:

$ sed '/learn/d' geekfile.txt
unix is great os. unix is opensource. unix is free os.
unix linux which one you choose.
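Both forms of deletion, by line number and by pattern, can be tried on a small input. This is a sketch with made-up sample lines.

```shell
# Four sample lines; two of them contain the word "learn".
text=$(printf '%s\n' 'first line' 'learn sed' 'third line' 'learn grep')

# Delete by line number: only line 2 is removed.
by_number=$(printf '%s\n' "$text" | sed '2d')
# Delete by pattern: every line containing "learn" is removed.
by_pattern=$(printf '%s\n' "$text" | sed '/learn/d')

printf '%s\n' "$by_number"
printf '%s\n' "$by_pattern"
```

Note that sed writes the result to standard output; the input itself is not modified.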


grep

grep is an acronym that stands for Global Regular Expression Print. grep is a Linux / Unix command line tool used to search for a string of characters in a specified file. The text search pattern is called a regular expression. When it finds a match, it prints the line with the result. The grep command is handy when searching through massive log files.

1. Searching pattern in files

To search for the 'atg' character pattern in a file, use grep as shown below:

$ grep atg fasta.fa
atgtcgatcgatagtcgatagct
atgcgcgcayacygaycacactagatcgatc
atgctagaatagctcgcgcctagagatagctcgatac

Searching in multiple files:

$grep atg fasta.fa fasta2.fa
fasta.fa:atgtcgatcgatagtcgatagct
fasta.fa:atgcgcgcayacygaycacactagatcgatc
fasta.fa:atgctagaatagctcgcgcctagagatagctcgatac
fasta2.fa:atgcgcgcayacygaycacactagatcgatc
fasta2.fa:atgctagaatagctcgcgcctagagatagctcgatac

Notice how grep also indicates which file each matching line belongs to. You can also search all the files stored in a folder by using the * wildcard:

$ grep atg *
fasta.fa:atgtcgatcgatagtcgatagct
fasta.fa:atgcgcgcayacygaycacactagatcgatc
fasta.fa:atgctagaatagctcgcgcctagagatagctcgatac
fasta2.fa:atgcgcgcayacygaycacactagatcgatc
fasta2.fa:atgctagaatagctcgcgcctagagatagctcgatac

2. grep options

2.1 To list names of matching files

To print only the filenames that match your search, use the -l (list) operator:

$grep -l  atg *
fasta.fa
fasta2.fa
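The difference between the default output and -l can be reproduced with a few throwaway files. This is a sketch: the file names a.fa, b.fa and c.fa and their contents are hypothetical and are created in a temporary directory.

```shell
# Create three small hypothetical FASTA files in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' '>seq1' 'atgtcgatcg' > "$workdir/a.fa"
printf '%s\n' '>seq2' 'ttgatcgcta' > "$workdir/b.fa"
printf '%s\n' '>seq3' 'ccatgcc'    > "$workdir/c.fa"

# With several files, each matching line is prefixed with its file name.
matches=$(cd "$workdir" && grep atg a.fa b.fa c.fa)
# With -l, only the names of the files that contain a match are listed.
files=$(cd "$workdir" && grep -l atg a.fa b.fa c.fa)

printf '%s\n' "$matches"
printf '%s\n' "$files"
rm -r "$workdir"
```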

2.2 To find whole word only

To print only the lines that contain the pattern as a complete word, use the -w (word) option. A match is reported only when it is not part of a longer word.

$ grep -w atg fasta.fa
##

But if we write the whole word...

$ grep -w atgtcgatcgatagtcgatagct fasta.fa
atgtcgatcgatagtcgatagct
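To see what counts as a whole word, the sketch below compares a plain search with -w on made-up lines. For grep, a word is a run of letters, digits and underscores.

```shell
# Sample lines: "atg" alone, inside longer words, and as a separate word.
lines=$(printf '%s\n' 'atg' 'atgtcgatcg' 'start atg stop' 'catg')

# Plain search: every line containing the substring "atg" matches.
substring=$(printf '%s\n' "$lines" | grep atg)
# -w: "atg" must be delimited by non-word characters or line boundaries.
whole_word=$(printf '%s\n' "$lines" | grep -w atg)

printf '%s\n' "$substring"
printf '%s\n' "$whole_word"
```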

2.3 To ignore case

The -i (ignore-case) option performs case-insensitive matching. By default, grep is case sensitive.

$ grep ATG fasta.fa
## 
$ grep -i ATG fasta.fa
atgtcgatcgatagtcgatagct
atgcgcgcayacygaycacactagatcgatc
atgctagaatagctcgcgcctagagatagctcgatac

2.4 Inverse grep search

Using the -v option, grep prints all lines that do not match the specified pattern:

$ grep -v atg fasta2.fa
>seq4
>seq5
>seq6
ttgatcgctattataggcttcgatagac

This option is very useful for removing the header of bioinformatics files when used in conjunction with the ^ metacharacter. Bioinformatics files often contain a header in the first lines that explains the content of the file or gives the column names. Header lines usually start with a # character. Have a look at the fasta2.fa file to see what it looks like:

head -n 4 fasta2.fa
#Header

>seq4
atgcgcgcayacygaycacactagatcgatc

The header information could affect downstream analysis, so it should be removed. For this we can take advantage of the -v option of grep:

grep -v "^#" fasta2.fa

I recommend using " " when you start searching with more complex regular expressions that use metacharacters.


>seq4
atgcgcgcayacygaycacactagatcgatc

>seq5
atgctagaatagctcgcgcctagagatagctcgatac

>seq6
ttgatcgctattataggcttcgatagac
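The role of the ^ anchor can be checked on a small input: only lines that start with # are dropped, while a # further along the line is left alone. This is a sketch with made-up content.

```shell
# Sample content: one header line and one sequence line containing a '#'.
content=$(printf '%s\n' '#Header' '>seq4' 'atgcgc#gc' '>seq5' 'ttgatc')

# -v inverts the match; ^# matches only lines that begin with '#'.
no_header=$(printf '%s\n' "$content" | grep -v '^#')

printf '%s\n' "$no_header"
```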

2.5 Display line numbers in the grep matches

Append the -n option to any grep command to show the line numbers:

$ grep -n atg *
fasta.fa:2:atgtcgatcgatagtcgatagct
fasta.fa:12:atgcgcgcayacygaycacactagatcgatc
fasta.fa:16:atgctagaatagctcgcgcctagagatagctcgatac
fasta2.fa:2:atgcgcgcayacygaycacactagatcgatc
fasta2.fa:6:atgctagaatagctcgcgcctagagatagctcgatac

2.6 Display only the matches

Append the -o option to any grep command to show only the strings that match your regular expression.

$ grep -o at fasta.fa
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
at
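Because -o prints one match per line, it is a convenient way to count how many times a pattern occurs. The sketch below assumes the wc -l command (not covered in this section) to count the lines; the sequence is made up.

```shell
# A short made-up sequence that contains "at" three times.
sequence='atgcatat'

# -o prints each non-overlapping match on its own line; wc -l counts them.
# tr -d ' ' strips the padding that some wc implementations add.
count=$(printf '%s\n' "$sequence" | grep -o at | wc -l | tr -d ' ')

printf 'occurrences of "at": %s\n' "$count"
```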

cut

The cut command in UNIX is a command for cutting out the sections from each line of files and writing the result to standard output. It can be used to cut parts of a line by byte position, character and field. Basically the cut command slices a line and extracts the text.

cut OPTIONS [FILE]

In this course we will only see how to use cut to select fields with the -f option. cut uses tab as the default field delimiter, but it can also work with other delimiters through the -d option. Important: cut does not treat spaces as a field delimiter by default.

Syntax:

$ cut -d "delimiter" -f (field number) file.txt

Here is an example: we will extract the 3rd column of the Mus_musculus.GRCm38.75_chr1.bed file, which contains the end position of each feature (in BED format the second column is the start position and the third is the end position):

$ cut -f3 Mus_musculus.GRCm38.75_chr1.bed
3054733
3054733
3054733
3102125
...

Everything works fine, because .bed files are tab-separated. Now let's see what happens with a comma-separated values (.csv) file:

$ cut -f3 Mus_musculus.GRCm38.75_chr1_bed.csv
1,3054233,3054733
1,3054233,3054733
1,3054233,3054733
1,3102016,3102125

Oops! It did not work as expected. Remember that cut only recognizes tab separators by default, and in this case we are working with a comma-separated file. When a line does not contain the delimiter, cut prints the whole line. We can tell cut which delimiter to use to separate the fields with the -d option:

$ cut -d "," -f3 Mus_musculus.GRCm38.75_chr1_bed.csv
3054733
3054733
3054733
3102125
...
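The effect of the delimiter can be reproduced without the data files. This is a sketch that feeds a single record, copied from the output above, to cut in both tab-separated and comma-separated form.

```shell
# Tab-separated record: the default delimiter works.
tsv_field=$(printf '1\t3054233\t3054733\n' | cut -f3)
# Comma-separated record without -d: no tab is found, so the whole line is printed.
csv_whole=$(printf '1,3054233,3054733\n' | cut -f3)
# Comma-separated record with -d ',': the third field is extracted.
csv_field=$(printf '1,3054233,3054733\n' | cut -d ',' -f3)

printf '%s\n' "$tsv_field" "$csv_whole" "$csv_field"
```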

sort

The sort command is used to sort a file, arranging the records in a particular order. By default, sort compares lines as text (in ASCII order in the C locale). Options allow it to sort numerically as well.

  • sort sorts the contents of a text file, line by line.
  • sort is a standard command-line program that prints the lines of its input, or the concatenation of all files listed in its argument list, in sorted order.
  • It supports sorting alphabetically, in reverse order, by number, by month, and can also remove duplicates.
  • It can also sort by items not at the beginning of the line, ignore case, and report whether a file is already sorted. Sorting is done based on one or more sort keys extracted from each line of input.

By default, the sort command follows these rules:

  1. Lines starting with a number will appear before lines starting with a letter.
  2. Lines starting with a letter that appears earlier in the alphabet will appear before lines starting with a letter that appears later in the alphabet.
  3. Lines starting with an uppercase letter will appear before lines starting with the same letter in lowercase.

The exact order depends on the locale. The output below corresponds to the C (ASCII) locale, where all uppercase letters sort before all lowercase letters.

Example order: 1, B, a, b

sort mix.txt
1242
Abc
BALL
abc
apple
bat

Note: This command does not change the input file, mix.txt. To save the result, use output redirection or the built-in -o option, which lets you specify an output file.

$ sort mix.txt > output.txt
$ sort -o output.txt mix.txt

Sorting in Reverse Order

You can perform a reverse-order sort using the -r option:

$ sort -r mix.txt  
bat
apple
abc
BALL
Abc
1242

Sorting a file numerically

To sort a file numerically, use the -n option.

cat numbers.txt
70
29
18
98
200
1400

First we will sort the file using the default parameters:

sort numbers.txt
1400
18
200
29
70
98

This is clearly not numerically sorted. Now we are going to add the -n option:

sort -n numbers.txt
18
29
70
98
200
1400

Much better! And what happens if you want to sort your numeric data in reverse order, from greatest to least? You can combine the two options -r and -n:

sort -rn numbers.txt
1400
200
98
70
29
18

Sorting a file by a column

sort can sort a table on the basis of any column by using the -k option. For example, use -k 2 to sort starting at the second column, or -k 2,2 to use only the second column as the key.

cat Mus_musculus.GRCm38.75_chr1.bed | sort -nk2 | head
1	3054233	3054733
1	3054233	3054733
1	3054233	3054733
1	3102016	3102125
1	3102016	3102125
...
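Sorting on a column still compares text unless a numeric key is requested. The sketch below uses a made-up three-line table and sets LC_ALL=C so that the result does not depend on the local language settings.

```shell
# A small made-up table: chromosome, position, gene name.
table=$(printf '%s\n' 'chr1 300 geneC' 'chr1 25 geneA' 'chr1 1000 geneB')

# Key on column 2 compared as text: "1000" sorts before "25".
by_text=$(printf '%s\n' "$table" | LC_ALL=C sort -k2,2)
# Key on column 2 compared as a number (n modifier).
by_number=$(printf '%s\n' "$table" | LC_ALL=C sort -k2,2n)

printf '%s\n' "$by_text"
printf '%s\n' "$by_number"
```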

Sort and remove duplicates

To sort and remove duplicates, pass the -u option to sort. This writes a sorted list to standard output and prints only one copy of each duplicated line, which gives us a non-redundant file. When -u is combined with -k, two lines count as duplicates if their sort keys are equal, so here we sort on both the second and the third column:

cat Mus_musculus.GRCm38.75_chr1.bed | sort -u -k2,2n -k3,3n | head
1 3054233 3054733
1 3102016 3102125
1 3205901 3207317
1 3205901 3216344
1 3205901 3671498
1 3213609 3216344
...
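When -u is used together with -k, sort decides what a duplicate is by comparing the keys, not the whole line. The sketch below illustrates this with a made-up table; wc -l is assumed only to count the resulting lines.

```shell
# Two rows share the same value in column 2 but differ in column 3.
table=$(printf '%s\n' 'chr1 100 200' 'chr1 100 250' 'chr1 50 80')

# One key: rows with the same column 2 are treated as duplicates.
one_key=$(printf '%s\n' "$table" | LC_ALL=C sort -u -k2,2n)
# Two keys: rows are duplicates only if columns 2 and 3 both match.
two_keys=$(printf '%s\n' "$table" | LC_ALL=C sort -u -k2,2n -k3,3n)

one_key_count=$(printf '%s\n' "$one_key" | wc -l | tr -d ' ')
two_keys_count=$(printf '%s\n' "$two_keys" | wc -l | tr -d ' ')
printf 'one key: %s lines, two keys: %s lines\n' "$one_key_count" "$two_keys_count"
```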

uniq

In simple words, uniq is the tool that detects adjacent duplicate lines and removes them. uniq filters out adjacent matching lines from the input and writes the filtered data to standard output or to an output file.

Syntax of the uniq command:

$ uniq [OPTION] [INPUT [OUTPUT]]

Here, INPUT refers to the input file in which repeated lines need to be filtered out; if INPUT is not specified, uniq reads from standard input. OUTPUT refers to the output file in which you can store the filtered output; as with INPUT, if OUTPUT is not specified, uniq writes to standard output.

uniq cannot detect duplicate lines unless they are adjacent. The content of the file must therefore be sorted before using uniq, or you can simply use sort -u instead of uniq.

$ head Mus_musculus.GRCm38.75_chr1.bed | cut -f2 | sort | uniq
3054233
3102016
3205901
3213609

-c, --count: prefixes each line with the number of times it occurred.

$ head Mus_musculus.GRCm38.75_chr1.bed | cut -f2 | sort | uniq -c
3 3054233
3 3102016
3 3205901
1 3213609
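Why the input has to be sorted first can be seen on a short made-up list. The sed call at the end is only there to strip the padding that uniq -c puts in front of each count.

```shell
# A made-up list in which the duplicates are not all adjacent.
values=$(printf '%s\n' b a b a a)

# Without sorting, only the two adjacent "a" lines are collapsed.
unsorted=$(printf '%s\n' "$values" | uniq)
# After sorting, all duplicates are adjacent and are removed.
sorted_unique=$(printf '%s\n' "$values" | LC_ALL=C sort | uniq)
# -c adds the number of occurrences in front of each line.
counts=$(printf '%s\n' "$values" | LC_ALL=C sort | uniq -c | sed 's/^ *//')

printf '%s\n' "$unsorted"
printf '%s\n' "$sorted_unique"
printf '%s\n' "$counts"
```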

tr

The tr command in UNIX is a command line utility for translating or deleting characters. It supports a range of transformations including uppercase to lowercase, squeezing repeating characters, deleting specific characters and basic find and replace. It can be used with UNIX pipes to support more complex translation. tr stands for translate

$ tr [OPTION] SET1 [SET2]

SET1 is the set of characters to be replaced, and SET2 is the set of characters that replace them.

1. How to convert lower case to upper case

To convert from lower case to upper case the predefined sets in tr can be used.

$cat geekfile
WELCOME TO 
GeeksforGeeks
$ cat geekfile | tr "[:lower:]" "[:upper:]"
WELCOME TO
GEEKSFORGEEKS 

or you can use character ranges:

$ cat geekfile | tr "a-z" "A-Z"
WELCOME TO
GEEKSFORGEEKS 

2. How to translate white-space to tabs

The following command translates all whitespace to tabs:

$ echo "Welcome To GeeksforGeeks" | tr '[:space:]' '\t'
Welcome    To    GeeksforGeeks

Question: what happens if you do the same using only a blank space " " instead of [:space:]?

3. How to squeeze repeated characters using -s

To remove repeated instances of a character in a set, use the -s option. In other words, you can convert multiple consecutive spaces into a single space:

$ echo "Welcome    To    GeeksforGeeks" | tr -s '[:space:]' ' '

Output:

Welcome To GeeksforGeeks

4. How to translate braces into parentheses

You can also translate the contents of a file. In this example we will replace the braces in a file with parentheses.

$cat geekfile2
{WELCOME TO} 
GeeksforGeeks
$ cat geekfile2 | tr '{}' '()'
(WELCOME TO) 
GeeksforGeeks

5. How to delete specified characters using -d option

To delete specific characters, use the -d option. This option deletes the characters in the first set specified.

$ echo "Welcome To GeeksforGeeks" | tr -d 'W'

Output:

elcome To GeeksforGeeks
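The tr operations covered in this section can be collected in one short sketch. The sets are quoted so that the shell passes them to tr unchanged, and the input strings are made up.

```shell
# Translate: lower case to upper case using predefined character classes.
upper=$(printf 'GeeksforGeeks\n' | tr '[:lower:]' '[:upper:]')
# Translate: each character of the first set maps to the same position in the second.
swapped=$(printf '{WELCOME TO}\n' | tr '{}' '()')
# Squeeze: runs of spaces are reduced to a single space.
squeezed=$(printf 'Welcome    To    Geeks\n' | tr -s ' ')
# Delete: every "W" is removed.
deleted=$(printf 'Welcome To Geeks\n' | tr -d 'W')

printf '%s\n' "$upper" "$swapped" "$squeezed" "$deleted"
```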

Exercise: Transform the DNA sequence stored in the tr_excercise file and apply the FASTA format to it.

join

The join command in UNIX is a command line utility for joining lines of two files on a common field.

Suppose you have two files that need to be combined so that the output is more informative. For example, one file could contain names and the other IDs, and the requirement is to combine both files so that each name and its corresponding ID appear on the same line. join is the tool for this: it joins two files based on a key field present in both. The fields of the input files can be separated by whitespace or by any other delimiter.

Syntax:

$ join [OPTION] FILE1 FILE2

Example : Let us assume there are two files file1.txt and file2.txt and we want to combine the contents of these two files.

// displaying the contents of first file //
$cat file1.txt
1 AAYUSH
2 APAAR
3 HEMANT
4 KARTIK

// displaying contents of second file //
$cat file2.txt
1 101
2 102
3 103
4 104

Now, in order to combine two files the files must have some common field. In this case, we have the numbering 1, 2... as the common field in both the files.

NOTE : When using join command, both the input files should be sorted on the KEY on which we are going to join the files.

$join file1.txt file2.txt
1 AAYUSH 101
2 APAAR 102
3 HEMANT 103
4 KARTIK 104

// by default join command takes the 
first column as the key to join as 
in the above case //

So, the output contains the key followed by all the matching columns from the first file file1.txt, followed by all the columns of second file file2.txt.

Using join with options

1. Using the -a FILENUM option

Sometimes one of the files contains extra lines. By default, join prints only pairable lines. For example, even if file1.txt contains an extra line that has no match in file2.txt, the output produced by join is the same:

//displaying the contents of file1.txt//
$cat file1.txt
1 AAYUSH
2 APAAR
3 HEMANT
4 KARTIK
5 DEEPAK

//displaying contents of file2.txt//
$cat file2.txt
1 101
2 102
3 103
4 104

//using join command//
$join file1.txt file2.txt
1 AAYUSH 101
2 APAAR 102
3 HEMANT 103
4 KARTIK 104

// although file1.txt has an extra line, the
output is not affected because line 5 of
file1.txt is unpairable with any line in file2.txt //

What if such unpairable lines are important and must be visible after joining the files? In such cases we can use the -a option, which also displays unpairable lines. This option requires a file number so that join knows which file you are referring to.

//using join with -a option//

//1 is used with -a to display the unpairable lines of
the first file passed//

$ join -a 1 file1.txt file2.txt
1 AAYUSH 101
2 APAAR 102
3 HEMANT 103
4 KARTIK 104
5 DEEPAK

// line 5 of the first file is
also displayed with the help of the -a option
although it is unpairable //
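The default behaviour and the -a option can be compared with two small files. This is a sketch: names.txt and ids.txt are hypothetical files created in a temporary directory.

```shell
# Two hypothetical input files, both sorted on the first field.
workdir=$(mktemp -d)
printf '%s\n' '1 AAYUSH' '2 APAAR' '5 DEEPAK' > "$workdir/names.txt"
printf '%s\n' '1 101' '2 102' > "$workdir/ids.txt"

# Default: only lines whose key appears in both files are printed.
paired=$(join "$workdir/names.txt" "$workdir/ids.txt")
# -a 1: unpairable lines from the first file are printed as well.
with_unpaired=$(join -a 1 "$workdir/names.txt" "$workdir/ids.txt")

printf '%s\n' "$paired"
printf '%s\n' "$with_unpaired"
rm -r "$workdir"
```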

2. Using the -1 and -2 options

As we already know, join combines lines of files on a common field, which is the first field by default. However, the common key is not always the first column in both files. If you want to use another field of either file, or of both files, as the common field, you can do so with the -1 and -2 command-line options. Here -1 and -2 stand for the first and second file, and each option requires a numeric argument giving the join field for the corresponding file:

Syntax:

join -1 FIELD -2 FIELD file1 file2

This will be easily understandable with the example below:

//displaying contents of first file//
$cat file1.txt
AAYUSH 1
APAAR 2
HEMANT 3
KARTIK 4

//displaying contents of second file//
$cat file2.txt
 101 1
 102 2
 103 3
 104 4

//now using join command //

$join -1 2 -2 2 file1.txt file2.txt
1 AAYUSH 101
2 APAAR 102
3 HEMANT 103
4 KARTIK 104

//here -1 2 refers to the use of the 2nd column of the
first file as the common field and -2 2
refers to the use of the 2nd column of the second
file as the common field for joining//
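Since join expects both inputs to be sorted on the join field, unsorted files need a sort step first. The sketch below uses hypothetical files in a temporary directory and sorts each of them on its second column before joining on that column.

```shell
# Two hypothetical, unsorted files with the key in the second column.
workdir=$(mktemp -d)
printf '%s\n' 'HEMANT 3' 'AAYUSH 1' 'APAAR 2' > "$workdir/names.txt"
printf '%s\n' '102 2' '101 1' '103 3' > "$workdir/ids.txt"

# Sort each file on its join field (column 2).
LC_ALL=C sort -k2,2 "$workdir/names.txt" > "$workdir/names.sorted"
LC_ALL=C sort -k2,2 "$workdir/ids.txt" > "$workdir/ids.sorted"

# Join on column 2 of the first file and column 2 of the second file.
joined=$(LC_ALL=C join -1 2 -2 2 "$workdir/names.sorted" "$workdir/ids.sorted")

printf '%s\n' "$joined"
rm -r "$workdir"
```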

column

The column command formats its input into multiple columns. It is useful for pretty-printing tabular output.

Usage:

$ column -s "Separator" -t file

-t Determine the number of columns the input contains and create a table.
-s Specify the characters used to delimit columns in the input. By default, columns are delimited by whitespace.

//displaying contents of table file//
cat table

column1 column2
1 2
3 45
43 565
234234 5454532
4223 43252345
2343214 54545
2 454325245
32542 452524

//now using column command with -t //

column -t table

column1  column2
1        2
3        45
43       565
234234   5454532
4223     43252345
2343214  54545
2        454325245
32542    452524

Exercises

Using the file forest_coverage_percent.csv stored in the Resources folder, obtain the following data:

  1. Number of countries
  2. Country column sorted alphabetically
  3. Number of years analyzed
  4. Obtain all the forest coverage data for Germany
  5. Which coverage percent was obtained by Germany in 1999?
  6. Which country has the highest coverage percent in 1998 and in the year 2000?