
Perl Introduction

Perl was created by Larry Wall, drawing on so-called shell scripting languages, and has become one of the most flexible of the major UNIX programming languages. It also runs on several non-UNIX and legacy operating systems such as Mac OS 9 and Windows. It has excellent built-in text-handling capabilities, such as regular expressions, which are incredibly useful for sequence data, and it supports several object-oriented programming styles. One of Perl’s mottos is “there is more than one way to do it”, and that’s certainly true. Of course, some ways are better than others!

Hello World!

To understand the mechanics of how a program is written and executed, the best approach is to write a very small program that really does not do much, except show that it has executed. These programs are usually referred to as ‘Hello World’ programs, because that’s what they are often programmed to print to the screen, to prove they actually ran.

In Perl, a ‘Hello World’ program is particularly simple.

#!/usr/bin/perl
print "Hello World!\n";

Editing Perl

Program code should be edited using a text editor (not a word processor such as Word). Text editors usually do not embed any special characters for formatting the fonts, font size, etc, which would interfere with program execution. Great editors for Perl are Emacs and vi. Another category of programs for editing are so-called integrated development environments (IDEs), such as Eclipse. IDEs support editing, building, debugging and running applications, and are particularly useful for languages that have more involved compile and build processes.

Both Emacs and vi are very feature rich and can be considered IDEs in their own right. For example, both have support for syntax highlighting, an important feature which displays the program code with certain elements, such as variables, reserved words, and strings, highlighted in different colors. Indentation of code lines is also automatic. Emacs has a powerful built-in scripting language, Emacs Lisp, which makes Emacs extremely flexible.

Perl scripts should be saved with a .pl extension (such as myscript.pl). This is just a convention, but most IDEs and text editors will take the hint and automatically switch to Perl highlighting mode (because of the differences between programming languages, almost every language needs its own specific highlighting).

Executing a Perl script

Stand-alone Perl scripts can be executed directly from Emacs, using the execute command. Often, however, scripts are executed from the terminal. In that case, to execute the script myscript.pl one can type

perl myscript.pl

If the execute bit is set on the script (using chmod +x myscript.pl in UNIX systems), the script can be executed simply by typing

./myscript.pl

(the leading ./ is required because the ‘current directory’ is usually not in the $PATH . See Introduction to Linux).

However, that last command will only work if the so-called shebang line has been added to the script. This tells the operating system which interpreter to invoke to run the script. This is the first line of our ‘Hello World’ script. It doesn’t work if it is the second line, and the first line is empty, for example. Also, the ‘#’ character needs to be in the first column, and no spaces are allowed between ‘#’ and ‘!’. If this sounds arcane, it is. It was introduced to UNIX in 1979.

The perl command has lots of interesting options, for example -e, -a, -F and -n, to name just a few. They are explained in the perlrun documentation (perldoc perlrun).

Basic input and output

Writing to the screen (STDOUT)

Reading the keyboard (STDOUT and STDIN)

Program structure

Outside of the shebang line, Perl does not have a lot of requirements for structuring code, but it has lots of options. In their simplest form, Perl programs are just sequences of Perl statements, each terminated by a semicolon ‘;’. Code can also be grouped in so-called blocks. These code blocks can be executed repeatedly or only under certain conditions. A code block is delimited by curly braces ‘{ … }’. Subroutines are special named blocks. Subroutines can be created for pieces of code that need to be executed from different parts of the program. If these subroutines are useful in more than one program, they can be placed in so-called modules.

Perl also allows the programmer to insert comments, which are useful to describe and document the program. The ‘comment-symbol’ is ‘#’, sometimes referred to as the ‘hash’ symbol, which we have already seen as part of the shebang line. Essentially, everything from the # up to the end of the line is ignored by the interpreter. Note that, conveniently, the interpreter will also ignore the shebang line, which is of course not legal Perl code!

A comment in a script could look like this:

#!/usr/bin/perl
# roll a die
print int(rand() * 6) + 1;

Thus, based on the comment, we know what the purpose of this script is! It is astounding how fast one forgets what a script does, and how it does it. Sometimes it is not so easy to read even your own code! Therefore, documentation is an absolute necessity. Perl also includes the POD system for documenting code, which will be introduced in the section: “POD”. It is good practice to include comments and POD with all code.
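
To tie together the building blocks described in this section, here is a minimal sketch that uses plain statements, a block that repeats, a block that runs only under a condition, a subroutine, and comments. The subroutine name roll_die and the variable names are made up for this example; use strict and my are explained under “Declaring variables”.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A subroutine is a named block that can be called from anywhere in the program.
sub roll_die {
    return int(rand(6)) + 1;    # a whole number from 1 to 6
}

# Plain statements, each terminated by a semicolon.
my $rolls = 3;
my $total = 0;

# A block that is executed repeatedly.
for my $i (1 .. $rolls) {
    $total += roll_die();
}

# A block that is executed only under a certain condition.
if ($total >= $rolls) {
    print "Total of $rolls rolls: $total\n";
}
```

Because the result is random, the printed total differs from run to run, but it always lies between 3 and 18.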

Basic Perl functions

Perl has a lot of built-in functions, discussed in various sections below. A helpful tool in dealing with functions is perldoc. With the following command you can learn more about the print statement:

perldoc -f print

This is very useful to clarify the parameters and return values of a function. perldoc followed by a file or module name will display the embedded POD documentation with nice formatting.

Variables

Variables are named data structures. In so-called statically typed languages, a variable has to be declared with a ‘type’, such as integer (1, 2 etc), real (e.g. 3.14159) or string (e.g. “blabla”), and the variable will only be able to hold that type of information. Mathematical functions will work only on variables of a certain type, for example. Also, if a variable is assigned to another variable, in these languages, they usually need to be of the same or a compatible type.

Perl does not have types in this sense. A scalar variable can hold integers, reals, or strings, and the kind of value it holds can change during the program. Therefore, it is impossible to write a function that explicitly only works with variables of type real. This of course has implications for calculations, comparisons, and other operations.
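
A small sketch of what this means in practice: the same scalar can hold a number and later a string, and Perl converts between the two depending on the operator that is applied. The variable names are illustrative only, and the my keyword used here is explained under “Declaring variables”.

```perl
use strict;
use warnings;

my $value = 42;          # holds an integer
$value = 3.14;           # now a real number
$value = "10";           # now a string

my $sum    = $value + 5;     # + treats "10" as the number 10
my $joined = $value . 5;     # . (concatenation) treats both sides as strings

print "$sum\n";      # 15
print "$joined\n";   # 105
```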

The assignment operator for variables is the equal sign =.

Variable names can be as short as one letter ($a, $b, $n etc), but should really be descriptive of the data they hold, to make the program code more readable. So instead of $n, use something like $first_name. This will slow you down in typing the program somewhat, but it is time well spent! (Emacs even allows you to autocomplete variable names, so it is not such a big issue!). Variable names are case sensitive, so $first_name is not the same variable as $First_Name.

The special value ‘undef’

Before any assignment is made to a variable, its value is ‘undef’. A variable can be tested for definedness using the defined() function.
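
As a brief illustration, the sketch below tests one variable with defined() before and after a value is assigned. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $name;                                   # declared, but no value assigned yet
my $before = defined($name) ? "defined" : "undefined";

$name = "Larry";
my $after = defined($name) ? "defined" : "undefined";

print "before: $before, after: $after\n";   # before: undefined, after: defined
```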

Declaring variables

In the basic mode of Perl, variables do not need to be declared, unlike in many other languages (Java, C, Pascal, etc). However, there is a special mode that can be turned on at the beginning of the script, using the use strict; pragma. After that, all variables need to be declared. In this mode, an undeclared variable causes an error when the script is compiled, so the program does not run at all. This is good, because undeclared variables are often just mis-typed variable names, and silently using mis-typed variables will cause severe bugs in your program. Therefore, always put use strict; at the beginning of your program (after the shebang line)!

What is the syntax to declare a variable? The short answer is, there is more than one way to do it, but for now, let’s use the somewhat strangely named my operator. Using my, you can declare a variable for example like so:

my $foo;

After this statement, we can use $foo, without typing my. For example, we could write:

$foo = "bar";

Note that string values always have to be enclosed in quotes. Double quotes " are usually used, because English text contains single quotes (for example in “can’t”). Double quotes also allow variable interpolation. print "the value is $foo!\n"; will interpolate the value of $foo, so this line will print the value is bar!. This does not happen with single quotes: print 'the value is $foo\n'; will print the value is $foo\n literally, because neither $foo nor \n is interpreted inside single quotes.
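
The difference between the two kinds of quotes can be sketched as follows; the variable names $double and $single are made up for the example.

```perl
use strict;
use warnings;

my $foo = "bar";

my $double = "the value is $foo!";   # $foo is replaced by its value
my $single = 'the value is $foo!';   # taken literally, no interpolation

print "$double\n";   # the value is bar!
print "$single\n";   # the value is $foo!
```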

Perl has a second way of declaring variables, using the our operator. This creates a so-called global variable. Global variables need to be used sometimes, but in general they are better avoided.
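
To see what use strict; catches, the sketch below compiles a tiny piece of code that contains a mis-typed variable name. It uses a string eval purely so that the error can be shown without stopping the demonstration; in an ordinary script the error would be reported as soon as the script is started. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $first_name = "Ada";

# The code in the string mis-types $first_name as $frist_name.
# With use strict in effect, compiling it fails and eval returns undef.
my $ok    = eval q{ use strict; $frist_name = "Grace"; 1 };
my $error = $@;

print "strict caught a problem: $error" unless $ok;
print "first_name is still $first_name\n";   # first_name is still Ada
```

Without use strict;, the assignment would silently create a second, unrelated variable, which is exactly the kind of bug described above.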

Variable scope

The scope of a variable is the part of the program in which it can be used. The scope of a variable declared with my is the current block, including enclosed blocks. It is not valid outside of the block it is declared in. This is called lexical scoping.

Thus, a variable declared as follows works fine:

my $foo = 10;
if (defined($foo)) {
   # enclosed block - we can use $foo
   print $foo; 
}

The following will give an error when use strict is in effect:

my $bar = 1;
if ($bar) {
   my $foo = "baz";
}
print $foo;

$foo is not declared outside of the if() block.

Note that the global variables mentioned above have a so-called global scope, which means they are valid in the entire namespace.
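
A minimal sketch of lexical scoping, using a bare block and hypothetical variable names: the outer variable is visible inside the enclosed block, while the inner variable disappears when the block ends.

```perl
use strict;
use warnings;

my $outer = "outside";
my $seen_inside;

{
    my $inner = "inside";              # exists only within these braces
    $seen_inside = "$outer/$inner";    # $outer is visible in enclosed blocks
}

# $inner is out of scope here; using it would be a compile-time error
# under use strict.
print "$seen_inside\n";   # outside/inside
```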

Scalars

Scalars hold a single value. They are of the form:

$var

i.e., they start with a dollar symbol, followed by letters (both upper case and lowercase, although lowercase is used for most variables by convention); the underscore and digits are also allowed (although a digit cannot follow the dollar sign directly).

Some legal variable names therefore would include:

$foo
$first_name
$f77
$SEQ

To assign a value to a variable, use the assignment operator:

$foo = 0.567;

There are a lot of Perl built-in functions that operate on scalars. For example, many mathematical functions (such as sqrt(), int(), sin(), cos()) and string manipulation functions (such as uc(), lc(), etc) take a scalar.

True and false values (boolean values)

Most respectable programming languages have a type called boolean, which represents only the values ‘true’ and ‘false’ (or ‘t’ and ‘f’, etc). Perl does not have a special boolean type, but it will interpret certain values of scalars as false, namely:

undef
0   # the numeric zero value
"0" # the string containing only a zero
""  # the empty string

All other values will be interpreted as true. If you need to specifically assign true or false values, most Perl programmers use 0 for false and 1 for true.
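
The sketch below shows how a few values are interpreted in a condition. The helper name truth is hypothetical. Note that the string "0" counts as false, while other non-empty strings such as "0.0" or a single space count as true.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl interprets a value in a condition.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

print truth(undef), "\n";   # false
print truth(0),     "\n";   # false
print truth(""),    "\n";   # false
print truth("0"),   "\n";   # false
print truth("0.0"), "\n";   # true: a non-empty string other than "0"
print truth(" "),   "\n";   # true: a string containing a space
print truth(-1),    "\n";   # true: any non-zero number
```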

Comparisons

The comparison operators are different depending on whether numeric or string values are compared.
The numeric comparison operator is ==. The string comparison operator is eq.

For example, to compare if $a is numerically identical to $b, use == :

my $a = 1.00;
my $b=1;
if ($a == $b) { 
    # do something
}

To compare if the two strings are identical, use eq:

my $a = "foo";
my $b = "baz";
if ($a eq $b) {
    # do something
}

Note that the string comparison is of course case sensitive. So the string “FOO” is not equal to “foo”. To perform a case insensitive comparison, you can use the uppercase command uc() (or if you wish, the lowercase command, lc()), which return the uppercase or lowercase version of a string, respectively:

if (uc($x) eq uc($y)) { print "equal\n"; } # case insensitive comparison!
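
Choosing the wrong operator can change the result. In the following sketch (variable names are illustrative), the same two values are equal as numbers but different as strings.

```perl
use strict;
use warnings;

my $x = "1.0";
my $y = "1";

my $numeric = ($x == $y) ? "equal" : "different";   # compared as the numbers 1.0 and 1
my $string  = ($x eq $y) ? "equal" : "different";   # compared character by character

print "numeric: $numeric\n";   # numeric: equal
print "string:  $string\n";    # string:  different
```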

Lists (Arrays)

Lists, also called arrays, contain multiple values that are ordered and indexed. Some data just comes naturally in such a form. For example, if we have a list of weekdays, such as Monday, Tuesday, … , Sunday, we could put them in a list structure. An array variable uses an @ symbol instead of the dollar sign, so typical variables would be @weekdays, @months, etc.

Each element in the array has an index, with the first element being, surprisingly, zero, not one. Thus, to refer to the first element, one has to use the list variable in its “scalar context” with the index of 0, with the precise notation being $weekdays[0] (note that when you access an array element, the @ sign is now a dollar sign, indicating the scalar context). $weekdays[1] would refer to Tuesday and so on. What happens if we access element 7? (Remember that the index starts with 0, so the weekdays are represented with indices 0..6). Element 7 therefore should not be defined. Most programming languages would stop execution and throw a digital tantrum. Not so Perl. Reading element 7 simply returns the special value ‘undef’, and assigning to it makes Perl extend the list for you.
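
The indexing behaviour can be tried out with a short sketch like the one below. The variable names are made up, and the qw() notation used to fill the list is explained under “Declaring lists”.

```perl
use strict;
use warnings;

my @weekdays = qw(Monday Tuesday Wednesday Thursday Friday Saturday Sunday);

my $first  = $weekdays[0];        # Monday
my $second = $weekdays[1];        # Tuesday
my $beyond = $weekdays[7];        # no such element, so this is undef

my $size_after_read = scalar @weekdays;    # still 7: reading does not add an element

$weekdays[7] = "Extra";                    # assigning beyond the end extends the array
my $size_after_write = scalar @weekdays;   # now 8

print "$first, $second\n";                                    # Monday, Tuesday
print "element 7 was undefined\n" unless defined $beyond;
print "$size_after_read then $size_after_write elements\n";   # 7 then 8 elements
```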

Declaring lists

Lists are essentially declared like scalars, using the my notation and are lexically scoped. For example,

my @array;

As with scalars, a value can be assigned to the list in the same statement as the declaration, using the following syntax:

my @weekdays = ("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday");

Perl also provides the convenient qw() operator, which builds a list of quoted strings from a whitespace-separated list of words (note that it can’t be used when the list elements themselves contain spaces):

my @weekdays = qw ( Monday Tuesday Wednesday Thursday Friday Saturday Sunday );

That saves some typing, especially of some oddly placed characters on the keyboard! :-)

A particular case is the empty list, which can be declared as follows:

my @tasks = ();

List functions

All of these functions take a list as a parameter unless otherwise noted.

exists() – tests whether a list element exists. For example,

if (exists($weekdays[7])) { print "We have a problem...\n"; }

reverse() – reverses the order of the list
sort() – sorts the array (check perldoc -f for how to do different types of sorts)
push @array, $value – push the value onto the list (append it ‘right’)
pop() – remove the last element from the list and return it, for example

my $last = pop @weekdays; # Sunday

shift() – remove the first element (at index 0) from the array and return it. The array will be one element shorter, and the indices for each element will have changed! For example,

my $first = shift @weekdays; # Monday. $weekdays[0] is now Tuesday.

unshift @array, $value – insert a value at the beginning of the list. All indices change (opposite of shift).
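
The four functions that add or remove elements at either end, together with reverse and sort, can be tried out with a sketch like the following. The list contents and variable names are made up for the example.

```perl
use strict;
use warnings;

my @queue = qw(b c d);

push @queue, "e";             # append on the right:   b c d e
unshift @queue, "a";          # insert on the left:    a b c d e
my $last  = pop @queue;       # remove from the right: e, leaving a b c d
my $first = shift @queue;     # remove from the left:  a, leaving b c d

my @backwards = reverse @queue;            # d c b
my @sorted    = sort qw(pear apple fig);   # apple fig pear

print "@queue\n";       # b c d
print "@backwards\n";   # d c b
print "@sorted\n";      # apple fig pear
```

Note that reverse and sort return new lists and leave the original list unchanged, whereas push, pop, shift and unshift modify the list itself.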

Several methods for traversing a list exist: the statement foreach will iterate through each element of the list:

my @array = (1, 2, 3, 4, 5); 
foreach my $n (@array) { 
    print $n*$n ."\n";
}

This will print the squares of the numbers 1 to 5.

A shorter way to write this uses the implicit variable $_ (this is not recommended):

my @array = (1, 2, 3, 4, 5);
foreach (@array) { 
    print $_* $_ ."\n";
}

The for() construct can also be used. This is particularly useful if one needs to know which element is currently accessed:

my @array = qw ( Monday Tuesday Wednesday Thursday Friday Saturday Sunday);

for (my $i=0; $i<@array; $i++) { 
    print "$i th element is $array[$i]\n";
}

Hashes

Hashes are also called associative arrays, and associate values with keys. For example, if we would like to store a list of countries and their capitals, we could store them in two lists, such as

my @countries = qw ( Spain France Japan Germany);
my @capitals = qw ( Madrid Paris Tokyo Berlin );

If we want to know the corresponding capital of a country, we could simply do:

print "The capital of $countries[$n] is $capitals[$n]\n";

However, what happens if we delete a country? We also need to delete the corresponding capital entry. How can we efficiently find a given capital? We always have to scan through the list to find the country, and then use the index to find the capital. For long lists, this is not very efficient. To solve such problems, the hash datatype was introduced.

Declaring a hash

A hash variable has a % symbol instead of the dollar sign of a scalar or @ of a list.
A hash can be declared as follows:

my %hash;

As with lists, a hash can be initialized with values in the declaration statement, with the following syntax:

my %capitals = ( 'Spain'  => 'Madrid',
                 'France' => 'Paris',
                 'Brazil' => 'Brasilia',
                 'Japan'  => 'Tokyo' );

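(the entire statement could also have been written on a single line). Note the => operator between each key and its value, consisting of an equal sign and a greater than sign, with no space in between. It is not an assignment; it acts like a comma that makes the key/value pairing easy to read.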

It is now a simple matter to retrieve the capital corresponding to a certain country. Note that, as with lists, this is a ‘scalar context’ and we need to use the dollar sign instead of the percent sign; instead of the [ ] brackets used with lists, { } curly braces are used to denote the key:

print $capitals{'Spain'};

will print Madrid.
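
The problems raised for the two parallel lists (deleting an entry, finding a capital) become one-liners with a hash. A minimal sketch, using the built-in delete and exists functions on a made-up set of countries:

```perl
use strict;
use warnings;

my %capitals = (
    'Spain'  => 'Madrid',
    'France' => 'Paris',
    'Japan'  => 'Tokyo',
);

$capitals{'Germany'} = 'Berlin';        # add a new key/value pair
delete $capitals{'France'};             # removes the key and its value together

my $has_france = exists $capitals{'France'} ? "yes" : "no";
my $count      = scalar keys %capitals; # number of keys left

print "Capital of Germany: $capitals{'Germany'}\n";   # Capital of Germany: Berlin
print "France still present: $has_france\n";          # France still present: no
print "Countries stored: $count\n";                   # Countries stored: 3
```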

Iterating through a hash

There are several options for iterating through a hash. Note that in a hash, the values are not stored in any pre-determined order as in lists.

To traverse a hash, the keys or values functions can be used, which return lists (of the keys and of the values in the hash, respectively).

For example,

foreach my $k (keys %capitals) { 
    print "The capital of $k is $capitals{$k}\n";
}

In this example, we iterate through the list of keys returned by the keys function. We could apply other functions to that list, such as sort:

foreach my $k (sort keys %capitals) {
    print "The capital of $k is $capitals{$k}\n";
}
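
For completeness, here is a self-contained sketch that uses both keys and values on a small, made-up hash. Sorting is used in both cases so that the output is the same on every run.

```perl
use strict;
use warnings;

my %capitals = ( 'Spain' => 'Madrid', 'France' => 'Paris', 'Japan' => 'Tokyo' );

# keys returns the keys in no particular order; sorting makes the output repeatable.
my @lines;
foreach my $country (sort keys %capitals) {
    push @lines, "$country: $capitals{$country}";
}
print "$_\n" foreach @lines;

# values returns just the values, again in no particular order.
my @cities = sort values %capitals;
print "@cities\n";   # Madrid Paris Tokyo
```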

As with an empty list, an empty hash can be created by assigning %hash=();.

References

Exercises

1) Create three scalar variables, and then print them to screen/STDOUT, delimited by a tab.
2) Create an array with 5 elements, then print the elements to an output file.
3) Open a data file (file xxx.tab from class example directory), read in the file line by line and insert the data into a hash, storing the data element in the first column as the ‘key’ and the element in the second column as the ‘value’. Print a single record to a file, displaying the key and the value.
ex. data file:
s001 Penelope
s002 Ambrose
s004 Dwight
