The structure of a program depends heavily on the means by which you model your data with appropriate variables.
Where variables allow the abstract manipulation of data, the values they hold make programs concrete and useful. The more accurate your values, the better your programs. These values are data--your aunt's name and address, the distance between your office and a golf course on the moon, or the weight of all of the cookies you've eaten in the past year. Within your program, the rules regarding the format of that data are often strict. Effective programs need effective (simple, fast, most compact, most efficient) ways of representing their data.
A string is a piece of textual or binary data with no particular formatting or contents. It could be your name, the contents of an image file, or your program itself. A string has meaning in the program only when you give it meaning.
To represent a literal string in your program, surround it with a pair of quoting characters. The most common string delimiters are single and double quotes:
Characters in a single-quoted string represent themselves literally, with two exceptions. Embed a single quote inside a single-quoted string by escaping the quote with a leading backslash:
You must also escape any backslash at the end of the string to avoid escaping the closing delimiter and producing a syntax error:
Any other backslash will be part of the string as it appears, unless two backslashes are adjacent, in which case the first will escape the second:
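A minimal sketch of the single-quoting rules described above. The variable names and string contents are illustrative only:

```perl
use strict;
use warnings;

# An escaped single quote inside a single-quoted string.
my $reminder = 'Don\'t forget to escape the single quote!';

# A backslash at the end of the string must itself be escaped.
my $trailing = 'This string ends with a backslash: \\';

# Any other backslash is literal, so this is two characters, not a tab.
my $not_a_tab = '\t';

# Two adjacent backslashes collapse to one, so these two strings are equal.
my $same = 'Modern \ Perl' eq 'Modern \\ Perl';
```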
A double-quoted string has several more special characters available. For example, you may encode otherwise invisible whitespace characters in the string:
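As a short illustration (the variable names are made up for this sketch), a few of the whitespace and control escapes available in double-quoted strings:

```perl
use strict;
use warnings;

my $tab       = "\t";   # horizontal tab
my $newline   = "\n";   # newline
my $carriage  = "\r";   # carriage return
my $formfeed  = "\f";   # form feed
my $backspace = "\b";   # backspace

# Each escape produces exactly one character.
my @lengths = map { length } $tab, $newline, $carriage, $formfeed, $backspace;
```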
This demonstrates a useful principle: the syntax used to declare a string may vary. You can represent a tab within a string with the \t escape or by typing a tab directly. Within Perl's purview, both strings behave the same way, even though the specific representation of the string may differ in the source code.
A string declaration may cross logical newlines; these two declarations are equivalent:
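For instance, assuming the source file uses plain newline line endings, the following two strings compare equal:

```perl
use strict;
use warnings;

# The newline written as an escape sequence...
my $escaped = "two\nlines";

# ...and the same newline typed directly into the literal.
my $literal = "two
lines";

my $equivalent = $escaped eq $literal;
```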
Escape sequences such as \n are often easier to read than their literal whitespace equivalents.
Perl strings have variable lengths. As you manipulate and modify strings, Perl will change their sizes as appropriate. For example, you can combine multiple strings into a larger string with the concatenation operator (.):
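A small sketch of concatenation; the kitten's name is only sample data:

```perl
use strict;
use warnings;

# Build one string from three pieces.
my $kitten = 'Choco' . ' ' . 'Spidermonkey';

# Appending grows an existing string in place.
my $growing = 'Choco';
$growing .= ' ';
$growing .= 'Spidermonkey';
```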
This is effectively the same as if you'd initialized the string all at once.
You may also interpolate the value of a scalar variable or the values of an array within a double-quoted string, such that the current contents of the variable become part of the string as if you'd concatenated them:
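A sketch of interpolation next to the equivalent concatenation. The names and values are hypothetical:

```perl
use strict;
use warnings;

my $name    = 'Alice';
my $address = '123 Example Street';
my @pets    = ('cat', 'dog');

# Scalar interpolation...
my $factoid = "$name lives at $address!";

# ...produces the same string as explicit concatenation.
my $concatenated = $name . ' lives at ' . $address . '!';

# Array elements are joined with a single space by default.
my $pet_line = "Pets: @pets";
```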
Include a literal double-quote inside a double-quoted string by escaping it (that is, preceding it with a leading backslash):
When repeated backslashing becomes unwieldy, use an alternate quoting operator by which you can choose an alternate string delimiter. The q operator indicates single quoting, while the qq operator provides double quoting behavior. The character immediately following the operator determines the characters used to delimit the strings. If the character is the opening character of a balanced pair--such as opening and closing braces--the closing character will be the final delimiter. Otherwise, the character itself will be both the starting and ending delimiter.
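The following sketch contrasts a backslashed double-quoted string with the q and qq forms. The delimiters chosen here are arbitrary:

```perl
use strict;
use warnings;

# Escaping every embedded double quote is noisy.
my $backslashed = "\"Ouch,\" he cried. \"That hurt!\"";

# qq with a balanced pair of braces gives double-quote behavior.
my $quote = qq{"Ouch," he cried. "That hurt!"};

# q with a non-paired character uses that character at both ends.
my $reminder = q^Don't escape the single quote!^;

# q with braces gives single-quote behavior, so $5 is not interpolated.
my $price_tag = q{Only $5 today};
```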
When declaring a complex string with a series of embedded escapes is tedious, use the heredoc syntax to assign one or more lines of a string:
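As an illustration (the text of the blurb is invented for this sketch), a single-quoted heredoc looks like this:

```perl
use strict;
use warnings;

# Everything up to the END_BLURB line becomes the string, newlines included.
my $blurb = <<'END_BLURB';
He looked up. "Time is never on our side, my child."
The sign behind him read: cookies, $5 per dozen.
END_BLURB
```

Because the identifier is in single quotes, the `$5` stays literal rather than being interpolated.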
The <<'END_BLURB' syntax has three parts. The double angle-brackets introduce the heredoc. The quotes determine whether the heredoc obeys single- or double-quoted behavior. The default behavior is double-quoted interpolation. END_BLURB is an arbitrary identifier which the Perl 5 parser uses as the ending delimiter.
Be careful; regardless of the indentation of the heredoc declaration itself, the ending delimiter must start at the beginning of the line:
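A sketch of a heredoc declared inside an indented block. The function name and text are hypothetical:

```perl
use strict;
use warnings;

sub greeting {
    my $name = shift;

    # The declaration is indented, but the body and the ending
    # delimiter are not: the delimiter must start in column one.
    return <<"END_GREETING";
Hello, $name!
Welcome aboard.
END_GREETING
}

my $text = greeting('Alice');
```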
Using a string in a non-string context will induce coercion (coercion).
Unicode is a system for representing the characters of the world's written languages. While most English text uses a character set of only 128 characters (which requires seven bits of storage and fits nicely into eight-bit bytes), it's naïve to believe that you won't someday need an umlaut.
Perl 5 strings can represent either of two separate but related data types:
- Sequences of Unicode characters: each character has a codepoint, a unique number which identifies it in the Unicode character set.
- Sequences of octets: binary data is a sequence of octets--8-bit numbers, each of which can represent a number between 0 and 255.
Unicode strings and binary strings look similar. Each has a length(). Each supports standard string operations such as concatenation, splicing, and regular expression processing. Any string which is not purely binary data is textual data, and should be a sequence of Unicode characters.
However, because of how your operating system represents data on disk or from users or over the network--as sequences of octets--Perl can't know if the data you read is an image file or a text document or anything else. By default, Perl treats all incoming data as sequences of octets. You must add a specific meaning to that data.
A Unicode string is a sequence of octets which represents a sequence of characters. A Unicode encoding maps octet sequences to characters. Some encodings, such as UTF-8, can encode all of the characters in the Unicode character set. Other encodings represent a subset of Unicode characters. For example, ASCII encodes plain English text with no accented characters, while Latin-1 can represent text in most languages which use the Latin alphabet.
To avoid most Unicode problems, always decode from the appropriate encoding at the inputs of your program and encode to the appropriate encoding at its outputs.
When you tell Perl that a specific filehandle (files) works with encoded text, Perl will convert the incoming octets to Unicode strings automatically. To do this, add an IO layer to the mode of the open builtin. An IO layer wraps around input or output and converts the data. In this case, the :utf8 layer decodes UTF-8 data:
You may also modify an existing filehandle with binmode, whether for input or output:
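The following sketch exercises both approaches against a scratch file. The use of File::Temp and the sample text are assumptions made only to keep the example self-contained:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; removed automatically at exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "Jos\x{e9}";    # four characters

# Writing through the :utf8 layer encodes characters as UTF-8 octets.
open my $out, '>:utf8', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out;

# Reading with no layer yields the raw octets.
open my $raw, '<', $path or die "Cannot read $path: $!";
binmode $raw;
my $octets = do { local $/; <$raw> };
close $raw;

# binmode adds the layer to a filehandle that is already open.
open my $in, '<', $path or die "Cannot read $path: $!";
binmode $in, ':utf8';
my $decoded = do { local $/; <$in> };
close $in;
```

Here `$octets` holds the five octets stored on disk, while `$decoded` holds the four characters they represent.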
Without the :utf8 layer, printing Unicode strings to a filehandle will result in a warning (Wide character in %s), because files contain octets, not Unicode characters.
The core module Encode provides a function named decode() to convert a scalar containing data to a Unicode string. The corresponding encode() function converts from Perl's internal encoding to the desired output encoding:
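A minimal sketch of a decode/encode round trip. The input octets are written inline here to stand in for data read from outside the program:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Five octets: the UTF-8 encoding of a four-character name.
my $from_utf8 = "Jos\xC3\xA9";

# Decode the octets into a string of Unicode characters.
my $unicode_string = decode('UTF-8', $from_utf8);

# Encode the characters again for output, in two different encodings.
my $as_utf8   = encode('UTF-8', $unicode_string);
my $as_latin1 = encode('ISO-8859-1', $unicode_string);
```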
You may include Unicode characters in your programs in three ways. The easiest is to use the utf8 pragma (pragmas), which tells the Perl parser to interpret the rest of the source code file with the UTF-8 encoding. This allows you to use Unicode characters in strings and identifiers:
To write this code, your text editor must understand UTF-8 and you must save the file with the appropriate encoding.
Within double-quoted strings, you may use the Unicode escape sequence to represent characters by their codepoints. The syntax \x{} represents a single character; place the hex form of the character's Unicode number within the curly brackets:
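For example (the variable names are illustrative):

```perl
use strict;
use warnings;

# U+00FE, LATIN SMALL LETTER THORN
my $escaped_thorn = "\x{00FE}";

# U+263A, WHITE SMILING FACE; still a single character.
my $smiley = "\x{263A}";
```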
Some Unicode characters have names, and these names are often clearer to read than Unicode numbers. Use the charnames pragma to enable them and the \N{} escape to refer to them:
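A short sketch using full Unicode character names. The two characters chosen are arbitrary:

```perl
use strict;
use warnings;
use charnames ':full';

# Named characters are equivalent to their numeric forms.
my $smiley  = "\N{WHITE SMILING FACE}";
my $e_acute = "\N{LATIN SMALL LETTER E WITH ACUTE}";

my $same_smiley  = $smiley  eq "\x{263A}";
my $same_e_acute = $e_acute eq "\x{E9}";
```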
You may use the \x{} and \N{} forms within regular expressions as well as anywhere else you may legitimately use a string or a character.
Most Unicode problems in Perl arise from the fact that a string could be either a sequence of octets or a sequence of characters. Perl allows you to combine these types through the use of implicit conversions. When these conversions are wrong, they're rarely obviously wrong.
When Perl concatenates a sequence of octets with a sequence of Unicode characters, it implicitly decodes the octet sequence using the Latin-1 encoding. The resulting string will contain Unicode characters. When you print Unicode characters, Perl will encode the string using UTF-8, because Latin-1 cannot represent the entire set of Unicode characters; the Latin-1 repertoire is only a small subset of Unicode.
This asymmetry can lead to Unicode strings encoded as UTF-8 for output and decoded as Latin-1 when input.
Worse yet, when the text contains only English characters with no accents, the bug hides--because both encodings have the same representation for every one of those characters.
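The discussion that follows concerns a concatenation along these lines. The variable names $hello and $name come from the text; the assigned values and the Encode-based comparison are assumptions added to show why the bug hides:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $hello    = "Hello, ";
my $name     = "Alice";
my $greeting = $hello . $name;

# An unaccented name has identical octets in both encodings...
my $ascii_same = encode('UTF-8', $name) eq encode('ISO-8859-1', $name);

# ...while an accented name does not.
my $accented      = "Jos\x{e9}";
my $accented_same = encode('UTF-8', $accented) eq encode('ISO-8859-1', $accented);
```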
If $name contains an English name such as Alice you will never notice any problem, because the Latin-1 representation is the same as the UTF-8 representation. If $name contains a name such as José, $name can contain several possible values:
- $name contains four Unicode characters.
- $name contains four Latin-1 octets representing four Unicode characters.
- $name contains five UTF-8 octets representing four Unicode characters.
The string literal has several possible scenarios:
- It is an ASCII string literal and contains octets.
- It is a Latin-1 string literal with no explicit encoding and contains octets.
- It is a non-ASCII string literal with the utf8 or encoding pragma in effect and contains Unicode characters.
If both $hello and $name are Unicode strings, the concatenation will produce another Unicode string.
If both strings are octet streams, Perl will concatenate them into a new octet string. If both values are octets of the same encoding (both Latin-1, for example), the concatenation will work correctly. If the octets do not share an encoding, for example a concatenation appending UTF-8 data to Latin-1 data, then the resulting sequence of octets makes sense in neither encoding. This could happen if the user entered a name as UTF-8 data and the greeting were a Latin-1 string literal, but the program decoded neither.
If only one of the values is a Unicode string, Perl will decode the other as Latin-1 data. If this is not the correct encoding, the resulting Unicode characters will be wrong. For example, if the user input were UTF-8 data and the string literal were a Unicode string, the name would be incorrectly decoded into five Unicode characters to form JosÃ© (sic) instead of José because the UTF-8 data means something else when decoded as Latin-1 data.
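The mixed case can be reproduced with a sketch like the one below. The greeting includes a character above 255 purely to make it unambiguously a Unicode string, and the undecoded input is written inline as an assumption:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $hello = "\x{263A} Hello, ";    # a Unicode string
my $input = "Jos\xC3\xA9";         # undecoded UTF-8 octets

# Perl treats each octet of $input as a Latin-1 character.
my $wrong = $hello . $input;

# Decoding first gives the intended four-character name.
my $right = $hello . decode('UTF-8', $input);

# The implicit conversion leaves two characters where one was meant.
my $extra_characters = length($wrong) - length($right);
```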
See perldoc perluniintro for a far more detailed explanation of Unicode, encodings, and how to manage incoming and outgoing data in a Unicode world. For far more detail about managing Unicode effectively throughout your programs, see Tom Christiansen's answer to "Why does Modern Perl avoid UTF-8 by default?" at http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129.
Perl supports numbers as both integers and floating-point values. You may represent them with scientific notation as well as in binary, octal, and hexadecimal forms:
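A sketch of each literal form; the particular values are arbitrary:

```perl
use strict;
use warnings;

my $integer   = 42;
my $float     = 0.007;
my $sci_float = 1.02e14;     # scientific notation
my $binary    = 0b101010;    # 0b prefix: binary
my $octal     = 052;         # leading 0: octal
my $hex       = 0x20;        # 0x prefix: hexadecimal
```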
The prefixes 0b, 0, and 0x mark binary, octal, and hexadecimal notation respectively. Be aware that a leading zero on an integer always indicates octal mode.
You may not use commas to separate thousands in numeric literals, lest the parser interpret the commas as comma operators. Instead, use underscores within the number. The parser will treat them as invisible characters; your readers may not. These are equivalent:
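For instance, each of these hypothetical variables holds the same number:

```perl
use strict;
use warnings;

my $billion        = 1000000000;
my $grouped_threes = 1_000_000_000;
my $grouped_fours  = 10_0000_0000;
```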
Consider the most readable alternative.
Because of coercion (coercion), Perl programmers rarely have to worry about converting text read from outside the program to numbers. Perl will treat anything which looks like a number as a number in numeric contexts. In the rare circumstances where you need to know if something looks like a number to Perl, use the looks_like_number function from the core module Scalar::Util. This function returns a true value if Perl will consider the given argument numeric.
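A brief sketch of looks_like_number applied to a handful of sample strings (the candidates are made up):

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

my @candidates = ('42', '-3.14', '1.02e14', 'forty-two', '');

# Map each candidate to 1 if Perl would treat it as numeric, 0 otherwise.
my %numeric = map { $_ => (looks_like_number($_) ? 1 : 0) } @candidates;
```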
The Regexp::Common module from the CPAN provides several well-tested regular expressions to identify more specific valid types (whole number, integer, floating-point value) of numeric values.
Perl 5's undef value represents an unassigned, undefined, and unknown value. Declared but undefined scalar variables contain undef:
undef evaluates to false in boolean context. Evaluating undef in a string context--such as interpolating it into a string--produces an uninitialized value warning:
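The sketch below captures the warning rather than letting it print, so that the behavior can be inspected. The $SIG{__WARN__} handler is an addition for demonstration purposes:

```perl
use strict;
use warnings;

# Collect warnings in an array instead of sending them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $undefined;

# Boolean context: undef is simply false, with no warning.
my $truthiness = $undefined ? 'true' : 'false';

# String context: the interpolation triggers the warning.
my $message = "The value is: $undefined";
```

After this runs, `@warnings` holds one message mentioning an uninitialized value, and `$message` ends with an empty string where the value would have been.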
The defined builtin returns a true value if its operand evaluates to a defined value (anything other than undef):
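For example, with a hypothetical $status variable:

```perl
use strict;
use warnings;

my $status;
my $before = defined $status ? 'defined' : 'undefined';

# Zero is false in boolean context, yet it is a defined value.
$status = 0;
my $after = defined $status ? 'defined' : 'undefined';
```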
When used on the right-hand side of an assignment, the () construct represents an empty list. In scalar context, this evaluates to undef. In list context, it is an empty list. When used on the left-hand side of an assignment, the () construct imposes list context. To count the number of elements returned from an expression in list context without using a temporary variable, use the idiom (idioms):
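A sketch of the counting idiom. The function name get_all_clown_hats() comes from the text, but its body here is a stand-in:

```perl
use strict;
use warnings;

# Hypothetical implementation returning a three-element list.
sub get_all_clown_hats { return ('bowler', 'top hat', 'fez') }

# The empty list forces list context; the outer assignment counts.
my $count = () = get_all_clown_hats();
```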
Because of the right associativity (associativity) of the assignment operator, Perl first evaluates the second assignment by calling get_all_clown_hats() in list context. This produces a list.
Assignment to the empty list throws away all of the values of the list, but that assignment takes place in scalar context, which evaluates to the number of items on the right hand side of the assignment. As a result, $count contains the number of elements in the list returned from get_all_clown_hats().
If you find that concept confusing right now, fear not. As you understand how Perl's fundamental design features fit together in practice, it will make more sense.
A list is a comma-separated group of one or more expressions. Lists may occur verbatim in source code as values:
... as targets of assignments:
... or as lists of expressions:
Parentheses do not create lists. The comma operator creates lists. Where present, the parentheses in these examples group expressions to change their precedence (precedence).
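The three uses of lists described above might look like the following. The helper functions location(), name(), and age() are hypothetical:

```perl
use strict;
use warnings;

# A list appearing verbatim as a value.
my @first_fibs = (1, 1, 2, 3, 5, 8, 13, 21);

# A list as the target of an assignment.
sub location { return ('main', 'example.pl', 42) }
my ($package, $filename, $line) = location();

# A list of expressions passed to a function.
sub name { return 'Alice' }
sub age  { return 30 }
my $summary = join '', name(), ' => ', age();
```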
Use the range operator to create lists of literals in a compact form:
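For instance, with arbitrary endpoints:

```perl
use strict;
use warnings;

my @chars = 'a' .. 'z';    # the lowercase letters
my @count = 13 .. 27;      # the integers 13 through 27, inclusive
```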
Use the qw() operator to split a literal string on whitespace to produce a list of strings:
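A short sketch comparing qw() with the equivalent quoted list; the names are sample data:

```perl
use strict;
use warnings;

my @stooges = qw( Larry Curly Moe Shemp Joey Kenny );

# The same list written with explicit quotes and commas.
my @quoted = ('Larry', 'Curly', 'Moe', 'Shemp', 'Joey', 'Kenny');
```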
Lists can (and often do) occur as the results of expressions, but these lists do not appear literally in source code.
Lists and arrays are not interchangeable in Perl. Lists are values. Arrays are containers. You may store a list in an array and you may coerce an array to a list, but they are separate entities. For example, indexing into a list always occurs in list context. Indexing into an array can occur in scalar context (for a single element) or list context (for a slice):
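The difference in context can be observed with a sketch like this one. The context() helper and the @seen array are illustrative devices that record the context in which each index expression is evaluated:

```perl
use strict;
use warnings;

my @seen;

# Record the calling context, then return index 0.
sub context {
    my $context = wantarray();
    push @seen, defined $context ? ($context ? 'list' : 'scalar') : 'void';
    return 0;
}

my @list_slice  = (1, 2, 3)[ context() ];       # indexing into a list
my @array_slice = @list_slice[ context() ];     # array slice
my $array_index = $array_slice[ context() ];    # single array element
```

After this runs, `@seen` contains `list`, `list`, and `scalar`, in that order.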