Perlretut (1) Linux Manual Page

NAME

perlretut – Perl regular expressions tutorial

DESCRIPTION

This page provides a basic tutorial on understanding, creating and using regular expressions in Perl. It serves as a complement to the reference page on regular expressions, perlre. Regular expressions are an integral part of the "m//", "s///", "qr//" and "split" operators, and so this tutorial also overlaps with "Regexp Quote-Like Operators" in perlop and "split" in perlfunc.

Perl is widely renowned for excellence in text processing, and regular expressions are one of the big factors behind this fame. Perl regular expressions display an efficiency and flexibility unknown in most other computer languages. Mastering even the basics of regular expressions will allow you to manipulate text with surprising ease.

What is a regular expression? At its most basic, a regular expression is a template that is used to determine if a string has certain characteristics. The string is most often some text, such as a line, sentence, web page, or even a whole book, but less commonly it could be some binary data as well. Suppose we want to determine if the text in the variable $var contains the sequence of characters "m u s h r o o m" (blanks added for legibility). We can write in Perl

 $var =~ m/mushroom/

The value of this expression will be TRUE if $var contains that sequence of characters, and FALSE otherwise. The portion enclosed in '/' characters denotes the characteristic we are looking for. We use the term pattern for it. The process of looking to see if the pattern occurs in the string is called matching, and the "=~" operator along with the "m//" tells Perl to try to match the pattern against the string. Note that the pattern is also a string, but a very special kind of one, as we will see. Patterns are in common use these days; examples are the patterns typed into a search engine to find web pages and the patterns used to list files in a directory, e.g., "ls *.txt" or "dir *.*". In Perl, the patterns described by regular expressions are used not only to search strings, but also to extract desired parts of strings, and to do search and replace operations.
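The three uses named above (search, extract, search and replace) can be sketched in a few lines. This is only an illustration: the contents of $var are made up, and the capturing parentheses, "\w+" and the "s///" operator it relies on are explained later in the tutorial.

```perl
use strict;
use warnings;

# $var is the variable from the text; its contents here are invented.
my $var = "A mushroom omelette needs one large mushroom.";

# Search: true if the pattern occurs anywhere in $var.
my $found = ($var =~ m/mushroom/) ? 1 : 0;

# Extract: the parentheses capture the word that follows "one ".
my ($size) = $var =~ m/one (\w+)/;

# Search and replace: work on a copy so that $var stays untouched.
# Without further modifiers, s/// replaces only the first occurrence.
(my $edited = $var) =~ s/mushroom/truffle/;

print "found:  $found\n";
print "size:   $size\n";
print "edited: $edited\n";
```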

Regular expressions have the undeserved reputation of being abstract and difficult to understand. This stems simply from the notation used to express them, which tends to be terse and dense, and not from any inherent complexity. We recommend using the "/x" regular expression modifier (described below) along with plenty of white space to make them less dense, and easier to read. Regular expressions are constructed using simple concepts like conditionals and loops and are no more difficult to understand than the corresponding "if" conditionals and "while" loops in the Perl language itself.
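As a rough sketch of what the "/x" modifier buys you, here is one pattern written twice, first densely and then spread out with comments. The date format is an arbitrary choice for illustration, and the constructs inside the pattern ("^", "$", "\d", "{4}") are covered later in the tutorial.

```perl
use strict;
use warnings;

# Dense form: a date written as four digits, two digits, two digits.
my $dense = qr/^\d{4}-\d{2}-\d{2}$/;

# The same pattern under /x: white space in the pattern is ignored
# and '#' starts a comment that runs to the end of the line.
my $spaced = qr/
    ^
    \d{4}    # year
    -
    \d{2}    # month
    -
    \d{2}    # day
    $
/x;

for my $text ("2024-01-31", "31-01-2024") {
    my $d = $text =~ $dense  ? "matches" : "doesn't match";
    my $s = $text =~ $spaced ? "matches" : "doesn't match";
    print "$text: dense $d, spaced $s\n";
}
```

Both forms accept and reject the same strings; only the readability differs.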

This tutorial flattens the learning curve by discussing regular expression concepts, along with their notation, one at a time and with many examples. The first part of the tutorial will progress from the simplest word searches to the basic regular expression concepts. If you master the first part, you will have all the tools needed to solve about 98% of your needs. The second part of the tutorial is for those comfortable with the basics and hungry for more power tools. It discusses the more advanced regular expression operators and introduces the latest cutting-edge innovations.

A note: to save time, “regular expression” is often abbreviated as regexp or regex. Regexp is a more natural abbreviation than regex, but is harder to pronounce. The Perl pod documentation is evenly split on regexp vs regex; in Perl, there is more than one way to abbreviate it. We’ll use regexp in this tutorial.

New in v5.22, "use re 'strict'" applies stricter rules than otherwise when compiling regular expression patterns. It can find things that, while legal, may not be what you intended.

Part 1: The basics

Simple word matching

The simplest regexp is simply a word, or more generally, a string of characters. A regexp consisting of just a word matches any string that contains that word:

    "Hello World" =~ /World/;  # matches

What is this Perl statement all about? "Hello World" is a simple double-quoted string. "World" is the regular expression and the "//" enclosing "/World/" tells Perl to search a string for a match. The operator "=~" associates the string with the regexp match and produces a true value if the regexp matched, or false if the regexp did not match. In our case, "World" matches the second word in "Hello World", so the expression is true. Expressions like this are useful in conditionals:


    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme. The sense of the match can be reversed by using the "!~" operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    my $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you’re matching against the special default variable $_, the "$_ =~" part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }
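The shorthand is most useful inside loops, where Perl sets $_ for you. A minimal sketch, with an invented list of lines:

```perl
use strict;
use warnings;

# Sample data, made up for this illustration.
my @lines = ("Hello World", "Hello Perl", "World Wide Web");
my @hits;

for (@lines) {                    # each element is aliased to $_ in turn
    push @hits, $_ if /World/;    # same as: $_ =~ /World/
}

print "$_\n" for @hits;
```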

And finally, the "//" default delimiters for a match can be changed to arbitrary delimiters by putting an 'm' out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

"/World/", "m!World!", and "m{World}" all represent the same thing. When, e.g., the quote ('"') is used as a delimiter, the forward slash '/' becomes an ordinary character and can be used in this regexp without trouble.

Let’s consider how different regexps would match "Hello World":

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp "world" doesn’t match because regexps are case-sensitive. The second regexp matches because the substring 'o W' occurs in the string "Hello World". The space character ' ' is treated like any other character in a regexp and is needed to match in this case. The lack of a space character is the reason the third regexp 'oW' doesn’t match. The fourth regexp "World " doesn’t match because there is a space at the end of the regexp, but not at the end of the string. The lesson here is that regexps must match a part of the string exactly in order for the statement to be true.
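The four cases can be checked in one loop by interpolating each pattern from a variable, as shown earlier. This is only a sketch; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $string = "Hello World";
my @results;

# The patterns are single-quoted so the trailing space in the
# fourth one is easy to see.
for my $pattern ('world', 'o W', 'oW', 'World ') {
    my $outcome = $string =~ /$pattern/ ? "matches" : "doesn't match";
    push @results, $outcome;
    print "/$pattern/ $outcome\n";
}
```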

If a regexp matches in more than one place in the string, Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

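One way to see where a match landed is the special array "@-", whose first element holds the offset at which the last successful match started. That array is not introduced in this part of the tutorial, so treat the following as a sketch that borrows it for illustration.

```perl
use strict;
use warnings;

my @offsets;

for my $case (["Hello World", "o"], ["That hat is red", "hat"]) {
    my ($string, $pattern) = @$case;
    if ($string =~ /$pattern/) {
        my $offset = $-[0];    # start of the match, counting from 0
        push @offsets, $offset;
        print "/$pattern/ matched at offset $offset in '$string'\n";
    }
}
```

The offsets are 4 (the 'o' in 'Hello') and 1 (the 'hat' inside 'That'), not the later occurrences.
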
With respect to character matching, there are a few more points you need to know about. First of all, not all characters can be used “as is” in a match. Some characters, called metacharacters, are generally reserved for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?-#\

This list is not as definitive as it may appear (or be claimed to be in other documentation). For example, "#" is a metacharacter only when the "/x" pattern modifier (described below) is used, and both "}" and "]" are metacharacters only when paired with opening "{" or "[" respectively; other gotchas apply.

The significance of each of these will be explained in the rest of the tutorial, but for now, it is important only to know that a metacharacter can be matched as-is by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "#!/usr/bin/perl" =~ /#!\/usr\/bin\/perl/;  # matches

In the last regexp, the forward slash '/' is also backslashed, because it is used to delimit the regexp. This can lead to LTS (leaning toothpick syndrome), however, and it is often more readable to change delimiters.

    "#!/usr/bin/perl" =~ m!#\!/usr/bin/perl!;  # easier to read

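The escaping rules above can be tried out side by side. In this sketch the hash keys are just labels chosen for the illustration.

```perl
use strict;
use warnings;

my $sum  = "2+2=4";
my $path = "#!/usr/bin/perl";

my %result = (
    # Unescaped '+' is a metacharacter, so this does not find "2+2".
    plain_plus   => ($sum  =~ /2+2/                ? 1 : 0),
    # Backslashed '+' is an ordinary plus sign.
    escaped_plus => ($sum  =~ /2\+2/               ? 1 : 0),
    # Default delimiters force every '/' to be backslashed.
    toothpicks   => ($path =~ /#!\/usr\/bin\/perl/ ? 1 : 0),
    # With braces as delimiters, '/' needs no backslash.
    braces       => ($path =~ m{#!/usr/bin/perl}   ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```
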
The backslash character '\' is a metacharacter itself and needs to be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches
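Forgetting to double the backslash does not raise an error here; the regexp just means something else. In the sketch below, the unescaped form fails because "\W" is itself an escape sequence (a non-word character, covered later in the tutorial), so the literal 'W' in the string is never matched.

```perl
use strict;
use warnings;

my $dir = 'C:\WIN32';    # single quotes keep the backslash as-is

# '\\' in the regexp matches one literal backslash.
my $escaped = $dir =~ /C:\\WIN/ ? 1 : 0;

# '\W' matches the backslash as a non-word character, but the
# following 'I' then has to match 'W', which fails.
my $unescaped = $dir =~ /C:\WIN/ ? 1 : 0;

print "escaped: $escaped, unescaped: $unescaped\n";
```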

In situations where it doesn’t make sense for a particular metacharacter to mean what it normally does, it automatically loses its metacharacter-ness and becomes an ordinary character that is to be matched literally. For example, the '}' is a metacharacter only when it is the mate of a '{' metacharacter. Otherwise it is treated as a literal RIGHT CURLY BRACKET. This may lead to unexpected results. "use re 'strict'" can catch some of these.
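A small sketch of the '}' case follows. The "{3}" quantifier and the "^" and "$" anchors it uses are explained later in the tutorial; the strings are invented for the illustration.

```perl
use strict;
use warnings;

# A '}' with no opening '{' is matched literally.
my $lone = "a}b" =~ /a}b/ ? 1 : 0;

# Paired with '{', the braces form a quantifier: exactly three a's.
my $quantifier = "aaa" =~ /^a{3}$/ ? 1 : 0;

# So the same regexp does not match the literal text "a{3}" ...
my $surprise = "a{3}" =~ /^a{3}$/ ? 1 : 0;

# ... unless the braces are backslashed.
my $literal = "a{3}" =~ /^a\{3\}$/ ? 1 : 0;

print "lone: $lone, quantifier: $quantifier, ",
      "surprise: $surprise, literal: $literal\n";
```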

In addition to the metacharacters, there are some ASCII characters which don’t have printable character equivalents and are instead represented by escape sequences. Common examples are "\t" for a tab, "\n" for a newline, "\r" for a carriage return and "\a" for a bell (or alert). If your string is better thought of as a sequence of arbitrary bytes, the octal escape sequence, e.g., "\033", or hexadecimal escape sequence, e.g., "\x1B" may be a more natural representation for your bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)   # matches
    "1000\n2000" =~ /0\n20/   # matches
    "1000\t2000" =~ /
