1N/A=head1 NAME
1N/A
1N/Aperlretut - Perl regular expressions tutorial
1N/A
1N/A=head1 DESCRIPTION
1N/A
1N/AThis page provides a basic tutorial on understanding, creating and
1N/Ausing regular expressions in Perl.  It serves as a complement to the
1N/Areference page on regular expressions L<perlre>.  Regular expressions
1N/Aare an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
1N/Aoperators and so this tutorial also overlaps with
1N/AL<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.
1N/A
1N/APerl is widely renowned for excellence in text processing, and regular
1N/Aexpressions are one of the big factors behind this fame.  Perl regular
1N/Aexpressions display an efficiency and flexibility unknown in most
1N/Aother computer languages.  Mastering even the basics of regular
1N/Aexpressions will allow you to manipulate text with surprising ease.
1N/A
1N/AWhat is a regular expression?  A regular expression is simply a string
1N/Athat describes a pattern.  Patterns are in common use these days;
1N/Aexamples are the patterns typed into a search engine to find web pages
1N/Aand the patterns used to list files in a directory, e.g., C<ls *.txt>
1N/Aor C<dir *.*>.  In Perl, the patterns described by regular expressions
1N/Aare used to search strings, extract desired parts of strings, and to
1N/Ado search and replace operations.
1N/A
1N/ARegular expressions have the undeserved reputation of being abstract
1N/Aand difficult to understand.  Regular expressions are constructed using
1N/Asimple concepts like conditionals and loops and are no more difficult
1N/Ato understand than the corresponding C<if> conditionals and C<while>
1N/Aloops in the Perl language itself.  In fact, the main challenge in
1N/Alearning regular expressions is just getting used to the terse
1N/Anotation used to express these concepts.
1N/A
1N/AThis tutorial flattens the learning curve by discussing regular
1N/Aexpression concepts, along with their notation, one at a time and with
1N/Amany examples.  The first part of the tutorial will progress from the
1N/Asimplest word searches to the basic regular expression concepts.  If
you master the first part, you will have all the tools needed to meet
about 98% of your needs.  The second part of the tutorial is for those
1N/Acomfortable with the basics and hungry for more power tools.  It
1N/Adiscusses the more advanced regular expression operators and
1N/Aintroduces the latest cutting edge innovations in 5.6.0.
1N/A
1N/AA note: to save time, 'regular expression' is often abbreviated as
1N/Aregexp or regex.  Regexp is a more natural abbreviation than regex, but
1N/Ais harder to pronounce.  The Perl pod documentation is evenly split on
1N/Aregexp vs regex; in Perl, there is more than one way to abbreviate it.
1N/AWe'll use regexp in this tutorial.
1N/A
1N/A=head1 Part 1: The basics
1N/A
1N/A=head2 Simple word matching
1N/A
1N/AThe simplest regexp is simply a word, or more generally, a string of
1N/Acharacters.  A regexp consisting of a word matches any string that
1N/Acontains that word:
1N/A
1N/A    "Hello World" =~ /World/;  # matches
1N/A
1N/AWhat is this perl statement all about? C<"Hello World"> is a simple
1N/Adouble quoted string.  C<World> is the regular expression and the
1N/AC<//> enclosing C</World/> tells perl to search a string for a match.
1N/AThe operator C<=~> associates the string with the regexp match and
1N/Aproduces a true value if the regexp matched, or false if the regexp
1N/Adid not match.  In our case, C<World> matches the second word in
1N/AC<"Hello World">, so the expression is true.  Expressions like this
1N/Aare useful in conditionals:
1N/A
1N/A    if ("Hello World" =~ /World/) {
1N/A        print "It matches\n";
1N/A    }
1N/A    else {
1N/A        print "It doesn't match\n";
1N/A    }
1N/A
1N/AThere are useful variations on this theme.  The sense of the match can
be reversed by using the C<!~> operator:
1N/A
1N/A    if ("Hello World" !~ /World/) {
1N/A        print "It doesn't match\n";
1N/A    }
1N/A    else {
1N/A        print "It matches\n";
1N/A    }
1N/A
1N/AThe literal string in the regexp can be replaced by a variable:
1N/A
1N/A    $greeting = "World";
1N/A    if ("Hello World" =~ /$greeting/) {
1N/A        print "It matches\n";
1N/A    }
1N/A    else {
1N/A        print "It doesn't match\n";
1N/A    }
1N/A
1N/AIf you're matching against the special default variable C<$_>, the
1N/AC<$_ =~> part can be omitted:
1N/A
1N/A    $_ = "Hello World";
1N/A    if (/World/) {
1N/A        print "It matches\n";
1N/A    }
1N/A    else {
1N/A        print "It doesn't match\n";
1N/A    }
1N/A
1N/AAnd finally, the C<//> default delimiters for a match can be changed
1N/Ato arbitrary delimiters by putting an C<'m'> out front:
1N/A
1N/A    "Hello World" =~ m!World!;   # matches, delimited by '!'
1N/A    "Hello World" =~ m{World};   # matches, note the matching '{}'
1N/A    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
1N/A                                 # '/' becomes an ordinary char
1N/A
1N/AC</World/>, C<m!World!>, and C<m{World}> all represent the
1N/Asame thing.  When, e.g., C<""> is used as a delimiter, the forward
1N/Aslash C<'/'> becomes an ordinary character and can be used in a regexp
1N/Awithout trouble.
1N/A
1N/ALet's consider how different regexps would match C<"Hello World">:
1N/A
1N/A    "Hello World" =~ /world/;  # doesn't match
1N/A    "Hello World" =~ /o W/;    # matches
1N/A    "Hello World" =~ /oW/;     # doesn't match
1N/A    "Hello World" =~ /World /; # doesn't match
1N/A
1N/AThe first regexp C<world> doesn't match because regexps are
1N/Acase-sensitive.  The second regexp matches because the substring
1N/AS<C<'o W'> > occurs in the string S<C<"Hello World"> >.  The space
1N/Acharacter ' ' is treated like any other character in a regexp and is
1N/Aneeded to match in this case.  The lack of a space character is the
1N/Areason the third regexp C<'oW'> doesn't match.  The fourth regexp
1N/AC<'World '> doesn't match because there is a space at the end of the
1N/Aregexp, but not at the end of the string.  The lesson here is that
1N/Aregexps must match a part of the string I<exactly> in order for the
1N/Astatement to be true.
1N/A
1N/AIf a regexp matches in more than one place in the string, perl will
1N/Aalways match at the earliest possible point in the string:
1N/A
1N/A    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
1N/A    "That hat is red" =~ /hat/; # matches 'hat' in 'That'
1N/A
1N/AWith respect to character matching, there are a few more points you
1N/Aneed to know about.   First of all, not all characters can be used 'as
1N/Ais' in a match.  Some characters, called B<metacharacters>, are reserved
1N/Afor use in regexp notation.  The metacharacters are
1N/A
1N/A    {}[]()^$.|*+?\
1N/A
1N/AThe significance of each of these will be explained
1N/Ain the rest of the tutorial, but for now, it is important only to know
1N/Athat a metacharacter can be matched by putting a backslash before it:
1N/A
1N/A    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
1N/A    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
1N/A    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
1N/A    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
1N/A    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches
1N/A
1N/AIn the last regexp, the forward slash C<'/'> is also backslashed,
1N/Abecause it is used to delimit the regexp.  This can lead to LTS
1N/A(leaning toothpick syndrome), however, and it is often more readable
1N/Ato change delimiters.
1N/A
1N/A    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read
1N/A
1N/AThe backslash character C<'\'> is a metacharacter itself and needs to
1N/Abe backslashed:
1N/A
1N/A    'C:\WIN32' =~ /C:\\WIN/;   # matches
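
When the text to be matched literally is held in a variable, adding the
backslashes by hand is error-prone.  As a hedged sketch (this goes
slightly beyond what the section has covered so far), Perl's built-in
C<quotemeta> function and the C<\Q...\E> escape do the backslashing
for you; the variable names here are made up for illustration:

```perl
use strict;
use warnings;

my $literal = '[0,1).';              # text full of metacharacters
my $escaped = quotemeta($literal);   # backslashes every non-word character

# Match the escaped text as a literal string.
my $found = "The interval is [0,1)." =~ /$escaped/ ? 1 : 0;

# \Q...\E applies the same escaping inline.
my $found_inline = "2+2=4" =~ /\Q2+2\E/ ? 1 : 0;

print "$found $found_inline\n";   # prints "1 1"
```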
1N/A
1N/AIn addition to the metacharacters, there are some ASCII characters
1N/Awhich don't have printable character equivalents and are instead
1N/Arepresented by B<escape sequences>.  Common examples are C<\t> for a
1N/Atab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
1N/Abell.  If your string is better thought of as a sequence of arbitrary
1N/Abytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
1N/Asequence, e.g., C<\x1B> may be a more natural representation for your
1N/Abytes.  Here are some examples of escapes:
1N/A
1N/A    "1000\t2000" =~ m(0\t2)   # matches
1N/A    "1000\n2000" =~ /0\n20/   # matches
1N/A    "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
1N/A    "cat"        =~ /\143\x61\x74/ # matches, but a weird way to spell cat
1N/A
1N/AIf you've been around Perl a while, all this talk of escape sequences
1N/Amay seem familiar.  Similar escape sequences are used in double-quoted
1N/Astrings and in fact the regexps in Perl are mostly treated as
1N/Adouble-quoted strings.  This means that variables can be used in
1N/Aregexps as well.  Just like double-quoted strings, the values of the
1N/Avariables in the regexp will be substituted in before the regexp is
1N/Aevaluated for matching purposes.  So we have:
1N/A
1N/A    $foo = 'house';
1N/A    'housecat' =~ /$foo/;      # matches
1N/A    'cathouse' =~ /cat$foo/;   # matches
1N/A    'housecat' =~ /${foo}cat/; # matches
1N/A
1N/ASo far, so good.  With the knowledge above you can already perform
1N/Asearches with just about any literal string regexp you can dream up.
1N/AHere is a I<very simple> emulation of the Unix grep program:
1N/A
1N/A    % cat > simple_grep
1N/A    #!/usr/bin/perl
1N/A    $regexp = shift;
1N/A    while (<>) {
1N/A        print if /$regexp/;
1N/A    }
1N/A    ^D
1N/A
1N/A    % chmod +x simple_grep
1N/A
1N/A    % simple_grep abba /usr/dict/words
1N/A    Babbage
1N/A    cabbage
1N/A    cabbages
1N/A    sabbath
1N/A    Sabbathize
1N/A    Sabbathizes
1N/A    sabbatical
1N/A    scabbard
1N/A    scabbards
1N/A
1N/AThis program is easy to understand.  C<#!/usr/bin/perl> is the standard
1N/Away to invoke a perl program from the shell.
1N/AS<C<$regexp = shift;> > saves the first command line argument as the
1N/Aregexp to be used, leaving the rest of the command line arguments to
1N/Abe treated as files.  S<C<< while (<>) >> > loops over all the lines in
1N/Aall the files.  For each line, S<C<print if /$regexp/;> > prints the
1N/Aline if the regexp matches the line.  In this line, both C<print> and
1N/AC</$regexp/> use the default variable C<$_> implicitly.
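
The same loop can be wrapped in a subroutine so that it can be tried
without a dictionary file.  This is only a sketch: C<grep_lines> is a
hypothetical helper name, and the word list is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that match $regexp,
# mirroring the "print if /$regexp/" loop in simple_grep.
sub grep_lines {
    my ($regexp, @lines) = @_;
    my @hits;
    for (@lines) {                    # each line is aliased to $_
        push @hits, $_ if /$regexp/;  # /$regexp/ tests $_ implicitly
    }
    return @hits;
}

my @hits = grep_lines('abba', qw(Babbage cabbage rabbit sabbath));
print "@hits\n";   # prints "Babbage cabbage sabbath"
```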
1N/A
1N/AWith all of the regexps above, if the regexp matched anywhere in the
1N/Astring, it was considered a match.  Sometimes, however, we'd like to
1N/Aspecify I<where> in the string the regexp should try to match.  To do
1N/Athis, we would use the B<anchor> metacharacters C<^> and C<$>.  The
1N/Aanchor C<^> means match at the beginning of the string and the anchor
1N/AC<$> means match at the end of the string, or before a newline at the
1N/Aend of the string.  Here is how they are used:
1N/A
1N/A    "housekeeper" =~ /keeper/;    # matches
1N/A    "housekeeper" =~ /^keeper/;   # doesn't match
1N/A    "housekeeper" =~ /keeper$/;   # matches
1N/A    "housekeeper\n" =~ /keeper$/; # matches
1N/A
1N/AThe second regexp doesn't match because C<^> constrains C<keeper> to
1N/Amatch only at the beginning of the string, but C<"housekeeper"> has
1N/Akeeper starting in the middle.  The third regexp does match, since the
1N/AC<$> constrains C<keeper> to match only at the end of the string.
1N/A
1N/AWhen both C<^> and C<$> are used at the same time, the regexp has to
1N/Amatch both the beginning and the end of the string, i.e., the regexp
1N/Amatches the whole string.  Consider
1N/A
1N/A    "keeper" =~ /^keep$/;      # doesn't match
1N/A    "keeper" =~ /^keeper$/;    # matches
1N/A    ""       =~ /^$/;          # ^$ matches an empty string
1N/A
1N/AThe first regexp doesn't match because the string has more to it than
1N/AC<keep>.  Since the second regexp is exactly the string, it
1N/Amatches.  Using both C<^> and C<$> in a regexp forces the complete
1N/Astring to match, so it gives you complete control over which strings
1N/Amatch and which don't.  Suppose you are looking for a fellow named
1N/Abert, off in a string by himself:
1N/A
1N/A    "dogbert" =~ /bert/;   # matches, but not what you want
1N/A
1N/A    "dilbert" =~ /^bert/;  # doesn't match, but ..
1N/A    "bertram" =~ /^bert/;  # matches, so still not good enough
1N/A
1N/A    "bertram" =~ /^bert$/; # doesn't match, good
1N/A    "dilbert" =~ /^bert$/; # doesn't match, good
1N/A    "bert"    =~ /^bert$/; # matches, perfect
1N/A
1N/AOf course, in the case of a literal string, one could just as easily
1N/Ause the string equivalence S<C<$string eq 'bert'> > and it would be
1N/Amore efficient.   The  C<^...$> regexp really becomes useful when we
1N/Aadd in the more powerful regexp tools below.
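
One difference from C<eq> is worth seeing in code: because C<$> also
matches before a newline at the end of the string, C</^bert$/> accepts
C<"bert\n"> as well.  A small sketch, with C<is_bert> as a hypothetical
helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name is exactly 'bert',
# optionally followed by a single trailing newline.
sub is_bert {
    my ($name) = @_;
    return $name =~ /^bert$/ ? 1 : 0;
}

my @results = map { is_bert($_) } ("bert", "bert\n", "dogbert", "bertram");
print "@results\n";   # prints "1 1 0 0"
```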
1N/A
1N/A=head2 Using character classes
1N/A
1N/AAlthough one can already do quite a lot with the literal string
1N/Aregexps above, we've only scratched the surface of regular expression
1N/Atechnology.  In this and subsequent sections we will introduce regexp
1N/Aconcepts (and associated metacharacter notations) that will allow a
1N/Aregexp to not just represent a single character sequence, but a I<whole
1N/Aclass> of them.
1N/A
1N/AOne such concept is that of a B<character class>.  A character class
1N/Aallows a set of possible characters, rather than just a single
1N/Acharacter, to match at a particular point in a regexp.  Character
1N/Aclasses are denoted by brackets C<[...]>, with the set of characters
1N/Ato be possibly matched inside.  Here are some examples:
1N/A
1N/A    /cat/;       # matches 'cat'
    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
1N/A    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
1N/A    "abc" =~ /[cab]/;    # matches 'a'
1N/A
1N/AIn the last statement, even though C<'c'> is the first character in
1N/Athe class, C<'a'> matches because the first character position in the
1N/Astring is the earliest point at which the regexp can match.
1N/A
1N/A    /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
1N/A                         # 'yes', 'Yes', 'YES', etc.
1N/A
1N/AThis regexp displays a common task: perform a case-insensitive
match.  Perl provides a way of avoiding all those brackets by simply
1N/Aappending an C<'i'> to the end of the match.  Then C</[yY][eE][sS]/;>
1N/Acan be rewritten as C</yes/i;>.  The C<'i'> stands for
1N/Acase-insensitive and is an example of a B<modifier> of the matching
1N/Aoperation.  We will meet other modifiers later in the tutorial.
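
As a quick check that the two spellings behave alike, the sketch below
filters a made-up list of answers with Perl's built-in C<grep>
function, once with bracketed classes and once with the C<i> modifier:

```perl
use strict;
use warnings;

my @answers = ("yes", "Yes", "YES", "no");

# Both filters keep the same three answers.
my @with_classes  = grep { /[yY][eE][sS]/ } @answers;
my @with_modifier = grep { /yes/i }         @answers;

print "@with_classes\n";    # prints "yes Yes YES"
print "@with_modifier\n";   # prints "yes Yes YES"
```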
1N/A
1N/AWe saw in the section above that there were ordinary characters, which
1N/Arepresented themselves, and special characters, which needed a
1N/Abackslash C<\> to represent themselves.  The same is true in a
1N/Acharacter class, but the sets of ordinary and special characters
1N/Ainside a character class are different than those outside a character
1N/Aclass.  The special characters for a character class are C<-]\^$>.  C<]>
1N/Ais special because it denotes the end of a character class.  C<$> is
1N/Aspecial because it denotes a scalar variable.  C<\> is special because
1N/Ait is used in escape sequences, just like above.  Here is how the
1N/Aspecial characters C<]$\> are handled:
1N/A
1N/A   /[\]c]def/; # matches ']def' or 'cdef'
1N/A   $x = 'bcr';
1N/A   /[$x]at/;   # matches 'bat', 'cat', or 'rat'
1N/A   /[\$x]at/;  # matches '$at' or 'xat'
   /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
1N/A
The last two are a little tricky.  In C<[\$x]>, the backslash protects
1N/Athe dollar sign, so the character class has two members C<$> and C<x>.
1N/AIn C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
1N/Avariable and substituted in double quote fashion.
1N/A
1N/AThe special character C<'-'> acts as a range operator within character
1N/Aclasses, so that a contiguous set of characters can be written as a
1N/Arange.  With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
1N/Abecome the svelte C<[0-9]> and C<[a-z]>.  Some examples are
1N/A
1N/A    /item[0-9]/;  # matches 'item0' or ... or 'item9'
1N/A    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
1N/A                    # 'baa', 'xaa', 'yaa', or 'zaa'
1N/A    /[0-9a-fA-F]/;  # matches a hexadecimal digit
1N/A    /[0-9a-zA-Z_]/; # matches a "word" character,
1N/A                    # like those in a perl variable name
1N/A
1N/AIf C<'-'> is the first or last character in a character class, it is
1N/Atreated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
1N/Aall equivalent.
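
A short sketch of ranges in use, assuming we want to test two-character
hexadecimal strings; C<is_hex_byte> is a hypothetical helper, and the
second line checks that the three spellings of a literal C<'-'> agree:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $s is exactly two hexadecimal digits.
sub is_hex_byte {
    my ($s) = @_;
    return $s =~ /^[0-9a-fA-F][0-9a-fA-F]$/ ? 1 : 0;
}

my @hex  = map { is_hex_byte($_) } ("1F", "a0", "g0", "7-");

# A leading, trailing, or backslashed '-' is an ordinary character.
my @dash = map { "-" =~ $_ ? 1 : 0 } (qr/[-ab]/, qr/[ab-]/, qr/[a\-b]/);

print "@hex | @dash\n";   # prints "1 1 0 0 | 1 1 1"
```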
1N/A
1N/AThe special character C<^> in the first position of a character class
1N/Adenotes a B<negated character class>, which matches any character but
1N/Athose in the brackets.  Both C<[...]> and C<[^...]> must match a
1N/Acharacter, or the match fails.  Then
1N/A
1N/A    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
1N/A    /[^0-9]/;  # matches a non-numeric character
1N/A    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
1N/A
Now, even C<[0-9]> can be a bother to write multiple times, so in the
1N/Ainterest of saving keystrokes and making regexps more readable, Perl
1N/Ahas several abbreviations for common character classes:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/A\d is a digit and represents [0-9]
1N/A
1N/A=item *
1N/A
1N/A\s is a whitespace character and represents [\ \t\r\n\f]
1N/A
1N/A=item *
1N/A
1N/A\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
1N/A
1N/A=item *
1N/A
1N/A\D is a negated \d; it represents any character but a digit [^0-9]
1N/A
1N/A=item *
1N/A
1N/A\S is a negated \s; it represents any non-whitespace character [^\s]
1N/A
1N/A=item *
1N/A
1N/A\W is a negated \w; it represents any non-word character [^\w]
1N/A
1N/A=item *
1N/A
1N/AThe period '.' matches any character but "\n"
1N/A
1N/A=back
1N/A
1N/AThe C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
1N/Aof character classes.  Here are some in use:
1N/A
1N/A    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
1N/A    /[\d\s]/;         # matches any digit or whitespace character
1N/A    /\w\W\w/;         # matches a word char, followed by a
1N/A                      # non-word char, followed by a word char
1N/A    /..rt/;           # matches any two chars, followed by 'rt'
1N/A    /end\./;          # matches 'end.'
1N/A    /end[.]/;         # same thing, matches 'end.'
1N/A
1N/ABecause a period is a metacharacter, it needs to be escaped to match
1N/Aas an ordinary period. Because, for example, C<\d> and C<\w> are sets
1N/Aof characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
1N/Afact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
1N/AC<[\W]>. Think DeMorgan's laws.
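
The point about DeMorgan's laws can be verified on a handful of sample
characters.  This is only an illustrative sketch; the character list is
arbitrary:

```perl
use strict;
use warnings;

my @chars = ('a', '7', '_', ' ', '%', "\t");

my @neg_union = grep { /[^\d\w]/ } @chars;   # neither a digit nor a word char
my @non_word  = grep { /\W/ }      @chars;   # not a word char
my @either    = grep { /[\D\W]/ }  @chars;   # a non-digit OR a non-word char

printf "%d %d %d\n", scalar @neg_union, scalar @non_word, scalar @either;
# prints "3 3 5"
```

C<[^\d\w]> and C<\W> select the same three characters, while
C<[\D\W]> also accepts C<'a'> and C<'_'> because they are non-digits.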
1N/A
1N/AAn anchor useful in basic regexps is the S<B<word anchor> >
1N/AC<\b>.  This matches a boundary between a word character and a non-word
1N/Acharacter C<\w\W> or C<\W\w>:
1N/A
1N/A    $x = "Housecat catenates house and cat";
    $x =~ /cat/;    # matches cat in 'Housecat'
    $x =~ /\bcat/;  # matches cat in 'catenates'
    $x =~ /cat\b/;  # matches cat in 'Housecat'
1N/A    $x =~ /\bcat\b/;  # matches 'cat' at end of string
1N/A
1N/ANote in the last example, the end of the string is considered a word
1N/Aboundary.
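
To see where each of these regexps actually matches, the following
sketch records the start offset of every match using C<$-[0]> (the
C<@-> array is described later in the tutorial):

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# Start offset (0-based) of the match for each regexp.
my @starts;
for my $regexp (qr/cat/, qr/\bcat/, qr/cat\b/, qr/\bcat\b/) {
    push @starts, $-[0] if $x =~ $regexp;
}
print "@starts\n";   # prints "5 9 5 29"
```

Offset 5 is the C<cat> in 'Housecat', 9 is the start of 'catenates',
and 29 is the final word.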
1N/A
1N/AYou might wonder why C<'.'> matches everything but C<"\n"> - why not
1N/Aevery character? The reason is that often one is matching against
1N/Alines and would like to ignore the newline characters.  For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty.  Then
1N/A
1N/A    ""   =~ /^$/;    # matches
1N/A    "\n" =~ /^$/;    # matches, "\n" is ignored
1N/A
1N/A    ""   =~ /./;      # doesn't match; it needs a char
1N/A    ""   =~ /^.$/;    # doesn't match; it needs a char
1N/A    "\n" =~ /^.$/;    # doesn't match; it needs a char other than "\n"
1N/A    "a"  =~ /^.$/;    # matches
1N/A    "a\n"  =~ /^.$/;  # matches, ignores the "\n"
1N/A
1N/AThis behavior is convenient, because we usually want to ignore
1N/Anewlines when we count and match characters in a line.  Sometimes,
1N/Ahowever, we want to keep track of newlines.  We might even want C<^>
1N/Aand C<$> to anchor at the beginning and end of lines within the
1N/Astring, rather than just the beginning and end of the string.  Perl
1N/Aallows us to choose between ignoring and paying attention to newlines
1N/Aby using the C<//s> and C<//m> modifiers.  C<//s> and C<//m> stand for
1N/Asingle line and multi-line and they determine whether a string is to
1N/Abe treated as one continuous string, or as a set of lines.  The two
1N/Amodifiers affect two aspects of how the regexp is interpreted: 1) how
1N/Athe C<'.'> character class is defined, and 2) where the anchors C<^>
1N/Aand C<$> are able to match.  Here are the four possible combinations:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/Ano modifiers (//): Default behavior.  C<'.'> matches any character
1N/Aexcept C<"\n">.  C<^> matches only at the beginning of the string and
1N/AC<$> matches only at the end or before a newline at the end.
1N/A
1N/A=item *
1N/A
1N/As modifier (//s): Treat string as a single long line.  C<'.'> matches
1N/Aany character, even C<"\n">.  C<^> matches only at the beginning of
1N/Athe string and C<$> matches only at the end or before a newline at the
1N/Aend.
1N/A
1N/A=item *
1N/A
1N/Am modifier (//m): Treat string as a set of multiple lines.  C<'.'>
1N/Amatches any character except C<"\n">.  C<^> and C<$> are able to match
1N/Aat the start or end of I<any> line within the string.
1N/A
1N/A=item *
1N/A
1N/Aboth s and m modifiers (//sm): Treat string as a single long line, but
1N/Adetect multiple lines.  C<'.'> matches any character, even
1N/AC<"\n">.  C<^> and C<$>, however, are able to match at the start or end
1N/Aof I<any> line within the string.
1N/A
1N/A=back
1N/A
1N/AHere are examples of C<//s> and C<//m> in action:
1N/A
1N/A    $x = "There once was a girl\nWho programmed in Perl\n";
1N/A
1N/A    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
1N/A    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
1N/A    $x =~ /^Who/m;  # matches, "Who" at start of second line
1N/A    $x =~ /^Who/sm; # matches, "Who" at start of second line
1N/A
1N/A    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
1N/A    $x =~ /girl.Who/s;  # matches, "." matches "\n"
1N/A    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
1N/A    $x =~ /girl.Who/sm; # matches, "." matches "\n"
1N/A
Most of the time, the default behavior is what is wanted, but C<//s> and
1N/AC<//m> are occasionally very useful.  If C<//m> is being used, the start
1N/Aof the string can still be matched with C<\A> and the end of string
1N/Acan still be matched with the anchors C<\Z> (matches both the end and
1N/Athe newline before, like C<$>), and C<\z> (matches only the end):
1N/A
1N/A    $x =~ /^Who/m;   # matches, "Who" at start of second line
1N/A    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string
1N/A
1N/A    $x =~ /girl$/m;  # matches, "girl" at end of first line
1N/A    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string
1N/A
1N/A    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
1N/A    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string
1N/A
1N/AWe now know how to create choices among classes of characters in a
1N/Aregexp.  What about choices among words or character strings? Such
1N/Achoices are described in the next section.
1N/A
1N/A=head2 Matching this or that
1N/A
Sometimes we would like our regexp to be able to match different
1N/Apossible words or character strings.  This is accomplished by using
1N/Athe B<alternation> metacharacter C<|>.  To match C<dog> or C<cat>, we
1N/Aform the regexp C<dog|cat>.  As before, perl will try to match the
1N/Aregexp at the earliest possible point in the string.  At each
1N/Acharacter position, perl will first try to match the first
1N/Aalternative, C<dog>.  If C<dog> doesn't match, perl will then try the
1N/Anext alternative, C<cat>.  If C<cat> doesn't match either, then the
1N/Amatch fails and perl moves to the next position in the string.  Some
1N/Aexamples:
1N/A
1N/A    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
1N/A    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
1N/A
1N/AEven though C<dog> is the first alternative in the second regexp,
1N/AC<cat> is able to match earlier in the string.
1N/A
1N/A    "cats"          =~ /c|ca|cat|cats/; # matches "c"
1N/A    "cats"          =~ /cats|cat|ca|c/; # matches "cats"
1N/A
1N/AHere, all the alternatives match at the first string position, so the
1N/Afirst alternative is the one that matches.  If some of the
1N/Aalternatives are truncations of the others, put the longest ones first
1N/Ato give them a chance to match.
1N/A
1N/A    "cab" =~ /a|b|c/ # matches "c"
1N/A                     # /a|b|c/ == /[abc]/
1N/A
1N/AThe last example points out that character classes are like
1N/Aalternations of characters.  At a given character position, the first
1N/Aalternative that allows the regexp match to succeed will be the one
1N/Athat matches.
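
To observe which alternative wins, the sketch below prints the matched
text using the special variable C<$&>, which holds the text of the last
successful match (it is not introduced until later; on Perls older than
5.20 using it may slow down other matches in the program):

```perl
use strict;
use warnings;

my @matched;
for my $regexp (qr/c|ca|cat|cats/, qr/cats|cat|ca|c/, qr/dog|cat|bird/) {
    # $& is the text that the whole regexp matched.
    push @matched, $& if "cats and dogs" =~ $regexp;
}
print "@matched\n";   # prints "c cats cat"
```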
1N/A
1N/A=head2 Grouping things and hierarchical matching
1N/A
Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying.  The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
1N/Aregexp.  For instance, suppose we want to search for housecats or
1N/Ahousekeepers.  The regexp C<housecat|housekeeper> fits the bill, but is
1N/Ainefficient because we had to type C<house> twice.  It would be nice to
1N/Ahave parts of the regexp be constant, like C<house>, and some
1N/Aparts have alternatives, like C<cat|keeper>.
1N/A
1N/AThe B<grouping> metacharacters C<()> solve this problem.  Grouping
1N/Aallows parts of a regexp to be treated as a single unit.  Parts of a
regexp are grouped by enclosing them in parentheses.  Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
1N/AC<house(cat|keeper)>.  The regexp C<house(cat|keeper)> means match
1N/AC<house> followed by either C<cat> or C<keeper>.  Some more examples
1N/Aare
1N/A
1N/A    /(a|b)b/;    # matches 'ab' or 'bb'
1N/A    /(ac|b)b/;   # matches 'acb' or 'bb'
1N/A    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
1N/A    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'
1N/A
1N/A    /house(cat|)/;  # matches either 'housecat' or 'house'
1N/A    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
1N/A                        # 'house'.  Note groups can be nested.
1N/A
1N/A    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
1N/A    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
1N/A                             # because '20\d\d' can't match
1N/A
1N/AAlternations behave the same way in groups as out of them: at a given
1N/Astring position, the leftmost alternative that allows the regexp to
1N/Amatch is taken.  So in the last example at the first string position,
1N/AC<"20"> matches the second alternative, but there is nothing left over
1N/Ato match the next two digits C<\d\d>.  So perl moves on to the next
1N/Aalternative, which is the null alternative and that works, since
1N/AC<"20"> is two digits.
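
The effect of the null alternative can be observed through C<$1>, the
text matched by the first group (matching variables are described in
the section on extracting matches).  A minimal sketch with made-up
sample years:

```perl
use strict;
use warnings;

my @prefixes;
for my $year ("1999", "2024", "20", "85") {
    # $1 holds whichever alternative the group ended up matching.
    push @prefixes, "[$1]" if $year =~ /(19|20|)\d\d/;
}
print "@prefixes\n";   # prints "[19] [20] [] []"
```

For C<"20"> and C<"85"> the group matches the empty string, leaving
both characters for C<\d\d>.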
1N/A
1N/AThe process of trying one alternative, seeing if it matches, and
1N/Amoving on to the next alternative if it doesn't, is called
1N/AB<backtracking>.  The term 'backtracking' comes from the idea that
1N/Amatching a regexp is like a walk in the woods.  Successfully matching
1N/Aa regexp is like arriving at a destination.  There are many possible
1N/Atrailheads, one for each string position, and each one is tried in
1N/Aorder, left to right.  From each trailhead there may be many paths,
some of which get you there, and some of which are dead ends.  When you
1N/Awalk along a trail and hit a dead end, you have to backtrack along the
1N/Atrail to an earlier point to try another trail.  If you hit your
1N/Adestination, you stop immediately and forget about trying all the
1N/Aother trails.  You are persistent, and only if you have tried all the
1N/Atrails from all the trailheads and not arrived at your destination, do
1N/Ayou declare failure.  To be concrete, here is a step-by-step analysis
1N/Aof what perl does when it tries to match the regexp
1N/A
1N/A    "abcde" =~ /(abd|abc)(df|d|de)/;
1N/A
1N/A=over 4
1N/A
1N/A=item 0
1N/A
1N/AStart with the first letter in the string 'a'.
1N/A
1N/A=item 1
1N/A
1N/ATry the first alternative in the first group 'abd'.
1N/A
1N/A=item 2
1N/A
1N/AMatch 'a' followed by 'b'. So far so good.
1N/A
1N/A=item 3
1N/A
1N/A'd' in the regexp doesn't match 'c' in the string - a dead
1N/Aend.  So backtrack two characters and pick the second alternative in
1N/Athe first group 'abc'.
1N/A
1N/A=item 4
1N/A
1N/AMatch 'a' followed by 'b' followed by 'c'.  We are on a roll
1N/Aand have satisfied the first group. Set $1 to 'abc'.
1N/A
1N/A=item 5
1N/A
1N/AMove on to the second group and pick the first alternative
1N/A'df'.
1N/A
1N/A=item 6
1N/A
1N/AMatch the 'd'.
1N/A
1N/A=item 7
1N/A
1N/A'f' in the regexp doesn't match 'e' in the string, so a dead
1N/Aend.  Backtrack one character and pick the second alternative in the
1N/Asecond group 'd'.
1N/A
1N/A=item 8
1N/A
1N/A'd' matches. The second grouping is satisfied, so set $2 to
1N/A'd'.
1N/A
1N/A=item 9
1N/A
1N/AWe are at the end of the regexp, so we are done! We have
1N/Amatched 'abcd' out of the string "abcde".
1N/A
1N/A=back
1N/A
1N/AThere are a couple of things to note about this analysis.  First, the
1N/Athird alternative in the second group 'de' also allows a match, but we
1N/Astopped before we got to it - at a given character position, leftmost
1N/Awins.  Second, we were able to get a match at the first character
1N/Aposition of the string 'a'.  If there were no matches at the first
1N/Aposition, perl would move to the second character position 'b' and
1N/Aattempt the match all over again.  Only when all possible paths at all
1N/Apossible character positions have been exhausted does perl give
1N/Aup and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.
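
The outcome of the step-by-step analysis can be confirmed by running
the match and inspecting the results.  In this sketch C<$&> (the text
of the whole match) is used in addition to C<$1> and C<$2>; the
variable names are only illustrative:

```perl
use strict;
use warnings;

my ($first, $second, $whole) = ('', '', '');
if ("abcde" =~ /(abd|abc)(df|d|de)/) {
    # $1 and $2 are the two groups, $& is everything that matched.
    ($first, $second, $whole) = ($1, $2, $&);
}
print "$first $second $whole\n";   # prints "abc d abcd"
```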
1N/A
1N/AEven with all this work, regexp matching happens remarkably fast.  To
speed things up, during the compilation stage, perl compiles the regexp
1N/Ainto a compact sequence of opcodes that can often fit inside a
1N/Aprocessor cache.  When the code is executed, these opcodes can then run
1N/Aat full throttle and search very quickly.
1N/A
1N/A=head2 Extracting matches
1N/A
1N/AThe grouping metacharacters C<()> also serve another completely
1N/Adifferent function: they allow the extraction of the parts of a string
1N/Athat matched.  This is very useful to find out what matched and for
1N/Atext processing in general.  For each grouping, the part that matched
1N/Ainside goes into the special variables C<$1>, C<$2>, etc.  They can be
1N/Aused just as ordinary variables:
1N/A
    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
1N/A
1N/ANow, we know that in scalar context,
1N/AS<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
1N/Avalue.  In list context, however, it returns the list of matched values
1N/AC<($1,$2,$3)>.  So we could write the code more compactly as
1N/A
    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
1N/A
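
A self-contained version of the list-context form, as a sketch with a
made-up time string.  It also shows that a failed match in list context
yields an empty list:

```perl
use strict;
use warnings;

my $time = "12:34:56";

# List context: the match returns ($1, $2, $3).
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;

# A failed match in list context returns the empty list.
my @none = "noon" =~ /(\d\d):(\d\d):(\d\d)/;

print "$hours $minutes $seconds ", scalar(@none), "\n";   # prints "12 34 56 0"
```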
1N/AIf the groupings in a regexp are nested, C<$1> gets the group with the
1N/Aleftmost opening parenthesis, C<$2> the next opening parenthesis,
1N/Aetc.  For example, here is a complex regexp and the matching variables
1N/Aindicated below it:
1N/A
1N/A    /(ab(cd|ef)((gi)|j))/;
1N/A     1  2      34
1N/A
1N/Aso that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. For
1N/Aconvenience, perl sets C<$+> to the string held by the highest numbered
1N/AC<$1>, C<$2>, ... that got assigned (and, somewhat related, C<$^N> to the
1N/Avalue of the C<$1>, C<$2>, ... most-recently assigned; i.e. the C<$1>,
1N/AC<$2>, ... associated with the rightmost closing parenthesis used in the
1N/Amatch).
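To see the numbering rule at work, here is a small sketch that matches
two sample strings (chosen only for illustration) against the nested
regexp and saves the matching variables right after each match:

```perl
use strict;
use warnings;

# Case 1: every group takes part in the match.
"abcdgi" =~ /(ab(cd|ef)((gi)|j))/ or die "no match";
my @first = ($1, $2, $3, $4);      # 'abcdgi', 'cd', 'gi', 'gi'

# Case 2: group 4 is skipped because the 'j' alternative is used.
"abefj" =~ /(ab(cd|ef)((gi)|j))/ or die "no match";
my @second = ($1, $2, $3, $4);     # 'abefj', 'ef', 'j', undef
my $last_paren = $+;               # 'j': highest-numbered group that matched

print "first:  @first\n";
print "second: ", join(' ', map { defined $_ ? $_ : 'undef' } @second), "\n";
print "\$+ = $last_paren\n";
```

In the second case C<$4> is undefined, which is why C<$+> falls back
to the contents of C<$3>.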
1N/A
1N/AClosely associated with the matching variables C<$1>, C<$2>, ... are
1N/Athe B<backreferences> C<\1>, C<\2>, ... .  Backreferences are simply
1N/Amatching variables that can be used I<inside> a regexp.  This is a
1N/Areally nice feature - what matches later in a regexp can depend on
1N/Awhat matched earlier in the regexp.  Suppose we wanted to look
1N/Afor doubled words in text, like 'the the'.  The following regexp finds
1N/Aall 3-letter doubles with a space in between:
1N/A
1N/A    /(\w\w\w)\s\1/;
1N/A
The grouping assigns a value to C<\1>, so that the same 3-letter sequence
is used for both parts.  Here are some words with repeated parts:
1N/A
1N/A    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
1N/A    beriberi
1N/A    booboo
1N/A    coco
1N/A    mama
1N/A    murmur
1N/A    papa
1N/A
The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc., and uses C<\1> to look for
a repeat.  Although C<$1> and C<\1> represent the same thing, care should be
taken to use matching variables C<$1>, C<$2>, ... only outside a regexp
and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
so may lead to surprising and/or undefined results.
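As a rough sketch of the division of labor between C<\1> and C<$1>,
the following loop uses the backreference inside the regexp to find a
doubled 3-letter word and the matching variable outside it to report
what was found.  The sample lines are invented for illustration:

```perl
use strict;
use warnings;

my @lines = (
    "Paris in the the spring",
    "the cat sat on the mat",
    "a dog dog barked",
);

my @doubled;
for my $line (@lines) {
    # \1 inside the regexp refers back to whatever group 1 matched
    if ($line =~ /(\w\w\w)\s\1/) {
        push @doubled, $1;    # $1 outside the regexp holds the same text
    }
}
print "doubled: @doubled\n";
```

Note that this regexp has no word boundaries, so it would also fire on
text such as C<bathe then>; it is meant only to show the mechanics.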
1N/A
1N/AIn addition to what was matched, Perl 5.6.0 also provides the
1N/Apositions of what was matched with the C<@-> and C<@+>
1N/Aarrays. C<$-[0]> is the position of the start of the entire match and
1N/AC<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
1N/Aposition of the start of the C<$n> match and C<$+[n]> is the position
1N/Aof the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
1N/Athis code
1N/A
1N/A    $x = "Mmm...donut, thought Homer";
1N/A    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
1N/A    foreach $expr (1..$#-) {
1N/A        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
1N/A    }
1N/A
1N/Aprints
1N/A
1N/A    Match 1: 'Mmm' at position (0,3)
1N/A    Match 2: 'donut' at position (6,11)
1N/A
1N/AEven if there are no groupings in a regexp, it is still possible to
1N/Afind out what exactly matched in a string.  If you use them, perl
1N/Awill set C<$`> to the part of the string before the match, will set C<$&>
1N/Ato the part of the string that matched, and will set C<$'> to the part
1N/Aof the string after the match.  An example:
1N/A
1N/A    $x = "the cat caught the mouse";
1N/A    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
1N/A    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'
1N/A
In the second match, S<C<$` = ''> > because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'.  It is important to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, and C<$&> slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program.  So if raw
performance is a goal of your application, they should be avoided.
If you need them, use C<@-> and C<@+> instead:
1N/A
1N/A    $` is the same as substr( $x, 0, $-[0] )
1N/A    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
1N/A    $' is the same as substr( $x, $+[0] )
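The equivalences can be checked directly.  This sketch reuses the
string from the example above and builds the three pieces with
C<substr>; the variable names C<$before>, C<$matched> and C<$after>
are our own:

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";
my ($before, $matched, $after);

if ($x =~ /cat/) {
    $before  = substr($x, 0, $-[0]);              # like $`
    $matched = substr($x, $-[0], $+[0] - $-[0]);  # like $&
    $after   = substr($x, $+[0]);                 # like $'
}
print "before='$before' match='$matched' after='$after'\n";
```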
1N/A
1N/A=head2 Matching repetitions
1N/A
1N/AThe examples in the previous section display an annoying weakness.  We
1N/Awere only matching 3-letter words, or syllables of 4 letters or
1N/Aless.  We'd like to be able to match words or syllables of any length,
1N/Awithout writing out tedious alternatives like
1N/AC<\w\w\w\w|\w\w\w|\w\w|\w>.
1N/A
1N/AThis is exactly the problem the B<quantifier> metacharacters C<?>,
1N/AC<*>, C<+>, and C<{}> were created for.  They allow us to determine the
1N/Anumber of repeats of a portion of a regexp we consider to be a
1N/Amatch.  Quantifiers are put immediately after the character, character
1N/Aclass, or grouping that we want to specify.  They have the following
1N/Ameanings:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/AC<a?> = match 'a' 1 or 0 times
1N/A
1N/A=item *
1N/A
1N/AC<a*> = match 'a' 0 or more times, i.e., any number of times
1N/A
1N/A=item *
1N/A
1N/AC<a+> = match 'a' 1 or more times, i.e., at least once
1N/A
1N/A=item *
1N/A
1N/AC<a{n,m}> = match at least C<n> times, but not more than C<m>
1N/Atimes.
1N/A
1N/A=item *
1N/A
C<a{n,}> = match at least C<n> times
1N/A
1N/A=item *
1N/A
1N/AC<a{n}> = match exactly C<n> times
1N/A
1N/A=back
1N/A
1N/AHere are some examples:
1N/A
    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/;    # better match; throw out 3 digit dates
    $year =~ /^\d{2}(\d{2})?$/;    # same thing written differently. However,
                                   # this produces $1 and the other does not.
1N/A
1N/A    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
1N/A    beriberi
1N/A    booboo
1N/A    coco
1N/A    mama
1N/A    murmur
1N/A    papa
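A quantified pattern only validates a whole string when it is anchored
at both ends with C<^> and C<$>; otherwise it merely finds a run of
digits somewhere inside the string.  Here is a small sketch of a year
check built that way.  The C<looks_like_year> helper is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: true only for strings of exactly 2 or 4 digits.
sub looks_like_year {
    my ($year) = @_;
    return $year =~ /^\d{2}(\d{2})?$/ ? 1 : 0;
}

for my $candidate ("99", "1999", "199", "19999") {
    printf "%-6s %s\n", $candidate,
        looks_like_year($candidate) ? "accepted" : "rejected";
}
```

Without the anchors, C<19999> would be accepted because its first four
digits satisfy the pattern.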
1N/A
1N/AFor all of these quantifiers, perl will try to match as much of the
1N/Astring as possible, while still allowing the regexp to succeed.  Thus
1N/Awith C</a?.../>, perl will first try to match the regexp with the C<a>
1N/Apresent; if that fails, perl will try to match the regexp without the
1N/AC<a> present.  For the quantifier C<*>, we get the following:
1N/A
1N/A    $x = "the cat in the hat";
1N/A    $x =~ /^(.*)(cat)(.*)$/; # matches,
1N/A                             # $1 = 'the '
1N/A                             # $2 = 'cat'
1N/A                             # $3 = ' in the hat'
1N/A
This is what we might expect: the match finds the only C<cat> in the
string and locks onto it.  Consider, however, this regexp:
1N/A
1N/A    $x =~ /^(.*)(at)(.*)$/; # matches,
1N/A                            # $1 = 'the cat in the h'
1N/A                            # $2 = 'at'
1N/A                            # $3 = ''   (0 matches)
1N/A
One might initially guess that perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>.  Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match.  In
this example, that means matching the C<at> sequence against the final
C<at> in the string.  The other important principle illustrated here is
that when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps.  Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string.  Quantifiers that
grab as much of the string as possible are called B<maximal match> or
B<greedy> quantifiers.
1N/A
1N/AWhen a regexp can match a string in several different ways, we can use
1N/Athe principles above to predict which way the regexp will match:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/APrinciple 0: Taken as a whole, any regexp will be matched at the
1N/Aearliest possible position in the string.
1N/A
1N/A=item *
1N/A
1N/APrinciple 1: In an alternation C<a|b|c...>, the leftmost alternative
1N/Athat allows a match for the whole regexp will be the one used.
1N/A
1N/A=item *
1N/A
1N/APrinciple 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
1N/AC<{n,m}> will in general match as much of the string as possible while
1N/Astill allowing the whole regexp to match.
1N/A
1N/A=item *
1N/A
1N/APrinciple 3: If there are two or more elements in a regexp, the
1N/Aleftmost greedy quantifier, if any, will match as much of the string
1N/Aas possible while still allowing the whole regexp to match.  The next
1N/Aleftmost greedy quantifier, if any, will try to match as much of the
1N/Astring remaining available to it as possible, while still allowing the
1N/Awhole regexp to match.  And so on, until all the regexp elements are
1N/Asatisfied.
1N/A
1N/A=back
1N/A
1N/AAs we have seen above, Principle 0 overrides the others - the regexp
1N/Awill be matched as early as possible, with the other principles
1N/Adetermining how the regexp matches at that earliest character
1N/Aposition.
1N/A
1N/AHere is an example of these principles in action:
1N/A
1N/A    $x = "The programming republic of Perl";
1N/A    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
1N/A                              # $1 = 'The programming republic of Pe'
1N/A                              # $2 = 'r'
1N/A                              # $3 = 'l'
1N/A
1N/AThis regexp matches at the earliest string position, C<'T'>.  One
1N/Amight think that C<e>, being leftmost in the alternation, would be
1N/Amatched, but C<r> produces the longest string in the first quantifier.
1N/A
1N/A    $x =~ /(m{1,2})(.*)$/;  # matches,
1N/A                            # $1 = 'mm'
1N/A                            # $2 = 'ing republic of Perl'
1N/A
Here, the earliest possible match is at the first C<'m'> in
C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.
1N/A
1N/A    $x =~ /.*(m{1,2})(.*)$/;  # matches,
1N/A                              # $1 = 'm'
1N/A                              # $2 = 'ing republic of Perl'
1N/A
1N/AHere, the regexp matches at the start of the string. The first
1N/Aquantifier C<.*> grabs as much as possible, leaving just a single
1N/AC<'m'> for the second quantifier C<m{1,2}>.
1N/A
1N/A    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
1N/A                                # $1 = 'a'
1N/A                                # $2 = 'mm'
1N/A                                # $3 = 'ing republic of Perl'
1N/A
1N/AHere, C<.?> eats its maximal one character at the earliest possible
1N/Aposition in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
1N/Athe opportunity to match both C<m>'s. Finally,
1N/A
1N/A    "aXXXb" =~ /(X*)/; # matches with $1 = ''
1N/A
1N/Abecause it can match zero copies of C<'X'> at the beginning of the
1N/Astring.  If you definitely want to match at least one C<'X'>, use
1N/AC<X+>, not C<X*>.
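The difference is easy to observe by recording both what was captured
and where the match started.  A minimal sketch, with variable names of
our own choosing:

```perl
use strict;
use warnings;

"aXXXb" =~ /(X*)/ or die "unexpected: X* can always match";
my ($star, $star_at) = ($1, $-[0]);   # '' at position 0

"aXXXb" =~ /(X+)/ or die "no X found";
my ($plus, $plus_at) = ($1, $-[0]);   # 'XXX' at position 1

print "X* captured '$star' at $star_at, X+ captured '$plus' at $plus_at\n";
```

C<X*> succeeds immediately at position 0 with nothing captured, while
C<X+> has to advance to position 1 before it can match.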
1N/A
Sometimes greed is not good.  At times, we would like quantifiers to
match a I<minimal> piece of string, rather than a maximal piece.  For
this purpose, Larry Wall created the S<B<minimal match> > or
B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>.  These are
the usual quantifiers with a C<?> appended to them.  They have the
following meanings:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/AC<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
1N/A
1N/A=item *
1N/A
1N/AC<a*?> = match 'a' 0 or more times, i.e., any number of times,
1N/Abut as few times as possible
1N/A
1N/A=item *
1N/A
1N/AC<a+?> = match 'a' 1 or more times, i.e., at least once, but
1N/Aas few times as possible
1N/A
1N/A=item *
1N/A
1N/AC<a{n,m}?> = match at least C<n> times, not more than C<m>
1N/Atimes, as few times as possible
1N/A
1N/A=item *
1N/A
1N/AC<a{n,}?> = match at least C<n> times, but as few times as
1N/Apossible
1N/A
1N/A=item *
1N/A
1N/AC<a{n}?> = match exactly C<n> times.  Because we match exactly
1N/AC<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
1N/Anotational consistency.
1N/A
1N/A=back
1N/A
1N/ALet's look at the example above, but with minimal quantifiers:
1N/A
1N/A    $x = "The programming republic of Perl";
1N/A    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
1N/A                              # $1 = 'Th'
1N/A                              # $2 = 'e'
1N/A                              # $3 = ' programming republic of Perl'
1N/A
1N/AThe minimal string that will allow both the start of the string C<^>
1N/Aand the alternation to match is C<Th>, with the alternation C<e|r>
1N/Amatching C<e>.  The second quantifier C<.*> is free to gobble up the
1N/Arest of the string.
1N/A
1N/A    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
1N/A                              # $1 = 'm'
1N/A                              # $2 = 'ming republic of Perl'
1N/A
1N/AThe first string position that this regexp can match is at the first
1N/AC<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
1N/Amatches just one C<'m'>.  Although the second quantifier C<.*?> would
1N/Aprefer to match no characters, it is constrained by the end-of-string
1N/Aanchor C<$> to match the rest of the string.
1N/A
1N/A    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
1N/A                                  # $1 = 'The progra'
1N/A                                  # $2 = 'm'
1N/A                                  # $3 = 'ming republic of Perl'
1N/A
In this regexp, you might expect the first minimal quantifier C<.*?>
to match the empty string, because it is not constrained by a C<^>
anchor to match the beginning of the string.  Principle 0 applies here,
however.  Because it is possible for the whole regexp to match at the
start of the string, it I<will> match at the start of the string.  Thus
the first quantifier has to match everything up to the first C<m>.  The
second minimal quantifier matches just one C<m> and the third
quantifier matches the rest of the string.
1N/A
1N/A    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
1N/A                                 # $1 = 'a'
1N/A                                 # $2 = 'mm'
1N/A                                 # $3 = 'ing republic of Perl'
1N/A
1N/AJust as in the previous regexp, the first quantifier C<.??> can match
1N/Aearliest at position C<'a'>, so it does.  The second quantifier is
1N/Agreedy, so it matches C<mm>, and the third matches the rest of the
1N/Astring.
1N/A
1N/AWe can modify principle 3 above to take into account non-greedy
1N/Aquantifiers:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/APrinciple 3: If there are two or more elements in a regexp, the
1N/Aleftmost greedy (non-greedy) quantifier, if any, will match as much
1N/A(little) of the string as possible while still allowing the whole
1N/Aregexp to match.  The next leftmost greedy (non-greedy) quantifier, if
1N/Aany, will try to match as much (little) of the string remaining
1N/Aavailable to it as possible, while still allowing the whole regexp to
1N/Amatch.  And so on, until all the regexp elements are satisfied.
1N/A
1N/A=back
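A common place where the greedy and non-greedy forms give visibly
different answers is text with repeated delimiters.  The following
sketch, using a made-up snippet of markup, applies the same regexp
with C<.+> and with C<.+?>:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

my ($greedy) = ($html =~ /<(.+)>/);    # runs to the last '>' in the string
my ($lazy)   = ($html =~ /<(.+?)>/);   # stops at the first '>' it can

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both matches start at the first C<< < >> (Principle 0); only the amount
taken by the quantifier differs.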
1N/A
1N/AJust like alternation, quantifiers are also susceptible to
1N/Abacktracking.  Here is a step-by-step analysis of the example
1N/A
1N/A    $x = "the cat in the hat";
1N/A    $x =~ /^(.*)(at)(.*)$/; # matches,
1N/A                            # $1 = 'the cat in the h'
1N/A                            # $2 = 'at'
1N/A                            # $3 = ''   (0 matches)
1N/A
1N/A=over 4
1N/A
1N/A=item 0
1N/A
1N/AStart with the first letter in the string 't'.
1N/A
1N/A=item 1
1N/A
1N/AThe first quantifier '.*' starts out by matching the whole
1N/Astring 'the cat in the hat'.
1N/A
1N/A=item 2
1N/A
1N/A'a' in the regexp element 'at' doesn't match the end of the
1N/Astring.  Backtrack one character.
1N/A
1N/A=item 3
1N/A
1N/A'a' in the regexp element 'at' still doesn't match the last
1N/Aletter of the string 't', so backtrack one more character.
1N/A
1N/A=item 4
1N/A
1N/ANow we can match the 'a' and the 't'.
1N/A
1N/A=item 5
1N/A
1N/AMove on to the third element '.*'.  Since we are at the end of
1N/Athe string and '.*' can match 0 times, assign it the empty string.
1N/A
1N/A=item 6
1N/A
1N/AWe are done!
1N/A
1N/A=back
1N/A
Most of the time, all this moving forward and backtracking happens
quickly and searching is fast.  There are some pathological regexps,
however, whose execution time grows exponentially with the size of the
string.  A typical structure that blows up in your face is of the form
1N/A
1N/A    /(a|b+)*/;
1N/A
The problem is the nested indeterminate quantifiers.  There are many
different ways of partitioning a string of length n between the C<+>
and C<*>: one repetition with C<b+> of length n, two repetitions with
the first C<b+> length k and the second with length n-k, m repetitions
whose bits add up to length n, etc.  In fact, the number of ways to
partition a string grows exponentially with its length.  A
regexp may get lucky and match early in the process, but if there is
no match, perl will try I<every> possibility before giving up.  So be
careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s.  The book
I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
discussion of this and other efficiency issues.
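To get a feel for how quickly the possibilities multiply, the sketch
below simply counts the ways a run of n identical characters can be
cut into consecutive non-empty pieces, each piece standing for one
C<b+>.  It is a counting exercise only: the C<count_splits> function
is our own, and it does not measure what the regexp engine actually
does, which may apply optimizations of its own.

```perl
use strict;
use warnings;

# Count the ways a run of $n identical characters can be cut into one or
# more non-empty consecutive pieces (each piece standing for one b+).
sub count_splits {
    my ($n) = @_;
    return 1 if $n == 0;            # nothing left: one completed split
    my $ways = 0;
    for my $first (1 .. $n) {       # length taken by the first piece
        $ways += count_splits($n - $first);
    }
    return $ways;
}

printf "n=%2d  ways=%d\n", $_, count_splits($_) for (1, 2, 4, 8, 16);
```

The count doubles with every extra character, which is the
exponential growth described above.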
1N/A
1N/A=head2 Building a regexp
1N/A
1N/AAt this point, we have all the basic regexp concepts covered, so let's
1N/Agive a more involved example of a regular expression.  We will build a
1N/Aregexp that matches numbers.
1N/A
1N/AThe first task in building a regexp is to decide what we want to match
1N/Aand what we want to exclude.  In our case, we want to match both
1N/Aintegers and floating point numbers and we want to reject any string
1N/Athat isn't a number.
1N/A
1N/AThe next task is to break the problem down into smaller problems that
1N/Aare easily converted into a regexp.
1N/A
1N/AThe simplest case is integers.  These consist of a sequence of digits,
1N/Awith an optional sign in front.  The digits we can represent with
1N/AC<\d+> and the sign can be matched with C<[+-]>.  Thus the integer
1N/Aregexp is
1N/A
1N/A    /[+-]?\d+/;  # matches integers
1N/A
1N/AA floating point number potentially has a sign, an integral part, a
1N/Adecimal point, a fractional part, and an exponent.  One or more of these
1N/Aparts is optional, so we need to check out the different
1N/Apossibilities.  Floating point numbers which are in proper form include
1N/A123., 0.345, .34, -1e6, and 25.4E-72.  As with integers, the sign out
1N/Afront is completely optional and can be matched by C<[+-]?>.  We can
1N/Asee that if there is no exponent, floating point numbers must have a
1N/Adecimal point, otherwise they are integers.  We might be tempted to
1N/Amodel these with C<\d*\.\d*>, but this would also match just a single
1N/Adecimal point, which is not a number.  So the three cases of floating
1N/Apoint number sans exponent are
1N/A
1N/A   /[+-]?\d+\./;  # 1., 321., etc.
1N/A   /[+-]?\.\d+/;  # .1, .234, etc.
1N/A   /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.
1N/A
1N/AThese can be combined into a single regexp with a three-way alternation:
1N/A
1N/A   /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent
1N/A
1N/AIn this alternation, it is important to put C<'\d+\.\d+'> before
1N/AC<'\d+\.'>.  If C<'\d+\.'> were first, the regexp would happily match that
1N/Aand ignore the fractional part of the number.
1N/A
1N/ANow consider floating point numbers with exponents.  The key
1N/Aobservation here is that I<both> integers and numbers with decimal
1N/Apoints are allowed in front of an exponent.  Then exponents, like the
1N/Aoverall sign, are independent of whether we are matching numbers with
1N/Aor without decimal points, and can be 'decoupled' from the
1N/Amantissa.  The overall form of the regexp now becomes clear:
1N/A
1N/A    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
1N/A
1N/AThe exponent is an C<e> or C<E>, followed by an integer.  So the
1N/Aexponent regexp is
1N/A
1N/A   /[eE][+-]?\d+/;  # exponent
1N/A
1N/APutting all the parts together, we get a regexp that matches numbers:
1N/A
1N/A   /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!
1N/A
Long regexps like this may impress your friends, but can be hard to
decipher.  In complex situations like this, the C<//x> modifier for a
match is invaluable.  It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning.  Using it,
we can rewrite our 'extended' regexp in the more pleasing form
1N/A
1N/A   /^
1N/A      [+-]?         # first, match an optional sign
1N/A      (             # then match integers or f.p. mantissas:
1N/A          \d+\.\d+  # mantissa of the form a.b
1N/A         |\d+\.     # mantissa of the form a.
1N/A         |\.\d+     # mantissa of the form .b
1N/A         |\d+       # integer of the form a
1N/A      )
1N/A      ([eE][+-]?\d+)?  # finally, optionally match an exponent
1N/A   $/x;
1N/A
If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp? The answer is to backslash it
S<C<'\ '> > or put it in a character class S<C<[ ]> >.  The same thing
goes for pound signs: use C<\#> or C<[#]>.  For instance, Perl allows
a space between the sign and the mantissa/integer, and we could add
this to our regexp as follows:
1N/A
1N/A   /^
1N/A      [+-]?\ *      # first, match an optional sign *and space*
1N/A      (             # then match integers or f.p. mantissas:
1N/A          \d+\.\d+  # mantissa of the form a.b
1N/A         |\d+\.     # mantissa of the form a.
1N/A         |\.\d+     # mantissa of the form .b
1N/A         |\d+       # integer of the form a
1N/A      )
1N/A      ([eE][+-]?\d+)?  # finally, optionally match an exponent
1N/A   $/x;
1N/A
1N/AIn this form, it is easier to see a way to simplify the
1N/Aalternation.  Alternatives 1, 2, and 4 all start with C<\d+>, so it
1N/Acould be factored out:
1N/A
1N/A   /^
1N/A      [+-]?\ *      # first, match an optional sign
1N/A      (             # then match integers or f.p. mantissas:
1N/A          \d+       # start out with a ...
1N/A          (
1N/A              \.\d* # mantissa of the form a.b or a.
1N/A          )?        # ? takes care of integers of the form a
1N/A         |\.\d+     # mantissa of the form .b
1N/A      )
1N/A      ([eE][+-]?\d+)?  # finally, optionally match an exponent
1N/A   $/x;
1N/A
1N/Aor written in the compact form,
1N/A
1N/A    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
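A regexp like this is worth exercising against both strings it should
accept and strings it should reject.  The sketch below stores the
compact form with C<qr//> (the quoting operator mentioned in the
introduction) and filters two hand-picked lists; the test strings
beyond those listed earlier in this section are our own additions:

```perl
use strict;
use warnings;

my $number = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

# Strings that should be recognized as numbers.
my @accepted = grep { $_ =~ $number }
    ("123.", "0.345", ".34", "-1e6", "25.4E-72", "+ 7");

# Strings that should be turned away.
my @rejected = grep { $_ !~ $number }
    (".", "1e", "abc", "1.2.3", "--1");

print "accepted: @accepted\n";
print "rejected: @rejected\n";
```

If every string in the first list is accepted and every string in the
second is rejected, both C<grep> calls return their full input lists.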
1N/A
1N/AThis is our final regexp.  To recap, we built a regexp by
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/Aspecifying the task in detail,
1N/A
1N/A=item *
1N/A
1N/Abreaking down the problem into smaller parts,
1N/A
1N/A=item *
1N/A
1N/Atranslating the small parts into regexps,
1N/A
1N/A=item *
1N/A
1N/Acombining the regexps,
1N/A
1N/A=item *
1N/A
1N/Aand optimizing the final combined regexp.
1N/A
1N/A=back
1N/A
These are also the typical steps involved in writing a computer
program.  This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that
specifies patterns.
1N/A
1N/A=head2 Using regular expressions in Perl
1N/A
1N/AThe last topic of Part 1 briefly covers how regexps are used in Perl
1N/Aprograms.  Where do they fit into Perl syntax?
1N/A
1N/AWe have already introduced the matching operator in its default
1N/AC</regexp/> and arbitrary delimiter C<m!regexp!> forms.  We have used
1N/Athe binding operator C<=~> and its negation C<!~> to test for string
1N/Amatches.  Associated with the matching operator, we have discussed the
1N/Asingle line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
1N/Aextended C<//x> modifiers.
1N/A
1N/AThere are a few more things you might want to know about matching
1N/Aoperators.  First, we pointed out earlier that variables in regexps are
1N/Asubstituted before the regexp is evaluated:
1N/A
1N/A    $pattern = 'Seuss';
1N/A    while (<>) {
1N/A        print if /$pattern/;
1N/A    }
1N/A
1N/AThis will print any lines containing the word C<Seuss>.  It is not as
1N/Aefficient as it could be, however, because perl has to re-evaluate
1N/AC<$pattern> each time through the loop.  If C<$pattern> won't be
1N/Achanging over the lifetime of the script, we can add the C<//o>
1N/Amodifier, which directs perl to only perform variable substitutions
1N/Aonce:
1N/A
1N/A    #!/usr/bin/perl
1N/A    #    Improved simple_grep
1N/A    $regexp = shift;
1N/A    while (<>) {
1N/A        print if /$regexp/o;  # a good deal faster
1N/A    }
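Another way to avoid repeating the variable substitution is to build
the regexp once with C<qr//> and reuse the resulting object.  This is
a sketch of that alternative, not a replacement for the script above;
the C<@lines> array stands in for the input file:

```perl
use strict;
use warnings;

my $pattern = 'Seuss';
my $regexp  = qr/$pattern/;    # interpolate and compile once

my @lines = (
    "Dr. Seuss wrote this\n",
    "nothing to see here\n",
    "Seuss again\n",
);

my @hits = grep { $_ =~ $regexp } @lines;   # reuse the compiled regexp
print @hits;
```

Unlike C<//o>, a later change to C<$pattern> can be picked up simply
by calling C<qr//> again.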
1N/A
If you change C<$regexp> after the first substitution happens, perl
will ignore the change.  If you don't want any substitutions at all, use the
special delimiter C<m''>:
1N/A
1N/A    @pattern = ('Seuss');
1N/A    while (<>) {
1N/A        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
1N/A    }
1N/A
1N/AC<m''> acts like single quotes on a regexp; all other C<m> delimiters
1N/Aact like double quotes.  If the regexp evaluates to the empty string,
1N/Athe regexp in the I<last successful match> is used instead.  So we have
1N/A
    "dog" =~ /d/;  # 'd' matches
    "dogbert" =~ //;  # this matches the 'd' regexp used before
1N/A
The final two modifiers C<//g> and C<//c> concern multiple matches.
The modifier C<//g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C<//g> jump from match to match, keeping track of position in the
string as it goes along.  You can get or set the position with the
C<pos()> function.
1N/A
1N/AThe use of C<//g> is shown in the following example.  Suppose we have
1N/Aa string that consists of words separated by spaces.  If we know how
1N/Amany words there are in advance, we could extract the words using
1N/Agroupings:
1N/A
1N/A    $x = "cat dog house"; # 3 words
1N/A    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
1N/A                                           # $1 = 'cat'
1N/A                                           # $2 = 'dog'
1N/A                                           # $3 = 'house'
1N/A
1N/ABut what if we had an indeterminate number of words? This is the sort
1N/Aof task C<//g> was made for.  To extract all words, form the simple
1N/Aregexp C<(\w+)> and loop over all matches with C</(\w+)/g>:
1N/A
1N/A    while ($x =~ /(\w+)/g) {
1N/A        print "Word is $1, ends at position ", pos $x, "\n";
1N/A    }
1N/A
1N/Aprints
1N/A
1N/A    Word is cat, ends at position 3
1N/A    Word is dog, ends at position 7
1N/A    Word is house, ends at position 13
1N/A
A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>.  The current position in the
string is associated with the string, not the regexp.  This means that
different strings have different positions and their respective
positions can be set or read independently.
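The effect of C<//c> on the saved position can be observed with
C<pos()>.  A minimal sketch, using variable names of our own:

```perl
use strict;
use warnings;

my $x = "cat dog house";

$x =~ /\w+/g;                 # matches 'cat'
my $after_match = pos $x;     # 3

$x =~ /\d+/gc;                # fails, but //c keeps the position
my $after_gc_fail = pos $x;   # still 3

$x =~ /\d+/g;                 # fails without //c: position is reset
my $after_g_fail = pos $x;    # undef

print "after match: $after_match\n";
print "after failed //gc: $after_gc_fail\n";
print "after failed //g: ",
    defined $after_g_fail ? $after_g_fail : 'undef', "\n";
```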
1N/A
1N/AIn list context, C<//g> returns a list of matched groupings, or if
1N/Athere are no groupings, a list of matches to the whole regexp.  So if
1N/Awe wanted just the words, we could use
1N/A
    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'
1N/A
1N/AClosely associated with the C<//g> modifier is the C<\G> anchor.  The
1N/AC<\G> anchor matches at the point where the previous C<//g> match left
1N/Aoff.  C<\G> allows us to easily do context-sensitive matching:
1N/A
1N/A    $metric = 1;  # use metric units
1N/A    ...
1N/A    $x = <FILE>;  # read in measurement
1N/A    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
1N/A    $weight = $1;
1N/A    if ($metric) { # error checking
1N/A        print "Units error!" unless $x =~ /\Gkg\./g;
1N/A    }
1N/A    else {
1N/A        print "Units error!" unless $x =~ /\Glbs\./g;
1N/A    }
1N/A    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing
1N/A
1N/AThe combination of C<//g> and C<\G> allows us to process the string a
1N/Abit at a time and use arbitrary Perl logic to decide what to do next.
1N/ACurrently, the C<\G> anchor is only fully supported when used to anchor
1N/Ato the start of the pattern.
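One way to put this "bit at a time" style to work is a tiny tokenizer
that tries several C<\G>-anchored regexps in turn, using C<//gc> so
that a failed attempt does not lose the position.  This is only a
sketch; the input string and the C<NUM>, C<ID> and C<OP> labels are
invented for illustration:

```perl
use strict;
use warnings;

my $s = "x = 42 + y";
my @tokens;

pos($s) = 0;                               # start scanning at the beginning
while (pos($s) < length $s) {
    if    ($s =~ /\G\s+/gc)             { next }                     # skip blanks
    elsif ($s =~ /\G(\d+)/gc)           { push @tokens, "NUM:$1" }
    elsif ($s =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "ID:$1"  }
    elsif ($s =~ /\G(.)/gcs)            { push @tokens, "OP:$1"  }   # anything else
}
print join(' ', @tokens), "\n";
```

Each branch either consumes some of the string and advances C<pos($s)>
or fails and leaves it alone, so the loop always makes progress.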
1N/A
1N/AC<\G> is also invaluable in processing fixed length records with
1N/Aregexps.  Suppose we have a snippet of coding region DNA, encoded as
1N/Abase pair letters C<ATCGTTGAAT...> and we want to find all the stop
1N/Acodons C<TGA>.  In a coding region, codons are 3-letter sequences, so
1N/Awe can think of the DNA snippet as a sequence of 3-letter records.  The
1N/Anaive regexp
1N/A
1N/A    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
1N/A    $dna = "ATCGTTGAATGCAAATGACATGAC";
1N/A    $dna =~ /TGA/;
1N/A
1N/Adoesn't work; it may match a C<TGA>, but there is no guarantee that
1N/Athe match is aligned with codon boundaries, e.g., the substring
1N/AS<C<GTT GAA> > gives a match.  A better solution is
1N/A
1N/A    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
1N/A        print "Got a TGA stop codon at position ", pos $dna, "\n";
1N/A    }
1N/A
1N/Awhich prints
1N/A
1N/A    Got a TGA stop codon at position 18
1N/A    Got a TGA stop codon at position 23
1N/A
1N/APosition 18 is good, but position 23 is bogus.  What happened?
1N/A
1N/AThe answer is that our regexp works well until we get past the last
1N/Areal match.  Then the regexp will fail to match a synchronized C<TGA>
1N/Aand start stepping ahead one character position at a time, not what we
1N/Awant.  The solution is to use C<\G> to anchor the match to the codon
1N/Aalignment:
1N/A
1N/A    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
1N/A        print "Got a TGA stop codon at position ", pos $dna, "\n";
1N/A    }
1N/A
1N/AThis prints
1N/A
1N/A    Got a TGA stop codon at position 18
1N/A
1N/Awhich is the correct answer.  This example illustrates that it is
1N/Aimportant not only to match what is desired, but to reject what is not
1N/Adesired.
1N/A
1N/AB<search and replace>
1N/A
1N/ARegular expressions also play a big role in B<search and replace>
1N/Aoperations in Perl.  Search and replace is accomplished with the
1N/AC<s///> operator.  The general form is
1N/AC<s/regexp/replacement/modifiers>, with everything we know about
1N/Aregexps and modifiers applying in this case as well.  The
1N/AC<replacement> is a Perl double quoted string that replaces in the
1N/Astring whatever is matched with the C<regexp>.  The operator C<=~> is
1N/Aalso used here to associate a string with C<s///>.  If matching
1N/Aagainst C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
1N/AC<s///> returns the number of substitutions made, otherwise it returns
1N/Afalse.  Here are a few examples:
1N/A
1N/A    $x = "Time to feed the cat!";
1N/A    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
1N/A    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
1N/A        $more_insistent = 1;
1N/A    }
1N/A    $y = "'quoted words'";
1N/A    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
1N/A                           # $y contains "quoted words"
1N/A
1N/AIn the last example, the whole string was matched, but only the part
1N/Ainside the single quotes was grouped.  With the C<s///> operator, the
1N/Amatched variables C<$1>, C<$2>, etc.  are immediately available for use
1N/Ain the replacement expression, so we use C<$1> to replace the quoted
1N/Astring with just what was quoted.  With the global modifier, C<s///g>
1N/Awill search and replace all occurrences of the regexp in the string:
1N/A
1N/A    $x = "I batted 4 for 4";
1N/A    $x =~ s/4/four/;   # doesn't do it all:
1N/A                       # $x contains "I batted four for 4"
1N/A    $x = "I batted 4 for 4";
1N/A    $x =~ s/4/four/g;  # does it all:
1N/A                       # $x contains "I batted four for four"
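Since C<s///> returns the number of substitutions made, the return
value can be used both as a count and as a success test.  A short
sketch with variable names of our own:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";
my $count = ($x =~ s/4/four/g);      # number of replacements made

my $y = "no digits here";
my $none = ($y =~ s/4/four/g);       # false: nothing was replaced

print "$count substitutions: $x\n";
print "unchanged: $y\n" unless $none;
```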
1N/A
1N/AIf you prefer 'regex' over 'regexp' in this tutorial, you could use
1N/Athe following program to replace it:
1N/A
1N/A    % cat > simple_replace
1N/A    #!/usr/bin/perl
1N/A    $regexp = shift;
1N/A    $replacement = shift;
1N/A    while (<>) {
1N/A        s/$regexp/$replacement/go;
1N/A        print;
1N/A    }
1N/A    ^D
1N/A
1N/A    % simple_replace regexp regex perlretut.pod
1N/A
1N/AIn C<simple_replace> we used the C<s///g> modifier to replace all
1N/Aoccurrences of the regexp on each line and the C<s///o> modifier to
1N/Acompile the regexp only once.  As with C<simple_grep>, both the
1N/AC<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly.
1N/A
1N/AA modifier available specifically to search and replace is the
1N/AC<s///e> evaluation modifier.  C<s///e> wraps an C<eval{...}> around
1N/Athe replacement string and the evaluated result is substituted for the
1N/Amatched substring.  C<s///e> is useful if you need to do a bit of
1N/Acomputation in the process of replacing text.  This example counts
1N/Acharacter frequencies in a line:
1N/A
1N/A    $x = "Bill the cat";
1N/A    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
1N/A    print "frequency of '$_' is $chars{$_}\n"
1N/A        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
1N/A
1N/AThis prints
1N/A
1N/A    frequency of ' ' is 2
1N/A    frequency of 't' is 2
1N/A    frequency of 'l' is 2
1N/A    frequency of 'B' is 1
1N/A    frequency of 'c' is 1
1N/A    frequency of 'e' is 1
1N/A    frequency of 'h' is 1
1N/A    frequency of 'i' is 1
1N/A    frequency of 'a' is 1
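
As a further illustration of C<s///e>, here is a small sketch that does
arithmetic on each match.  The sample string and the variable names
C<$price> and C<$doubled> are made up for the example:

```perl
use strict;
use warnings;

my $price = "apples 3, pears 12";

# With //e the replacement is Perl code; its value replaces each match.
# The (my $copy = $orig) =~ s/// idiom edits a copy, not the original.
(my $doubled = $price) =~ s/(\d+)/$1 * 2/eg;

print "$doubled\n";
```

Without the C<//e> modifier the replacement would be the literal text
C<3 * 2>, since C<$1 * 2> would be treated as a double quoted string.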
1N/A
As with the match C<m//> operator, C<s///> can use other delimiters,
such as C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are
used, as in C<s'''>, then the regexp and replacement are treated as
single quoted strings and no variables are interpolated.  C<s///> in
list context returns the same thing as in scalar context, i.e., the
number of matches.
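
A short sketch of two of these delimiter choices follows.  The paths and
variable names are only illustrative:

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";

# Braces as delimiters: the slashes in the pattern need no escaping.
(my $local = $path) =~ s{^/usr/bin}{/usr/local/bin};
print "$local\n";

# Single quotes as delimiters: the replacement is not interpolated,
# so it is the two literal characters '$' and '1'.
my $literal = "the cat";
$literal =~ s'cat'$1';
print "$literal\n";
```

Picking a delimiter that does not occur in the pattern is usually the
easiest way to keep a substitution readable.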
1N/A
1N/AB<The split operator>
1N/A
1N/AThe B<C<split> > function can also optionally use a matching operator
1N/AC<m//> to split a string.  C<split /regexp/, string, limit> splits
1N/AC<string> into a list of substrings and returns that list.  The regexp
1N/Ais used to match the character sequence that the C<string> is split
1N/Awith respect to.  The C<limit>, if present, constrains splitting into
1N/Ano more than C<limit> number of strings.  For example, to split a
1N/Astring into words, use
1N/A
1N/A    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'
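
The effect of the C<limit> argument can be sketched as follows.  The
colon-separated record is a made-up sample in the style of a password
file entry:

```perl
use strict;
use warnings;

my $record = "root:x:0:0:root:/root:/bin/sh";

# A limit of 3 yields at most three fields; the last one keeps the
# remainder of the string, separators included.
my @fields = split /:/, $record, 3;

print scalar(@fields), " fields, last is '$fields[2]'\n";
```

This is handy when only the leading fields are of interest and the rest
of the line should be left untouched.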
1N/A
If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters.  If the regexp has
groupings, then the list produced contains the matched substrings from
the groupings as well.  For instance,
1N/A
1N/A    $x = "/usr/bin/perl";
1N/A    @dirs = split m!/!, $x;  # $dirs[0] = ''
1N/A                             # $dirs[1] = 'usr'
1N/A                             # $dirs[2] = 'bin'
1N/A                             # $dirs[3] = 'perl'
1N/A    @parts = split m!(/)!, $x;  # $parts[0] = ''
1N/A                                # $parts[1] = '/'
1N/A                                # $parts[2] = 'usr'
1N/A                                # $parts[3] = '/'
1N/A                                # $parts[4] = 'bin'
1N/A                                # $parts[5] = '/'
1N/A                                # $parts[6] = 'perl'
1N/A
Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.
1N/A
1N/AIf you have read this far, congratulations! You now have all the basic
1N/Atools needed to use regular expressions to solve a wide range of text
1N/Aprocessing problems.  If this is your first time through the tutorial,
1N/Awhy not stop here and play around with regexps a while...  S<Part 2>
1N/Aconcerns the more esoteric aspects of regular expressions and those
1N/Aconcepts certainly aren't needed right at the start.
1N/A
1N/A=head1 Part 2: Power tools
1N/A
1N/AOK, you know the basics of regexps and you want to know more.  If
1N/Amatching regular expressions is analogous to a walk in the woods, then
1N/Athe tools discussed in Part 1 are analogous to topo maps and a
compass, basic tools we use all the time.  Most of the tools in Part 2
1N/Aare analogous to flare guns and satellite phones.  They aren't used
1N/Atoo often on a hike, but when we are stuck, they can be invaluable.
1N/A
1N/AWhat follows are the more advanced, less used, or sometimes esoteric
1N/Acapabilities of perl regexps.  In Part 2, we will assume you are
1N/Acomfortable with the basics and concentrate on the new features.
1N/A
1N/A=head2 More on characters, strings, and character classes
1N/A
1N/AThere are a number of escape sequences and character classes that we
1N/Ahaven't covered yet.
1N/A
1N/AThere are several escape sequences that convert characters or strings
1N/Abetween upper and lower case.  C<\l> and C<\u> convert the next
1N/Acharacter to lower or upper case, respectively:
1N/A
1N/A    $x = "perl";
1N/A    $string =~ /\u$x/;  # matches 'Perl' in $string
1N/A    $x = "M(rs?|s)\\."; # note the double backslash
    $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'
1N/A
C<\L> and C<\U> convert a whole substring, delimited by C<\L> or
C<\U> and C<\E>, to lower or upper case:
1N/A
1N/A    $x = "This word is in lower case:\L SHOUT\E";
1N/A    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
1N/A    $x =~ /\Ukeypunch/;  # matches punch card string
1N/A
1N/AIf there is no C<\E>, case is converted until the end of the
1N/Astring. The regexps C<\L\u$word> or C<\u\L$word> convert the first
1N/Acharacter of C<$word> to uppercase and the rest of the characters to
1N/Alowercase.
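
The C<\u\L> combination can be sketched like this.  The variable names
C<$word>, C<$fixed> and C<$found> are hypothetical:

```perl
use strict;
use warnings;

my $word = "pERL";

# \L lowercases everything up to \E; \u uppercases the first character.
my $fixed = "\u\L$word\E";
print "$fixed\n";

# Inside a regexp the same escapes apply to the interpolated text.
my $found = ("I like Perl" =~ /\u\L$word\E/) ? 1 : 0;
print "found\n" if $found;
```

The case conversion happens when the variable is interpolated, so the
regexp engine only ever sees the already converted text.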
1N/A
Control characters can be escaped with C<\c>, so that a control-Z
character would be matched with C<\cZ>.  The escape sequence
C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters.  For
instance,
1N/A
    $x = "That !^*&%~& cat!";
1N/A    $x =~ /\Q!^*&%~&\E/;  # check for rough language
1N/A
1N/AIt does not protect C<$> or C<@>, so that variables can still be
1N/Asubstituted.
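
Since variables are still interpolated, C<\Q> is commonly used to match
the contents of a variable literally.  A minimal sketch, with made-up
sample strings:

```perl
use strict;
use warnings;

my $needle = "3.14 (approx)";   # contains regexp metacharacters
my $text   = "pi is 3.14 (approx), not 3x14";

# \Q...\E escapes the metacharacters in the interpolated value,
# so '.' and '(' match only themselves.
my $quoted = ($text =~ /\Q$needle\E/) ? 1 : 0;

# Without \Q, '.' matches any character and '(approx)' is a grouping,
# so a rather different string matches as well.
my $unquoted = ("3x14 approx" =~ /$needle/) ? 1 : 0;
my $strict   = ("3x14 approx" =~ /\Q$needle\E/) ? 1 : 0;

print "quoted=$quoted unquoted=$unquoted strict=$strict\n";
```

Quoting is worth the extra characters whenever the text to be matched
comes from user input or a file.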
1N/A
1N/AWith the advent of 5.6.0, perl regexps can handle more than just the
1N/Astandard ASCII character set.  Perl now supports B<Unicode>, a standard
1N/Afor encoding the character sets from many of the world's written
1N/Alanguages.  Unicode does this by allowing characters to be more than
1N/Aone byte wide.  Perl uses the UTF-8 encoding, in which ASCII characters
1N/Aare still encoded as one byte, but characters greater than C<chr(127)>
1N/Amay be stored as two or more bytes.
1N/A
1N/AWhat does this mean for regexps? Well, regexp users don't need to know
1N/Amuch about perl's internal representation of strings.  But they do need
1N/Ato know 1) how to represent Unicode characters in a regexp and 2) when
1N/Aa matching operation will treat the string to be searched as a
1N/Asequence of bytes (the old way) or as a sequence of Unicode characters
1N/A(the new way).  The answer to 1) is that Unicode characters greater
1N/Athan C<chr(127)> may be represented using the C<\x{hex}> notation,
1N/Awith C<hex> a hexadecimal integer:
1N/A
1N/A    /\x{263a}/;  # match a Unicode smiley face :)
1N/A
1N/AUnicode characters in the range of 128-255 use two hexadecimal digits
1N/Awith braces: C<\x{ab}>.  Note that this is different than C<\xab>,
1N/Awhich is just a hexadecimal byte with no Unicode significance.
1N/A
B<NOTE>: in Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features.  This is no longer the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
needed.  (The only case where it matters is when your Perl script
itself is written in Unicode and encoded in UTF-8; then an explicit
C<use utf8> is needed.)
1N/A
1N/AFiguring out the hexadecimal sequence of a Unicode character you want
1N/Aor deciphering someone else's hexadecimal Unicode regexp is about as
1N/Amuch fun as programming in machine code.  So another way to specify
1N/AUnicode characters is to use the S<B<named character> > escape
1N/Asequence C<\N{name}>.  C<name> is a name for the Unicode character, as
1N/Aspecified in the Unicode standard.  For instance, if we wanted to
1N/Arepresent or match the astrological sign for the planet Mercury, we
1N/Acould use
1N/A
1N/A    use charnames ":full"; # use named chars with Unicode full names
1N/A    $x = "abc\N{MERCURY}def";
1N/A    $x =~ /\N{MERCURY}/;   # matches
1N/A
1N/AOne can also use short names or restrict names to a certain alphabet:
1N/A
1N/A    use charnames ':full';
1N/A    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
1N/A
1N/A    use charnames ":short";
1N/A    print "\N{greek:Sigma} is an upper-case sigma.\n";
1N/A
1N/A    use charnames qw(greek);
1N/A    print "\N{sigma} is Greek sigma\n";
1N/A
1N/AA list of full names is found in the file Names.txt in the
1N/Alib/perl5/5.X.X/unicore directory.
1N/A
1N/AThe answer to requirement 2), as of 5.6.0, is that if a regexp
1N/Acontains Unicode characters, the string is searched as a sequence of
1N/AUnicode characters.  Otherwise, the string is searched as a sequence of
1N/Abytes.  If the string is being searched as a sequence of Unicode
1N/Acharacters, but matching a single byte is required, we can use the C<\C>
1N/Aescape sequence.  C<\C> is a character class akin to C<.> except that
1N/Ait matches I<any> byte 0-255.  So
1N/A
1N/A    use charnames ":full"; # use named chars with Unicode full names
1N/A    $x = "a";
1N/A    $x =~ /\C/;  # matches 'a', eats one byte
1N/A    $x = "";
1N/A    $x =~ /\C/;  # doesn't match, no bytes to match
    $x = "\N{MERCURY}";  # multi-byte Unicode character
1N/A    $x =~ /\C/;  # matches, but dangerous!
1N/A
1N/AThe last regexp matches, but is dangerous because the string
1N/AI<character> position is no longer synchronized to the string I<byte>
1N/Aposition.  This generates the warning 'Malformed UTF-8
1N/Acharacter'.  The C<\C> is best used for matching the binary data in strings
1N/Awith binary data intermixed with Unicode characters.
1N/A
1N/ALet us now discuss the rest of the character classes.  Just as with
1N/AUnicode characters, there are named Unicode character classes
1N/Arepresented by the C<\p{name}> escape sequence.  Closely associated is
1N/Athe C<\P{name}> character class, which is the negation of the
1N/AC<\p{name}> class.  For example, to match lower and uppercase
1N/Acharacters,
1N/A
1N/A    use charnames ":full"; # use named chars with Unicode full names
1N/A    $x = "BOB";
1N/A    $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
1N/A    $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
1N/A    $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
1N/A    $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase
1N/A
1N/AHere is the association between some Perl named classes and the
1N/Atraditional Unicode classes:
1N/A
1N/A    Perl class name  Unicode class name or regular expression
1N/A
1N/A    IsAlpha          /^[LM]/
1N/A    IsAlnum          /^[LMN]/
1N/A    IsASCII          $code <= 127
1N/A    IsCntrl          /^C/
1N/A    IsBlank          $code =~ /^(0020|0009)$/ || /^Z[^lp]/
1N/A    IsDigit          Nd
1N/A    IsGraph          /^([LMNPS]|Co)/
1N/A    IsLower          Ll
1N/A    IsPrint          /^([LMNPS]|Co|Zs)/
1N/A    IsPunct          /^P/
    IsSpace          /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
    IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/)
1N/A    IsUpper          /^L[ut]/
1N/A    IsWord           /^[LMN]/ || $code eq "005F"
1N/A    IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/
1N/A
1N/AYou can also use the official Unicode class names with the C<\p> and
1N/AC<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
1N/Aletters, or C<\P{Nd}> for non-digits.  If a C<name> is just one
1N/Aletter, the braces can be dropped.  For instance, C<\pM> is the
1N/Acharacter class of Unicode 'marks', for example accent marks.
1N/AFor the full list see L<perlunicode>.
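
A brief sketch of the official class names in use.  The sample string
and variable names are only for illustration:

```perl
use strict;
use warnings;

my $x = "BOB42";

my $starts_upper = ($x =~ /^\p{Lu}/) ? 1 : 0;  # 'B' is an uppercase letter
my ($letters)    = $x =~ /^(\pL+)/;            # one-letter name, no braces
my ($rest)       = $x =~ /(\P{L}+)$/;          # trailing run of non-letters

print "upper=$starts_upper letters=$letters rest=$rest\n";
```

The negated form C<\P{L}> is often more convenient than listing every
non-letter class explicitly.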
1N/A
Unicode has also been separated into various sets of characters
which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
For the full list see L<perlunicode>.
1N/A
1N/AC<\X> is an abbreviation for a character class sequence that includes
1N/Athe Unicode 'combining character sequences'.  A 'combining character
1N/Asequence' is a base character followed by any number of combining
1N/Acharacters.  An example of a combining character is an accent.   Using
1N/Athe Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining
1N/Acharacter sequence with base character C<A> and combining character
1N/AS<C<COMBINING RING> >, which translates in Danish to A with the circle
atop it, as in the word Angstrom.  C<\X> is equivalent to C<\PM\pM*>,
i.e., a non-mark followed by zero or more marks.
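
The difference between counting characters and counting combining
character sequences can be sketched as follows, assuming a reasonably
recent perl.  The code point U+030A is the combining ring mentioned
above; the variable names are hypothetical:

```perl
use strict;
use warnings;

# 'A' followed by U+030A COMBINING RING ABOVE: two characters, one glyph.
my $ring = "A\x{030A}";

my $chars     = length $ring;               # counts individual characters
my @clusters  = $ring =~ /(\X)/g;           # counts combining sequences
my $one_glyph = ($ring =~ /^\X$/) ? 1 : 0;  # whole string is one sequence

print "$chars characters, ", scalar(@clusters), " sequence\n";
```

Matching with C<\X> rather than C<.> keeps a base character and its
accents together when text is split or truncated.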
1N/A
1N/AFor the full and latest information about Unicode see the latest
1N/AUnicode standard, or the Unicode Consortium's website http://www.unicode.org/
1N/A
1N/AAs if all those classes weren't enough, Perl also defines POSIX style
1N/Acharacter classes.  These have the form C<[:name:]>, with C<name> the
1N/Aname of the POSIX class.  The POSIX classes are C<alpha>, C<alnum>,
1N/AC<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
1N/AC<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
1N/Aextension to match C<\w>), and C<blank> (a GNU extension).  If C<utf8>
1N/Ais being used, then these classes are defined the same as their
1N/Acorresponding perl Unicode classes: C<[:upper:]> is the same as
1N/AC<\p{IsUpper}>, etc.  The POSIX character classes, however, don't
1N/Arequire using C<utf8>.  The C<[:digit:]>, C<[:word:]>, and
1N/AC<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
1N/Acharacter classes.  To negate a POSIX class, put a C<^> in front of
1N/Athe name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
1N/AC<utf8>, C<\P{IsDigit}>.  The Unicode and POSIX character classes can
1N/Abe used just like C<\d>, with the exception that POSIX character
1N/Aclasses can only be used inside of a character class:
1N/A
1N/A    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
1N/A    /^=item\s[[:digit:]]/;      # match '=item',
1N/A                                # followed by a space and a digit
1N/A    use charnames ":full";
1N/A    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
1N/A    /^=item\s\p{IsDigit}/;        # match '=item',
1N/A                                  # followed by a space and a digit
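
The negated POSIX form can be sketched in the same way.  The sample
line and variable names are made up:

```perl
use strict;
use warnings;

my $line = "item 42b";

# POSIX classes, negated or not, must sit inside a bracketed class.
(my $no_digits = $line) =~ s/[[:digit:]]+//g;
my ($first_run) = $line =~ /([[:^digit:]]+)/;

# Several POSIX classes can share one bracketed class.
my $mixed = ($line =~ /[[:alpha:][:digit:]]+$/) ? 1 : 0;

print "'$no_digits' '$first_run' $mixed\n";
```

Writing C<[:digit:]> without the outer brackets would instead be read
as an ordinary character class containing those punctuation marks and
letters.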
1N/A
1N/AWhew! That is all the rest of the characters and character classes.
1N/A
1N/A=head2 Compiling and saving regular expressions
1N/A
1N/AIn Part 1 we discussed the C<//o> modifier, which compiles a regexp
1N/Ajust once.  This suggests that a compiled regexp is some data structure
1N/Athat can be stored once and used again and again.  The regexp quote
1N/AC<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
1N/Aregexp and transforms the result into a form that can be assigned to a
1N/Avariable:
1N/A
    $reg = qr/foo+bar?/;  # $reg contains a compiled regexp
1N/A
1N/AThen C<$reg> can be used as a regexp:
1N/A
1N/A    $x = "fooooba";
1N/A    $x =~ $reg;     # matches, just like /foo+bar?/
1N/A    $x =~ /$reg/;   # same thing, alternate form
1N/A
1N/AC<$reg> can also be interpolated into a larger regexp:
1N/A
1N/A    $x =~ /(abc)?$reg/;  # still matches
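
Interpolation also makes it possible to build a larger regexp out of
small named pieces.  A minimal sketch, with hypothetical names
C<$sign>, C<$digits> and C<$integer>:

```perl
use strict;
use warnings;

my $sign   = qr/[+-]?/;
my $digits = qr/\d+/;

# Compiled pieces interpolate into a larger pattern like plain strings.
my $integer = qr/^$sign$digits$/;

my @ok = grep { $_ =~ $integer } ("42", "-7", "+x", "3.5");
print "@ok\n";
```

Naming the pieces documents the intent of each part and lets them be
reused in other patterns.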
1N/A
1N/AAs with the matching operator, the regexp quote can use different
1N/Adelimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>.  The single quote
1N/Adelimiters C<qr''> prevent any interpolation from taking place.
1N/A
Pre-compiled regexps are useful for creating dynamic matches that
don't need to be recompiled each time they are encountered.  Using
pre-compiled regexps, the C<simple_grep> program can be expanded into
a program that matches multiple patterns:
1N/A
1N/A    % cat > multi_grep
1N/A    #!/usr/bin/perl
1N/A    # multi_grep - match any of <number> regexps
1N/A    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
1N/A
1N/A    $number = shift;
1N/A    $regexp[$_] = shift foreach (0..$number-1);
1N/A    @compiled = map qr/$_/, @regexp;
1N/A    while ($line = <>) {
1N/A        foreach $pattern (@compiled) {
1N/A            if ($line =~ /$pattern/) {
1N/A                print $line;
1N/A                last;  # we matched, so move onto the next line
1N/A            }
1N/A        }
1N/A    }
1N/A    ^D
1N/A
1N/A    % multi_grep 2 last for multi_grep
1N/A        $regexp[$_] = shift foreach (0..$number-1);
1N/A            foreach $pattern (@compiled) {
1N/A                    last;
1N/A
1N/AStoring pre-compiled regexps in an array C<@compiled> allows us to
1N/Asimply loop through the regexps without any recompilation, thus gaining
1N/Aflexibility without sacrificing speed.
1N/A
1N/A=head2 Embedding comments and modifiers in a regular expression
1N/A
1N/AStarting with this section, we will be discussing Perl's set of
1N/AB<extended patterns>.  These are extensions to the traditional regular
1N/Aexpression syntax that provide powerful new tools for pattern
1N/Amatching.  We have already seen extensions in the form of the minimal
1N/Amatching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>.  The
1N/Arest of the extensions below have the form C<(?char...)>, where the
1N/AC<char> is a character that determines the type of extension.
1N/A
1N/AThe first extension is an embedded comment C<(?#text)>.  This embeds a
1N/Acomment into the regular expression without affecting its meaning.  The
1N/Acomment should not have any closing parentheses in the text.  An
1N/Aexample is
1N/A
1N/A    /(?# Match an integer:)[+-]?\d+/;
1N/A
1N/AThis style of commenting has been largely superseded by the raw,
1N/Afreeform commenting that is allowed with the C<//x> modifier.
1N/A
The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded
in a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>.  For instance,
1N/A
1N/A    /(?i)yes/;  # match 'yes' case insensitively
1N/A    /yes/i;     # same thing
1N/A    /(?x)(          # freeform version of an integer regexp
1N/A             [+-]?  # match an optional sign
1N/A             \d+    # match a sequence of digits
1N/A         )
1N/A    /x;
1N/A
Embedded modifiers can have two important advantages over the usual
modifiers.  Embedded modifiers allow a custom set of modifiers for
I<each> regexp pattern.  This is great for matching an array of regexps
that must have different modifiers:
1N/A
1N/A    $pattern[0] = '(?i)doctor';
1N/A    $pattern[1] = 'Johnson';
1N/A    ...
1N/A    while (<>) {
1N/A        foreach $patt (@pattern) {
1N/A            print if /$patt/;
1N/A        }
1N/A    }
1N/A
1N/AThe second advantage is that embedded modifiers only affect the regexp
1N/Ainside the group the embedded modifier is contained in.  So grouping
1N/Acan be used to localize the modifier's effects:
1N/A
1N/A    /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.
1N/A
1N/AEmbedded modifiers can also turn off any modifiers already present
1N/Aby using, e.g., C<(?-i)>.  Modifiers can also be combined into
1N/Aa single expression, e.g., C<(?s-i)> turns on single line mode and
1N/Aturns off case insensitivity.
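
Turning a modifier on and then off again within one regexp can be
sketched as follows.  The sample strings are invented for the example:

```perl
use strict;
use warnings;

# (?i) switches case insensitivity on, (?-i) switches it off again,
# so only the word 'yes' is matched without regard to case.
my $re = qr/(?i)yes(?-i), sir/;

my @hits = grep { $_ =~ $re } ("YES, sir", "yes, sir", "yes, SIR");
print scalar(@hits), " of 3 matched\n";
```

The third string fails because C<sir> must be matched in exactly the
case given once C<(?-i)> has taken effect.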
1N/A
1N/A=head2 Non-capturing groupings
1N/A
1N/AWe noted in Part 1 that groupings C<()> had two distinct functions: 1)
1N/Agroup regexp elements together as a single unit, and 2) extract, or
1N/Acapture, substrings that matched the regexp in the
1N/Agrouping.  Non-capturing groupings, denoted by C<(?:regexp)>, allow the
1N/Aregexp to be treated as a single unit, but don't extract substrings or
1N/Aset matching variables C<$1>, etc.  Both capturing and non-capturing
1N/Agroupings are allowed to co-exist in the same regexp.  Because there is
1N/Ano extraction, non-capturing groupings are faster than capturing
1N/Agroupings.  Non-capturing groupings are also handy for choosing exactly
1N/Awhich parts of a regexp are to be extracted to matching variables:
1N/A
1N/A    # match a number, $1-$4 are set, but we only want $1
1N/A    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;
1N/A
    # match a number faster, only $1 is set
1N/A    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;
1N/A
1N/A    # match a number, get $1 = whole number, $2 = exponent
1N/A    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
1N/A
1N/ANon-capturing groupings are also useful for removing nuisance
1N/Aelements gathered from a split operation:
1N/A
1N/A    $x = '12a34b5';
1N/A    @num = split /(a|b)/, $x;    # @num = ('12','a','34','b','5')
1N/A    @num = split /(?:a|b)/, $x;  # @num = ('12','34','5')
1N/A
1N/ANon-capturing groupings may also have embedded modifiers:
1N/AC<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
1N/Acase insensitively and turns off multi-line mode.
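
A small sketch of a non-capturing grouping with an embedded modifier,
together with a check of which groupings capture.  The strings and
variable names are illustrative only:

```perl
use strict;
use warnings;

# (?i:...) groups without capturing and limits //i to the grouping.
my $matched  = ("Answer: YES" =~ /^Answer: (?i:yes)$/) ? 1 : 0;
my $rejected = ("ANSWER: yes" =~ /^Answer: (?i:yes)$/) ? 0 : 1;

# Only plain parentheses create captures; the (?:...) grouping does not,
# so the optional part contributes just one capture, not two.
my @caps = "2024-06" =~ /^(\d{4})(?:-(\d\d))?$/;

print "matched=$matched rejected=$rejected caps=@caps\n";
```

Here C<$1> and C<$2> hold the two digit runs, and no capture is spent
on the grouping that merely makes the second part optional.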
1N/A
1N/A=head2 Looking ahead and looking behind
1N/A
1N/AThis section concerns the lookahead and lookbehind assertions.  First,
1N/Aa little background.
1N/A
1N/AIn Perl regular expressions, most regexp elements 'eat up' a certain
1N/Aamount of string when they match.  For instance, the regexp element
C<[abc]> eats up one character of the string when it matches, in the
1N/Asense that perl moves to the next character position in the string
1N/Aafter the match.  There are some elements, however, that don't eat up
1N/Acharacters (advance the character position) if they match.  The examples
1N/Awe have seen so far are the anchors.  The anchor C<^> matches the
1N/Abeginning of the line, but doesn't eat any characters.  Similarly, the
1N/Aword boundary anchor C<\b> matches, e.g., if the character to the left
1N/Ais a word character and the character to the right is a non-word
1N/Acharacter, but it doesn't eat up any characters itself.  Anchors are
1N/Aexamples of 'zero-width assertions'.  Zero-width, because they consume
1N/Ano characters, and assertions, because they test some property of the
1N/Astring.  In the context of our walk in the woods analogy to regexp
1N/Amatching, most regexp elements move us along a trail, but anchors have
1N/Aus stop a moment and check our surroundings.  If the local environment
1N/Achecks out, we can proceed forward.  But if the local environment
1N/Adoesn't satisfy us, we must backtrack.
1N/A
1N/AChecking the environment entails either looking ahead on the trail,
1N/Alooking behind, or both.  C<^> looks behind, to see that there are no
1N/Acharacters before.  C<$> looks ahead, to see that there are no
1N/Acharacters after.  C<\b> looks both ahead and behind, to see if the
1N/Acharacters on either side differ in their 'word'-ness.
1N/A
1N/AThe lookahead and lookbehind assertions are generalizations of the
1N/Aanchor concept.  Lookahead and lookbehind are zero-width assertions
1N/Athat let us specify which characters we want to test for.  The
1N/Alookahead assertion is denoted by C<(?=regexp)> and the lookbehind
1N/Aassertion is denoted by C<< (?<=fixed-regexp) >>.  Some examples are
1N/A
1N/A    $x = "I catch the housecat 'Tom-cat' with catnip";
1N/A    $x =~ /cat(?=\s+)/;  # matches 'cat' in 'housecat'
1N/A    @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
1N/A                                           # $catwords[0] = 'catch'
1N/A                                           # $catwords[1] = 'catnip'
1N/A    $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
1N/A    $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
1N/A                              # middle of $x
1N/A
1N/ANote that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
1N/Anon-capturing, since these are zero-width assertions.  Thus in the
1N/Asecond regexp, the substrings captured are those of the whole regexp
1N/Aitself.  Lookahead C<(?=regexp)> can match arbitrary regexps, but
1N/Alookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
1N/Awidth, i.e., a fixed number of characters long.  Thus
1N/AC<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not.  The
1N/Anegated versions of the lookahead and lookbehind assertions are
1N/Adenoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
1N/AThey evaluate true if the regexps do I<not> match:
1N/A
1N/A    $x = "foobar";
1N/A    $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
1N/A    $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
1N/A    $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'
1N/A
1N/AThe C<\C> is unsupported in lookbehind, because the already
1N/Atreacherous definition of C<\C> would become even more so
1N/Awhen going backwards.
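
Because lookahead and lookbehind consume no characters, they can be
used to pick out positions rather than substrings.  The following
sketch inserts thousands separators into a number; the variable names
are hypothetical:

```perl
use strict;
use warnings;

my $n = "cost 1234567 units";

# Insert a comma at every position that has a digit behind it and a
# whole number of three-digit groups, then a non-digit, ahead of it.
(my $pretty = $n) =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;

print "$pretty\n";
```

The regexp itself matches the empty string, so the substitution adds
commas without removing any of the original characters.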
1N/A
1N/A=head2 Using independent subexpressions to prevent backtracking
1N/A
1N/AThe last few extended patterns in this tutorial are experimental as of
1N/A5.6.0.  Play with them, use them in some code, but don't rely on them
1N/Ajust yet for production code.
1N/A
1N/AS<B<Independent subexpressions> > are regular expressions, in the
1N/Acontext of a larger regular expression, that function independently of
1N/Athe larger regular expression.  That is, they consume as much or as
1N/Alittle of the string as they wish without regard for the ability of
1N/Athe larger regexp to match.  Independent subexpressions are represented
1N/Aby C<< (?>regexp) >>.  We can illustrate their behavior by first
1N/Aconsidering an ordinary regexp:
1N/A
1N/A    $x = "ab";
1N/A    $x =~ /a*ab/;  # matches
1N/A
1N/AThis obviously matches, but in the process of matching, the
1N/Asubexpression C<a*> first grabbed the C<a>.  Doing so, however,
1N/Awouldn't allow the whole regexp to match, so after backtracking, C<a*>
1N/Aeventually gave back the C<a> and matched the empty string.  Here, what
1N/AC<a*> matched was I<dependent> on what the rest of the regexp matched.
1N/A
1N/AContrast that with an independent subexpression:
1N/A
1N/A    $x =~ /(?>a*)ab/;  # doesn't match!
1N/A
1N/AThe independent subexpression C<< (?>a*) >> doesn't care about the rest
1N/Aof the regexp, so it sees an C<a> and grabs it.  Then the rest of the
1N/Aregexp C<ab> cannot match.  Because C<< (?>a*) >> is independent, there
1N/Ais no backtracking and the independent subexpression does not give
1N/Aup its C<a>.  Thus the match of the regexp as a whole fails.  A similar
1N/Abehavior occurs with completely independent regexps:
1N/A
1N/A    $x = "ab";
1N/A    $x =~ /a*/g;   # matches, eats an 'a'
1N/A    $x =~ /\Gab/g; # doesn't match, no 'a' available
1N/A
1N/AHere C<//g> and C<\G> create a 'tag team' handoff of the string from
1N/Aone regexp to the other.  Regexps with an independent subexpression are
1N/Amuch like this, with a handoff of the string to the independent
1N/Asubexpression, and a handoff of the string back to the enclosing
1N/Aregexp.
1N/A
1N/AThe ability of an independent subexpression to prevent backtracking
1N/Acan be quite useful.  Suppose we want to match a non-empty string
1N/Aenclosed in parentheses up to two levels deep.  Then the following
1N/Aregexp matches:
1N/A
1N/A    $x = "abc(de(fg)h";  # unbalanced parentheses
1N/A    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
1N/A
1N/AThe regexp matches an open parenthesis, one or more copies of an
1N/Aalternation, and a close parenthesis.  The alternation is two-way, with
1N/Athe first alternative C<[^()]+> matching a substring with no
1N/Aparentheses and the second alternative C<\([^()]*\)>  matching a
1N/Asubstring delimited by parentheses.  The problem with this regexp is
1N/Athat it is pathological: it has nested indeterminate quantifiers
1N/Aof the form C<(a+|b)+>.  We discussed in Part 1 how nested quantifiers
1N/Alike this could take an exponentially long time to execute if there
1N/Awas no match possible.  To prevent the exponential blowup, we need to
1N/Aprevent useless backtracking at some point.  This can be done by
1N/Aenclosing the inner quantifier as an independent subexpression:
1N/A
1N/A    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
1N/A
1N/AHere, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
1N/Aby gobbling up as much of the string as possible and keeping it.   Then
1N/Amatch failures fail much more quickly.
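
The refusal of an independent subexpression to give back characters can
be seen in a short sketch.  The flag variables C<$plain>, C<$atomic>
and C<$usable> are made up for the example:

```perl
use strict;
use warnings;

my $x = "aaab";

my $plain  = ($x =~ /^a*ab/)     ? 1 : 0;  # a* gives back one 'a'
my $atomic = ($x =~ /^(?>a*)ab/) ? 1 : 0;  # (?>a*) keeps all the a's
my $usable = ($x =~ /^(?>a*)b/)  ? 1 : 0;  # nothing needs giving back

print "plain=$plain atomic=$atomic usable=$usable\n";
```

An independent subexpression is therefore safe only where what follows
it never needs any of the characters it has grabbed.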
1N/A
1N/A=head2 Conditional expressions
1N/A
1N/AA S<B<conditional expression> > is a form of if-then-else statement
1N/Athat allows one to choose which patterns are to be matched, based on
1N/Asome condition.  There are two types of conditional expression:
1N/AC<(?(condition)yes-regexp)> and
1N/AC<(?(condition)yes-regexp|no-regexp)>.  C<(?(condition)yes-regexp)> is
1N/Alike an S<C<'if () {}'> > statement in Perl.  If the C<condition> is true,
1N/Athe C<yes-regexp> will be matched.  If the C<condition> is false, the
1N/AC<yes-regexp> will be skipped and perl will move onto the next regexp
1N/Aelement.  The second form is like an S<C<'if () {} else {}'> > statement
1N/Ain Perl.  If the C<condition> is true, the C<yes-regexp> will be
1N/Amatched, otherwise the C<no-regexp> will be matched.
1N/A
1N/AThe C<condition> can have two forms.  The first form is simply an
1N/Ainteger in parentheses C<(integer)>.  It is true if the corresponding
1N/Abackreference C<\integer> matched earlier in the regexp.  The second
1N/Aform is a bare zero width assertion C<(?...)>, either a
1N/Alookahead, a lookbehind, or a code assertion (discussed in the next
1N/Asection).
1N/A
1N/AThe integer form of the C<condition> allows us to choose, with more
1N/Aflexibility, what to match based on what matched earlier in the
1N/Aregexp. This searches for words of the form C<"$x$x"> or
1N/AC<"$x$y$y$x">:
1N/A
1N/A    % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
1N/A    beriberi
1N/A    coco
1N/A    couscous
1N/A    deed
1N/A    ...
1N/A    toot
1N/A    toto
1N/A    tutu
1N/A
1N/AThe lookbehind C<condition> allows, along with backreferences,
1N/Aan earlier part of the match to influence a later part of the
1N/Amatch.  For instance,
1N/A
1N/A    /[ATGC]+(?(?<=AA)G|C)$/;
1N/A
1N/Amatches a DNA sequence such that it either ends in C<AAG>, or some
1N/Aother base pair combination and C<C>.  Note that the form is
1N/AC<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
1N/Alookahead, lookbehind or code assertions, the parentheses around the
1N/Aconditional are not needed.
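
The integer form of the condition can also be sketched on a smaller
problem: accepting a number that is either bare or fully parenthesized.
The test strings are invented for the example:

```perl
use strict;
use warnings;

# If grouping 1 matched an opening parenthesis, require a closing one;
# otherwise match nothing further.
my $re = qr/^(\()?\d+(?(1)\))$/;

my @ok = grep { $_ =~ $re } ("123", "(123)", "(123", "123)");
print "@ok\n";
```

Only the first two strings are accepted, since an unbalanced
parenthesis on either side leaves the condition unsatisfied.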
1N/A
1N/A=head2 A bit of magic: executing Perl code in a regular expression
1N/A
1N/ANormally, regexps are a part of Perl expressions.
1N/AS<B<Code evaluation> > expressions turn that around by allowing
1N/Aarbitrary Perl code to be a part of a regexp.  A code evaluation
1N/Aexpression is denoted C<(?{code})>, with C<code> a string of Perl
1N/Astatements.
1N/A
1N/ACode expressions are zero-width assertions, and the value they return
1N/Adepends on their environment.  There are two possibilities: either the
1N/Acode expression is used as a conditional in a conditional expression
1N/AC<(?(condition)...)>, or it is not.  If the code expression is a
1N/Aconditional, the code is evaluated and the result (i.e., the result of
1N/Athe last statement) is used to determine truth or falsehood.  If the
1N/Acode expression is not used as a conditional, the assertion always
1N/Aevaluates true and the result is put into the special variable
1N/AC<$^R>.  The variable C<$^R> can then be used in code expressions later
1N/Ain the regexp.  Here are some silly examples:
1N/A
1N/A    $x = "abcdef";
1N/A    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
1N/A                                         # prints 'Hi Mom!'
1N/A    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
1N/A                                         # no 'Hi Mom!'
1N/A
1N/APay careful attention to the next example:
1N/A
1N/A    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
1N/A                                         # no 'Hi Mom!'
1N/A                                         # but why not?
1N/A
1N/AAt first glance, you'd think that it shouldn't print, because obviously
1N/Athe C<ddd> isn't going to match the target string. But look at this
1N/Aexample:
1N/A
1N/A    $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
1N/A                                           # but _does_ print
1N/A
1N/AHmm. What happened here? If you've been following along, you know that
1N/Athe above pattern should be effectively the same as the last one --
1N/Aenclosing the d in a character class isn't going to change what it
1N/Amatches. So why does the first not print while the second one does?
1N/A
1N/AThe answer lies in the optimizations the REx engine makes. In the first
1N/Acase, all the engine sees are plain old characters (aside from the
1N/AC<?{}> construct). It's smart enough to realize that the string 'ddd'
1N/Adoesn't occur in our target string before actually running the pattern
1N/Athrough. But in the second case, we've tricked it into thinking that our
1N/Apattern is more complicated than it is. It takes a look, sees our
1N/Acharacter class, and decides that it will have to actually run the
1N/Apattern to determine whether or not it matches, and in the process of
1N/Arunning it hits the print statement before it discovers that we don't
1N/Ahave a match.
1N/A
1N/ATo take a closer look at how the engine does optimizations, see the
1N/Asection L<"Pragmas and debugging"> below.
1N/A
1N/AMore fun with C<?{}>:
1N/A
1N/A    $x =~ /(?{print "Hi Mom!";})/;       # matches,
1N/A                                         # prints 'Hi Mom!'
1N/A    $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
1N/A                                           # prints '1'
1N/A    $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
1N/A                                           # prints '1'
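
The way C<$^R> carries a value from one code expression to the next can
be sketched as follows; the pattern and the variable C<$total> are
invented for illustration:

    use strict;

    our $total;   # package variable, so the code expression can set it
    "abc" =~ /a(?{ 10 })            # the result 10 becomes the saved result
              b(?{ $^R + 5 })       # reads the saved result, saves 15
              c(?{ $total = $^R })  # copies the running result out
             /x;
    print "total is $total\n";      # prints "total is 15"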
1N/A
1N/AThe bit of magic mentioned in the section title occurs when the regexp
1N/Abacktracks in the process of searching for a match.  If the regexp
1N/Abacktracks over a code expression and if the variables used within are
1N/Alocalized using C<local>, the changes in the variables produced by the
1N/Acode expression are undone! Thus, if we wanted to count how many times
1N/Aa character got matched inside a group, we could use, e.g.,
1N/A
1N/A    $x = "aaaa";
1N/A    $count = 0;  # initialize 'a' count
1N/A    $c = "bob";  # test if $c gets clobbered
1N/A    $x =~ /(?{local $c = 0;})         # initialize count
1N/A           ( a                        # match 'a'
1N/A             (?{local $c = $c + 1;})  # increment count
1N/A           )*                         # do this any number of times,
1N/A           aa                         # but match 'aa' at the end
1N/A           (?{$count = $c;})          # copy local $c var into $count
1N/A          /x;
1N/A    print "'a' count is $count, \$c variable is '$c'\n";
1N/A
1N/AThis prints
1N/A
1N/A    'a' count is 2, $c variable is 'bob'
1N/A
1N/AIf we replace the S<C< (?{local $c = $c + 1;})> > with
1N/AS<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone
1N/Aduring backtracking, and we get
1N/A
1N/A    'a' count is 4, $c variable is 'bob'
1N/A
1N/ANote that only localized variable changes are undone.  Other side
1N/Aeffects of code expression execution are permanent.  Thus
1N/A
1N/A    $x = "aaaa";
1N/A    $x =~ /(a(?{print "Yow\n";}))*aa/;
1N/A
1N/Aproduces
1N/A
1N/A   Yow
1N/A   Yow
1N/A   Yow
1N/A   Yow
1N/A
1N/AThe result C<$^R> is automatically localized, so that it will behave
1N/Aproperly in the presence of backtracking.
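
The contrast between localized changes and other side effects can be
seen in a single match.  In this sketch the array C<@seen> is a
hypothetical addition that records the position after each C<a>:

    use strict;

    our ($c, $count, @seen);
    $count = 0;
    "aaaa" =~ /(?{ local $c = 0; })        # initialize count
               (?: a                       # match 'a'
                   (?{ local $c = $c + 1;  # undone on backtracking
                       push @seen, pos();  # never undone
                     })
               )*
               aa                          # forces two steps back
               (?{ $count = $c; })         # copy the surviving count
              /x;
    print "count is $count, positions seen: @seen\n";

The count ends up as 2, but C<@seen> keeps all four recorded
positions, including the two that were backtracked over.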
1N/A
1N/AThis example uses a code expression in a conditional to match the
1N/Aarticle 'the' in either English or German:
1N/A
1N/A    $lang = 'DE';  # use German
1N/A    ...
1N/A    $text = "das";
1N/A    print "matched\n"
1N/A        if $text =~ /(?(?{
1N/A                          $lang eq 'EN'; # is the language English?
1N/A                         })
1N/A                       the |             # if so, then match 'the'
1N/A                       (die|das|der)     # else, match 'die|das|der'
1N/A                     )
1N/A                    /xi;
1N/A
1N/ANote that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
1N/AC<(?((?{...}))yes-regexp|no-regexp)>.  In other words, in the case of a
1N/Acode expression, we don't need the extra parentheses around the
1N/Aconditional.
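
A conditional of this kind can also be stored with C<qr//> and reused.
The following is only a sketch; the sub C<is_article> and the anchors
are assumptions added for illustration:

    use strict;

    our $lang;   # package variable consulted when the match runs
    my $article = qr/^(?(?{ $lang eq 'EN' })
                        the |             # English article
                        (?:die|das|der)   # German articles
                      )$/xi;

    sub is_article {
        local $lang = shift;   # language code, 'EN' or anything else
        my $word = shift;
        return $word =~ $article ? 1 : 0;
    }

    print is_article('EN', 'the'), is_article('EN', 'das'),
          is_article('DE', 'Das'), is_article('DE', 'the'), "\n";

Because C<$lang> is read when the match runs, the same compiled regexp
serves both languages.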
1N/A
1N/AIf you try to use code expressions with interpolating variables, perl
1N/Amay surprise you:
1N/A
1N/A    $bar = 5;
1N/A    $pat = '(?{ 1 })';
1N/A    /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
1N/A    /foo(?{ 1 })$bar/;   # compile error!
1N/A    /foo${pat}bar/;      # compile error!
1N/A
1N/A    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
1N/A    /foo${pat}bar/;      # compiles ok
1N/A
If a regexp has (1) code expressions and interpolating variables, or
(2) a variable that interpolates a code expression, perl treats the
regexp as an error. If the code expression is precompiled into a
variable, however, interpolating is ok. The question is, why is this
an error?
1N/A
1N/AThe reason is that variable interpolation and code expressions
1N/Atogether pose a security risk.  The combination is dangerous because
1N/Amany programmers who write search engines often take user input and
1N/Aplug it directly into a regexp:
1N/A
    $regexp = <>;       # read user-supplied regexp
    chomp $regexp;      # get rid of possible newline
    $text =~ /$regexp/; # search $text for the $regexp
1N/A
If the C<$regexp> variable contains a code expression, the user could
then execute arbitrary Perl code.  For instance, some joker could
search for S<C<< (?{ system('rm -rf *'); }) >> > to erase your files.  In
this sense, the combination of interpolation and code expressions
B<taints> your regexp.  So by default, using both interpolation and
code expressions in the same regexp is not allowed.  If you're not
concerned about malicious users, it is possible to bypass this
security check by invoking S<C<use re 'eval'> >:
1N/A
1N/A    use re 'eval';       # throw caution out the door
1N/A    $bar = 5;
1N/A    $pat = '(?{ 1 })';
1N/A    /foo(?{ 1 })$bar/;   # compiles ok
1N/A    /foo${pat}bar/;      # compiles ok
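
When the user's input is meant as literal text rather than as a
pattern, one common precaution, sketched here with made-up input, is
to quote it with C<\Q...\E> (or C<quotemeta>) so that nothing in it is
treated as regexp syntax:

    use strict;
    use warnings;

    my $input = '(?{ print "gotcha" })';   # hostile-looking input
    my $text  = 'log line: (?{ print "gotcha" }) was typed';

    # \Q...\E escapes the metacharacters, so the input is matched
    # as plain characters and no code expression is compiled.
    my $found = $text =~ /\Q$input\E/ ? 1 : 0;
    print "found literal input: $found\n";   # prints 1

This does not help if users must be able to supply real regexps.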
1N/A
1N/AAnother form of code expression is the S<B<pattern code expression> >.
1N/AThe pattern code expression is like a regular code expression, except
1N/Athat the result of the code evaluation is treated as a regular
1N/Aexpression and matched immediately.  A simple example is
1N/A
1N/A    $length = 5;
1N/A    $char = 'a';
1N/A    $x = 'aaaaabb';
1N/A    $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
1N/A
1N/A
1N/AThis final example contains both ordinary and pattern code
1N/Aexpressions.   It detects if a binary string C<1101010010001...> has a
1N/AFibonacci spacing 0,1,1,2,3,5,...  of the C<1>'s:
1N/A
1N/A    $s0 = 0; $s1 = 1; # initial conditions
1N/A    $x = "1101010010001000001";
1N/A    print "It is a Fibonacci sequence\n"
1N/A        if $x =~ /^1         # match an initial '1'
1N/A                    (
1N/A                       (??{'0' x $s0}) # match $s0 of '0'
1N/A                       1               # and then a '1'
1N/A                       (?{
1N/A                          $largest = $s0;   # largest seq so far
1N/A                          $s2 = $s1 + $s0;  # compute next term
1N/A                          $s0 = $s1;        # in Fibonacci sequence
1N/A                          $s1 = $s2;
1N/A                         })
1N/A                    )+   # repeat as needed
1N/A                  $      # that is all there is
1N/A                 /x;
1N/A    print "Largest sequence matched was $largest\n";
1N/A
1N/AThis prints
1N/A
1N/A    It is a Fibonacci sequence
1N/A    Largest sequence matched was 5
1N/A
1N/AHa! Try that with your garden variety regexp package...
1N/A
1N/ANote that the variables C<$s0> and C<$s1> are not substituted when the
1N/Aregexp is compiled, as happens for ordinary variables outside a code
1N/Aexpression.  Rather, the code expressions are evaluated when perl
1N/Aencounters them during the search for a match.
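
This late evaluation is easy to observe with a precompiled regexp.  In
the sketch below (the variable C<$n> and the test string are
illustrative), the same C<qr//> object gives different results as
C<$n> changes:

    use strict;

    our $n;
    my $re = qr/^(??{ 'a' x $n })$/;   # 'a' x $n is computed per match

    $n = 2;
    my $first  = 'aa' =~ $re ? 1 : 0;  # pattern acts like /^aa$/
    $n = 3;
    my $second = 'aa' =~ $re ? 1 : 0;  # pattern acts like /^aaa$/
    print "$first $second\n";          # prints "1 0"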
1N/A
1N/AThe regexp without the C<//x> modifier is
1N/A
    /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;
1N/A
1N/Aand is a great start on an Obfuscated Perl entry :-) When working with
1N/Acode and conditional expressions, the extended form of regexps is
1N/Aalmost necessary in creating and debugging regexps.
1N/A
1N/A=head2 Pragmas and debugging
1N/A
1N/ASpeaking of debugging, there are several pragmas available to control
1N/Aand debug regexps in Perl.  We have already encountered one pragma in
1N/Athe previous section, S<C<use re 'eval';> >, that allows variable
1N/Ainterpolation and code expressions to coexist in a regexp.  The other
1N/Apragmas are
1N/A
    use re 'taint';
    $tainted = <>;
    @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted
1N/A
1N/AThe C<taint> pragma causes any substrings from a match with a tainted
1N/Avariable to be tainted as well.  This is not normally the case, as
1N/Aregexps are often used to extract the safe bits from a tainted
1N/Avariable.  Use C<taint> when you are not extracting safe bits, but are
1N/Aperforming some other processing.  Both C<taint> and C<eval> pragmas
1N/Aare lexically scoped, which means they are in effect only until
1N/Athe end of the block enclosing the pragmas.
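
The lexical scoping can be seen by enabling C<eval> inside a bare block
only.  This sketch assumes the default behavior described in the
previous section, where a run-time pattern containing a code
expression is refused unless C<use re 'eval'> is in effect:

    use strict;
    use warnings;

    my $pat = '(?{ 1 })';
    my ($inside, $outside);
    {
        use re 'eval';    # in effect only until the closing brace
        $inside = eval { 'foo' =~ /foo$pat/ ? 1 : 0 };
    }
    $outside = eval { 'foo' =~ /foo$pat/ ? 1 : 0 };   # dies inside eval
    print "inside: $inside\n";
    print "outside: ", defined $outside ? $outside : "refused", "\n";

Under that assumption the first match succeeds and the second is
refused, leaving C<$outside> undefined.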
1N/A
1N/A    use re 'debug';
1N/A    /^(.*)$/s;       # output debugging info
1N/A
1N/A    use re 'debugcolor';
1N/A    /^(.*)$/s;       # output debugging info in living color
1N/A
The global C<debug> and C<debugcolor> pragmas allow one to get
detailed debugging info about regexp compilation and
execution.  C<debugcolor> is the same as C<debug>, except the debugging
information is displayed in color on terminals that can display
termcap color sequences.  Here is example output:
1N/A
1N/A    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
1N/A    Compiling REx `a*b+c'
1N/A    size 9 first at 1
1N/A       1: STAR(4)
1N/A       2:   EXACT <a>(0)
1N/A       4: PLUS(7)
1N/A       5:   EXACT <b>(0)
1N/A       7: EXACT <c>(9)
1N/A       9: END(0)
1N/A    floating `bc' at 0..2147483647 (checking floating) minlen 2
1N/A    Guessing start of match, REx `a*b+c' against `abc'...
1N/A    Found floating substr `bc' at offset 1...
1N/A    Guessed: match at offset 0
1N/A    Matching REx `a*b+c' against `abc'
1N/A      Setting an EVAL scope, savestack=3
1N/A       0 <> <abc>             |  1:  STAR
1N/A                               EXACT <a> can match 1 times out of 32767...
1N/A      Setting an EVAL scope, savestack=3
1N/A       1 <a> <bc>             |  4:    PLUS
1N/A                               EXACT <b> can match 1 times out of 32767...
1N/A      Setting an EVAL scope, savestack=3
1N/A       2 <ab> <c>             |  7:      EXACT <c>
1N/A       3 <abc> <>             |  9:      END
1N/A    Match successful!
1N/A    Freeing REx: `a*b+c'
1N/A
1N/AIf you have gotten this far into the tutorial, you can probably guess
1N/Awhat the different parts of the debugging output tell you.  The first
1N/Apart
1N/A
1N/A    Compiling REx `a*b+c'
1N/A    size 9 first at 1
1N/A       1: STAR(4)
1N/A       2:   EXACT <a>(0)
1N/A       4: PLUS(7)
1N/A       5:   EXACT <b>(0)
1N/A       7: EXACT <c>(9)
1N/A       9: END(0)
1N/A
describes the compilation stage.  C<STAR(4)> means that there is a
starred object, in this case C<'a'>, and if it matches, go to line 4,
i.e., C<PLUS(7)>.  The middle lines describe some heuristics and
optimizations performed before a match:
1N/A
1N/A    floating `bc' at 0..2147483647 (checking floating) minlen 2
1N/A    Guessing start of match, REx `a*b+c' against `abc'...
1N/A    Found floating substr `bc' at offset 1...
1N/A    Guessed: match at offset 0
1N/A
1N/AThen the match is executed and the remaining lines describe the
1N/Aprocess:
1N/A
1N/A    Matching REx `a*b+c' against `abc'
1N/A      Setting an EVAL scope, savestack=3
1N/A       0 <> <abc>             |  1:  STAR
1N/A                               EXACT <a> can match 1 times out of 32767...
1N/A      Setting an EVAL scope, savestack=3
1N/A       1 <a> <bc>             |  4:    PLUS
1N/A                               EXACT <b> can match 1 times out of 32767...
1N/A      Setting an EVAL scope, savestack=3
1N/A       2 <ab> <c>             |  7:      EXACT <c>
1N/A       3 <abc> <>             |  9:      END
1N/A    Match successful!
1N/A    Freeing REx: `a*b+c'
1N/A
Each step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the
part of the string matched and C<< <y> >> the part not yet
matched.  The S<C<< |  1:  STAR >> > says that perl is at line number 1
in the compilation list above.  See
L<perldebguts/"Debugging regular expressions"> for much more detail.
1N/A
1N/AAn alternative method of debugging regexps is to embed C<print>
1N/Astatements within the regexp.  This provides a blow-by-blow account of
1N/Athe backtracking in an alternation:
1N/A
1N/A    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
1N/A                     t(?{print "t1\n";})
1N/A                     h(?{print "h1\n";})
1N/A                     i(?{print "i1\n";})
1N/A                     s(?{print "s1\n";})
1N/A                         |
1N/A                     t(?{print "t2\n";})
1N/A                     h(?{print "h2\n";})
1N/A                     a(?{print "a2\n";})
1N/A                     t(?{print "t2\n";})
1N/A                     (?{print "Done at position ", pos, "\n";})
1N/A                    @x;
1N/A
1N/Aprints
1N/A
1N/A    Start at position 0
1N/A    t1
1N/A    h1
1N/A    t2
1N/A    h2
1N/A    a2
1N/A    t2
1N/A    Done at position 4
1N/A
1N/A=head1 BUGS
1N/A
1N/ACode expressions, conditional expressions, and independent expressions
1N/Aare B<experimental>.  Don't use them in production code.  Yet.
1N/A
1N/A=head1 SEE ALSO
1N/A
1N/AThis is just a tutorial.  For the full story on perl regular
1N/Aexpressions, see the L<perlre> regular expressions reference page.
1N/A
1N/AFor more information on the matching C<m//> and substitution C<s///>
1N/Aoperators, see L<perlop/"Regexp Quote-Like Operators">.  For
1N/Ainformation on the C<split> operation, see L<perlfunc/split>.
1N/A
For an excellent all-around resource on the care and feeding of
regular expressions, see the book I<Mastering Regular Expressions> by
Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).
1N/A
1N/A=head1 AUTHOR AND COPYRIGHT
1N/A
1N/ACopyright (c) 2000 Mark Kvale
1N/AAll rights reserved.
1N/A
1N/AThis document may be distributed under the same terms as Perl itself.
1N/A
1N/A=head2 Acknowledgments
1N/A
1N/AThe inspiration for the stop codon DNA example came from the ZIP
1N/Acode example in chapter 7 of I<Mastering Regular Expressions>.
1N/A
1N/AThe author would like to thank Jeff Pinyan, Andrew Johnson, Peter
1N/AHaworth, Ronald J Kimball, and Joe Smith for all their helpful
1N/Acomments.
1N/A
1N/A=cut
1N/A