Perl

05. Regular Expressions

Powerful

Perl's regular expressions are often considered the most powerful in existence, with more features and obscure details than any other flavor. These notes don't cover everything.

They use the same basic features we saw in the shell with grep, but without all the backslashing.

Simple Usage

Use the // construct to match something. Will return a "true" value if matched.

while (<>) {             # read line into $_
   if (/greg/) {         # does $_ have greg in it?
      print "great!\n";
   }
}

Single Characters

Type	Matches	Example	Comment
any one (non-meta) character	exactly that character	/a/ matches "a" and only "a"
. (dot)	any one char, except newline	/./ matches any letter, number, punctuation, etc
[chars] (character class)	any one char specified	/[abc]/ matches a single a, b or c	^ after opening bracket negates meaning
\d \D	match a digit or non-digit char		like saying [0-9] or [^0-9]
\w \W	match a word-character or non-word-char		like saying [a-zA-Z0-9_] or [^a-zA-Z0-9_]
\s \S	match a space or non-space char		like saying [ \r\t\n\f] or [^ \r\t\n\f]
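
As a quick check of the table above, here is a small sketch that tries several single-character patterns against a few strings. The sample strings and the %seen hash are made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ("greg", "g7", "a b", "under_score", "");

my %seen;
for my $s (@samples) {
    my @hits;
    push @hits, "digit"    if $s =~ /\d/;        # at least one digit
    push @hits, "space"    if $s =~ /\s/;        # at least one whitespace char
    push @hits, "word"     if $s =~ /\w/;        # a letter, digit or underscore
    push @hits, "any"      if $s =~ /./;         # any char except newline
    push @hits, "nonvowel" if $s =~ /[^aeiou]/;  # one char that is not a lowercase vowel
    $seen{$s} = join ",", @hits;
    print "'$s' => $seen{$s}\n";
}
```

Note the empty string matches none of them: every pattern here needs at least one character to consume.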

Anchors

Type	Matches	Example	Comment
^ (caret, hat, circumflex)	anchor pattern at start of string	/^greg/ matches exactly "greg" at start of string only
$ (dollar sign)	anchor pattern at end of string	/greg$/ matches exactly "greg" at end of string only	will also match just before a newline at end of string
\b	match word boundary	/greg\b/ matches "greg" but not "gregory"	boundary considered where transition between \w and \W occurs
\B	match NOT word boundary	/greg\B/ matches "gregory" but not "greg" (end of string is boundary)	boundary considered where transition between \w and \W occurs

Note a word boundary (\b) is also true at the start or end of the string when the adjacent character is a \w character.
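
A small sketch of the anchors in action, using grep to filter a list of made-up test strings (the @candidates list is an assumption for illustration only):

```perl
use strict;
use warnings;

# Hypothetical test strings.
my @candidates = ("greg", "gregory", "greg!", "hi greg", "greg\n");

my @ends_word = grep { /greg\b/ } @candidates;  # "greg" followed by a word boundary
my @mid_word  = grep { /greg\B/ } @candidates;  # "greg" followed by another word char
my @at_end    = grep { /greg$/  } @candidates;  # at end, or just before a final newline
my @whole     = grep { /^greg$/ } @candidates;  # nothing but "greg" (newline tolerated)

printf "%d end a word, %d continue, %d at end, %d whole\n",
    scalar @ends_word, scalar @mid_word, scalar @at_end, scalar @whole;
```

Only "gregory" satisfies \B here, and "greg\n" still counts for $ because $ can match just before a trailing newline.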

Quantifiers

Type	Matches	Example	Comment
sequence of (non-meta) chars	match exactly that sequence	/greg/ matches exactly "greg"
* (star, asterisk)	zero or more of previous	/gg*/ matches one or more g's	quantifier, greedy, like {0,}
+ (plus)	one or more of previous	/gg+/ matches two or more g's	quantifier, greedy, like {1,}
? (question mark)	zero or one of previous	/gg?/ matches one or two g's	quantifier, greedy, like {0,1}
{n,m} (multiplier)	match n to m occurrences of previous in a row	/gg{3,}/ matches four or more g's	quantifier, greedy

Can force non-greedy matching on the quantifiers by putting a ? immediately after them.

if (/a.*b/) {         # will match as many chars as possible up to last b
   #whatever
}

if (/a.*?b/) {        # will stop matching at first b
   #whatever
}
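
To see the difference concretely, this sketch captures what each form actually matched. The string in $text is made up for illustration.

```perl
use strict;
use warnings;

my $text = "a1b2b3";   # hypothetical input with two b's

my ($greedy) = $text =~ /(a.*b)/;    # .* grabs as much as it can
my ($lazy)   = $text =~ /(a.*?b)/;   # .*? grabs as little as it can

print "greedy: $greedy\n";   # a1b2b
print "lazy:   $lazy\n";     # a1b
```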

Grouping

Use ()'s to match a pattern and remember it for later.

/(gr.g)/             # matches g-r-something-g, and remembers it

/(.)greg(.) \2\1/    # remembers chars on either side of greg for later use

After a match, the text matched inside each pair of parentheses is stored in a register: the first pair in register 1, the second in 2, etc. Recall a register later in the same pattern with its number preceded by a backslash.

Extra bonus: after the match is done, $1, $2, $3... are set to the values of \1, \2, \3, etc. for use outside of the pattern.
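
A minimal sketch of both uses: \1 and \2 inside the pattern, then $1 and $2 afterwards. The input strings are invented for the example. The captures are copied out inside the if, since $1 and friends only hold fresh values after a successful match.

```perl
use strict;
use warnings;

my $text = "xgregy yx";   # hypothetical input
my ($left, $right);
if ($text =~ /(.)greg(.) \2\1/) {
    ($left, $right) = ($1, $2);      # copy out of the registers right away
    print "left = $left, right = $right\n";
}

# A backreference also finds a repeated word.
my $doubled;
if ("this is is a test" =~ /\b(\w+) \1\b/) {
    $doubled = $1;
    print "doubled word: $doubled\n";
}
```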

Alternation

Use | to separate choices in a regex.

if (/greg|long/) {     # match if $_ contains "greg" or "long"
   #whatever
}

if (/a|b|c/) {         # like [abc]
   #
}

Alternatives are tried left to right, and the first one that lets the overall pattern match is used.
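
The order of the choices matters when one alternative is a prefix of another. A short sketch, with variable names invented for illustration:

```perl
use strict;
use warnings;

my ($short)    = "gregory" =~ /(greg|gregory)/;     # first choice already succeeds
my ($long)     = "gregory" =~ /(gregory|greg)/;     # longer choice listed first
my ($anchored) = "gregory" =~ /^(greg|gregory)$/;   # "greg" fails at $, so the next is tried

print "short = $short, long = $long, anchored = $anchored\n";
```

So Perl does not look for the longest alternative, only the first one that works.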

Precedence

There is a precedence order to regexs in Perl.

Simplification of the rules:

()'s are highest precedence, use them around everything else
|'s are lowest

/a|b*/       # single-a, or zero-or-more-b's

/^a|b/       # single-a at start of line, or a b anywhere

/(a|b)*/     # zero-or-more a's or b's
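
A sketch of how the parentheses change what the anchor and the star apply to. The test strings are made up.

```perl
use strict;
use warnings;

# Without parens, ^ belongs only to the first alternative.
my $loose = ("xb" =~ /^a|b/)   ? 1 : 0;   # "a at start" OR "b anywhere"
my $tight = ("xb" =~ /^(a|b)/) ? 1 : 0;   # "a or b" at start

# With parens, * applies to the whole group.
my ($run) = "abba!" =~ /^((a|b)*)/;       # outer group holds the whole run
my ($alt) = "bbb"   =~ /^(a|b*)/;         # * applies to b only

print "loose = $loose, tight = $tight, run = $run, alt = $alt\n";
```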

// - The Match Operator

We've seen how // surrounds a regex to be tested; there are variations to its usage.

=~ lets you specify a different thing to be tested than $_. Usage requires target on LHS and regex on RHS. !~ is the negation of that.

if (/^Greg$/) {                           if ($value =~ /^Greg$/) {
   # tests $_ for just Greg                  # tests $value for just Greg
}                                         }

                                          if ($value !~ /^Greg$/) {
                                             # tests $value for not Greg
                                          }

if (<STDIN> =~ /^Greg$/) {
   # tests what is read in against just Greg
}

next LINE if $line =~ /^#/;        # skip to next iteration of LINE: loop
                                   #   if we're seeing a comment at start

Use !~ to test for "not a match".

Can use the //i variant to deal with case: appending the i causes the match to ignore the case of letters.

if ($input =~ /greg/i) {
   # matches greg, GREG, gREg, etc
}

Can use the m// form to change delimiters to avoid Leaning Toothpick Syndrome.

if (m/greg/) {     # same as without 'm'
}

if (m/\/etc\/passwd/) {
   # ugly, have to backslash all the slashes
}

if (m#/etc/passwd#) {
   # nicer, can choose any delimiter we like
}

Note it is permitted to put a variable in the pattern.

$pattern = "Greg";
if (/$pattern Long/) {
   # match "Greg Long"
}
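
One thing to watch: metacharacters inside the variable are still treated as regex syntax. The sketch below shows this, and also uses Perl's \Q...\E escape (not covered elsewhere in these notes) to match the variable's contents literally. The value of $pattern is made up.

```perl
use strict;
use warnings;

my $pattern = "a.c";   # hypothetical value; the dot is a regex metacharacter

my $as_regex   = ("abc" =~ /$pattern/)     ? 1 : 0;   # dot matches the "b"
my $as_literal = ("abc" =~ /\Q$pattern\E/) ? 1 : 0;   # needs a real "."
my $exact      = ("a.c" =~ /\Q$pattern\E/) ? 1 : 0;

print "as_regex = $as_regex, as_literal = $as_literal, exact = $exact\n";
```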

Match Variables

After a match some variables are set.

Already mentioned $1, $2, $3, etc are set for any parenthesized items that matched.

In list context, m// returns the contents of the registers as a list.

if (/(Greg) (Long)/) {
   print "found $2 and $1\n";     # prints "found Long and Greg"
}

$_ = "Greg Long\t123-4567";
($first, $last) = /(\w+) (\w+)\t.*/;

print "first = $1, last = $2\n";          # same thing
print "first = $first, last = $last\n";

$value = "123 456";
($one, $two) = $value =~ /(\d+) (\d+)/;     # get two numbers
                                            # =~ higher precedence than =
                                            # if not a match, undef variables result
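
A sketch of the failure case, plus one common way to test the match and capture in a single step. Variable names here are invented for illustration.

```perl
use strict;
use warnings;

my $value = "123 456";
my ($one, $two) = $value =~ /(\d+) (\d+)/;            # both set

my ($x, $y) = "no numbers here" =~ /(\d+) (\d+)/;     # failed match: empty list

# A list assignment in a condition is true only if the match returned something.
my $status;
if (my ($first, $second) = $value =~ /(\d+) (\d+)/) {
    $status = "got $first and $second";
} else {
    $status = "no match";
}

print "one = $one, two = $two\n";
print "x is ", (defined $x ? "defined" : "undef"), "\n";
print "$status\n";
```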

Also have three variables set to indicate what matched, useful for learning about regexs or debugging why one isn't matching what you think it should:

$` = part of string before match began (dollar-backtick)
$& = part of string that matched regex
$' = part of string after match ended (dollar-forwardtick, i.e. single quote)

$data = "Hey, Greg Long is Cool!";

if ($data =~ /G.*g/) {
   print "found match with text: $&\n";        # prints "Greg Long"
   print "pre-match: $`\n";                    # prints "Hey, "
   print "post-match: $'\n";                   # prints " is Cool!"
}

Note these will be reset to new values the very next time a match is called for, so if you need them, save them off into another variable quickly.

Due to the implementation of perl, in older versions (before 5.20) using any of those last three is a big performance hit. They are not normally set on matches anywhere in your script, but once your script references any of them even once, they are set for every match.

s/// - The Substitution Operator

Use s/old/new/ to perform a substitution on a string, similar to sed.

while (<>) {
   s/greg/long/;        # replace the first greg on each line with long
}

Append "g" to s/// to make sure all possible replacements occur.

Can still use "i" to ignore case, put it right there next to g.

Can use =~ to perform substitution on another target.

Can also use alternate delimiters if you like.

while ($line = <>) {
   $line =~ s/greg/long/gi;
}

$path = <STDIN>;
$path =~ s@/etc@/home@g;

$path =~ s{/etc}
          {/home}g;
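
To compare the variants side by side, this sketch runs each one on a copy of the same made-up line. It also shows that s/// returns the number of substitutions made.

```perl
use strict;
use warnings;

my $line = "greg and greg and Greg";   # hypothetical input

(my $first = $line) =~ s/greg/long/;     # first match only
(my $all   = $line) =~ s/greg/long/g;    # every lowercase greg
(my $any   = $line) =~ s/greg/long/gi;   # every greg, any case

my $copy  = $line;
my $count = ($copy =~ s/greg/long/gi);   # number of replacements made

print "$first\n$all\n$any\ncount = $count\n";
```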

Match Modifiers

Use //i for case-insensitive pattern matching.

Use ///g for "global" search-replace.

Use //o for "once only" compilation. Useful when there is a variable in the pattern whose value will not change, since the variable is interpolated only the first time.

Use //x to allow for whitespace and comments in a regex, so that complex regexes can be spelled out. Three identical regexes from Prog Perl, 3rd Ed:

m/\w+:(\s+\w+)\s*\d+/;

m/\w+: (\s+ \w+) \s* \d+/x;

m{
  \w+:     # match a word then colon
  (        # group:
    \s+    #   one or more spaces
    \w+    #   another word
  )
  \s*      # optional whitespace
  \d+      # some digits
}x;

To put real whitespace or #'s in your //x'd regex, use character class or backslash to escape. Or just use \s for whitespace.

Use s///e to have the replace portion evaluated as Perl code.

s/([0-9]+)/sprintf("%02x", $1)/ge;   # replace numbers with hex versions

s/(\w+)/exists $values{$1} ? $1 : "X"/eg;   # replace words with X if hash
                                            #  values doesn't have key of that name
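
A runnable sketch of both substitutions on made-up data. The contents of $text and %values are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = "10 255 7";   # hypothetical input
(my $hex = $text) =~ s/([0-9]+)/sprintf("%02x", $1)/ge;

my %values = (greg => 1, long => 1);   # hypothetical lookup table
(my $masked = "greg was here long ago")
    =~ s/(\w+)/exists $values{$1} ? $1 : "X"/ge;

print "$hex\n";      # 0a ff 07
print "$masked\n";   # greg X X long X
```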

tr/// - The Translation Operator

Very similar in usage to s///.

Works on $_ by default. Can change the delimiter char. Does not take regexes: both sides are plain lists of characters, with ranges such as a-z allowed.

$value = "greg 1";

$value =~ tr/g/x/;            # value now "xrex 1"

$value =~ tr/a-z/A-Z/;        # upcase everything that is lower, now "XREX 1"

$value =~ tr/a-z/x/;          # change all lowercase to x (tr has no g flag)

$value =~ tr/x//d;            # delete x's

$value =~ tr/ //s;            # squeeze duplicate spaces

$value =~ tr/a-z/x/c;         # replace non-small-letters with x,
                              # c == "complement" first list

Note: tr, s/// and m// are builtin operators, not functions. They are more like "+" than chomp(). Find them in "man perlop", not "man perlfunc".
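
A sketch of the tr flags on copies of one made-up string, so each result can be seen on its own. It also shows that tr returns the number of characters it matched, which makes it handy for counting.

```perl
use strict;
use warnings;

my $value = "greg  long 42";   # hypothetical input, two spaces after "greg"

(my $upper    = $value) =~ tr/a-z/A-Z/;   # map lowercase to uppercase
(my $no_g     = $value) =~ tr/g//d;       # d: delete matched chars
(my $squeezed = $value) =~ tr/ //s;       # s: squeeze runs of spaces to one
(my $masked   = $value) =~ tr/a-z/x/c;    # c: everything NOT in a-z becomes x

my $digits = ($value =~ tr/0-9//);        # empty second list: count only, no change

print "$upper\n$no_g\n$squeezed\n$masked\ndigits = $digits\n";
```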

split & join

split is used to break a string into pieces based on a delimiter. The delimiter isn't just a character, can be a regular expression.

@fields = split(/:/, $line);                     # break passwd file entry into parts

($login, $passwd, @rest) = split(/:/, $line);    # similar

@fields = split /[:%]/, $line;                   # break on either delimiter

@fields = split /\s+/, $line;                    # break on spacing

With no arguments, split breaks $_ on whitespace, much like /\s+/ except that leading whitespace is skipped. So when the line is in $_, that last one can be rewritten as: @fields = split;
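
The default and an explicit /\s+/ differ when the string starts with whitespace. A sketch with a made-up line:

```perl
use strict;
use warnings;

my $line = "  greg long  123";   # hypothetical input with leading spaces

my @by_regex = split /\s+/, $line;   # leading whitespace yields an empty first field
my @by_space = split ' ', $line;     # a single-space string: leading whitespace skipped

my @bare;
{
    local $_ = $line;
    @bare = split;                   # no arguments: behaves like split ' ', $_
}

printf "regex: %d fields, space: %d fields, bare: %d fields\n",
    scalar @by_regex, scalar @by_space, scalar @bare;
```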

The join function puts a list of things together and glues them with a particular string (not a RE).

$passwd = join(":", $login, $passwd, $uid, $gid, $name, $home, $shell);

$passwd = join(":", @fields);

$line = join(":", split(/:/, $line));            # expensive no-op
$line = join(":", reverse(split(/:/, $line)));   # reverse fields
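
A sketch of the round trip on a made-up passwd-style entry. It also shows one caveat: split drops trailing empty fields unless given a negative limit, so the split-then-join round trip is only a true no-op when the line does not end in empty fields.

```perl
use strict;
use warnings;

# Hypothetical passwd-style entry.
my $line = "greg:x:1000:1000:Greg Long:/home/greg:/bin/sh";

my @fields   = split /:/, $line;
my $rebuilt  = join ":", @fields;            # same as $line
my $reversed = join ":", reverse @fields;    # fields in opposite order

my $trailing = "a:b::";                                 # ends in empty fields
my $lossy    = join ":", split(/:/, $trailing);         # trailing empties dropped
my $kept     = join ":", split(/:/, $trailing, -1);     # limit of -1 keeps them

print "$rebuilt\n$reversed\n$lossy\n$kept\n";
```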