Wednesday, August 6, 2008

Basic Regular Expressions



An introduction
Or regex for short. These can be a little intimidating. But I'll bet you have already used some regex in your computing life so far. Have you ever said "I'll have any Dutch beer ?" That's a regex which will match a Grolsch or Heineken, but not a Budweiser, orange juice or cheese toastie. What about dir *.txt ? That's a regular expression too, listing any files ending in .txt.
Perl's regex often look like this:

$name=~/piper/

That is saying "If 'piper' is inside $name, then True."

The regular expression itself is between / / slashes, and the =~ operator binds the regex to the variable being searched.

An example is called for. Run this, and answer it with 'the faq'. Then try 'my tealeaves' and see what happens.

print "What do you read before joining any Perl discussion ? ";
chomp ($_=<STDIN>);

print "Your answer was : $_\n";

if ($_=~/the faq/) {
print "Right ! Join up !\n";
} else {
print "Begone, vile creature !\n";
}

So here $_ is searched for 'the faq'. Guess what we don't need ! The =~ . This works just as well:
if (/the faq/) {

because if you don't specify a variable, then perl searches $_ by default. In this particular case, it would be better to use
if ($_ eq "the faq") {
as we are testing for exact matches.

Sensitivity -- regexes in touch with their inner child
But what if someone enters 'The FAQ' ? It fails, because the regex is case sensitive. We can easily fix that:

if (/the faq/i) {

with the /i switch, which specifies case-insensitivity. Now it works for all variations, such as "the Faq" and "the FAQ".
Now you can appreciate why a regular expression is better in this situation than a simple test using eq . As the regex searches one string for another string, a response of "I would read the FAQ first !" will also work, because "the FAQ" will match the regex.
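
To see the three tests side by side, here is a small sketch. The list of answers is made up for illustration, and the loop and the %result hash are just one way of laying it out, not the only one.

```perl
use strict;
use warnings;

# Sample answers, invented for this comparison.
my @answers = ('the faq', 'The FAQ', 'I would read the FAQ first !', 'my tealeaves');

my %result;
foreach my $answer (@answers) {
    my $exact   = $answer eq 'the faq'  ? 1 : 0;   # whole string must be identical
    my $regex   = $answer =~ /the faq/  ? 1 : 0;   # substring, case sensitive
    my $regex_i = $answer =~ /the faq/i ? 1 : 0;   # substring, any case
    $result{$answer} = "$exact$regex$regex_i";
    print "'$answer': eq=$exact regex=$regex regex/i=$regex_i\n";
}
```

Only the /i version accepts all three sensible answers, and none of the tests accepts the tealeaves.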

Study this example just to clarify the above. Tabs and spaces have been added for aesthetic beauty:

$_="perl for Win32"; # sets the string to be searched

if ($_=~/perl/) { print "Found perl\n" }; # is 'perl' inside $_ ? $_ is "perl for Win32".
if (/perl/) { print "Found perl\n" }; # same as the regex above. Don't need the =~ as we are testing $_
if (/PeRl/) { print "Found PeRl\n" }; # this will fail because of case sensitivity
if (/er/) { print "Found er\n" }; # this will work, because there is an 'er' in 'perl'
if (/n3/) { print "Found n3\n" }; # this will work, because there is an 'n3' in 'Win32'
if (/win32/) { print "Found win32\n" }; # this will fail because of case sensitivity
if (/win32/i) { print "Found win32 (i)\n" }; # this will *work* because of case insensitivity (note the /i)

print "Found!\n" if / /; # another way of doing it, this time looking for a space

print "Found!!\n" unless $_!~/ /; # both these are the same, but reversing the logic with unless and !
print "Found!!\n" unless !/ /; # don't do this, it will always never not confuse nobody :-)
# the ~ stays the same, but = is changed to ! (negation)

$find=32; # Create some variables to search for
$find2=" for "; # some spaces in the variable too

if (/$find/) { print "Found '$find'\n" }; # you can search for variables like numbers
if (/$find2/) { print "Found '$find2'\n" }; # and of course strings !

print "Found $find2\n" if /$find2/; # different way to do the above

As you can see from the last example, you can embed a variable in the regex too. Regular expressions could fill entire books (and they have done, see the book critiques at http://www.perl.com/) but here are some useful tricks:

Character Classes
@names=qw(Karlson Carleon Karla Carla Karin Carina Needanotherword);

foreach (@names) { # sets each element of @names to $_ in turn
if (/[KC]arl/) { # this line will be changed a few times in the examples below
print "Match ! $_\n";
} else {
print "Sorry. $_\n";
}
}

This time @names is initialised using whitespace as a delimiter instead of a comma. qw refers to 'quote words', which means split the list by words. A word ends with whitespace (like tabs, spaces, newlines etc).
The square brackets enclose single characters to be matched. Here either Karl or Carl must be in each element. It doesn't have to be two characters, and you can use more than one set. Change Line 4 in the above program to:

if (/[KCZ]arl[sa]/) {

matches if something begins with K, C, or Z, then arl, then either s or a. It does not match KCZarl. Negation is possible too, so try this :

if (/[KCZ]arl[^sa]/) {

which returns things beginning with K, C or Z, then arl, and then anything EXCEPT s or a. The caret ^ has to be the first character, otherwise it doesn't work as the negation. Having said [ ] defines single characters only, I should mention that these two are the same :
/[abcdeZ]arl/;
/[a-eZ]arl/;

if you use a hyphen then you get the list of characters including the start and finish characters. And if you want to match a special character (metacharacter), you must escape it:
/[\-K]arl/;

matches Karl or -arl. Although the - character is represented by two characters, it is just the one character to match.
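
If you would rather not edit the program three times, the sketch below runs the three character class regexes from this section over the same @names in one go. It uses grep instead of the foreach/if loop, which is my choice for brevity rather than anything the examples above require, and the @patterns and %matches names are invented here.

```perl
use strict;
use warnings;

my @names = qw(Karlson Carleon Karla Carla Karin Carina Needanotherword);

# The three regexes discussed above, held as strings and interpolated.
my @patterns = ('[KC]arl', '[KCZ]arl[sa]', '[KCZ]arl[^sa]');

my %matches;
foreach my $pattern (@patterns) {
    my @hits = grep { /$pattern/ } @names;   # keep the names that match
    $matches{$pattern} = "@hits";
    print "/$pattern/ matches: @hits\n";
}
```

Note how Carleon is the only name to survive the negated class, because it is the only one with something other than s or a after arl.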

Matching at specific points
If you want to match at the end of the line, make sure a $ is the last character in the regex. This one pulls out all those names ending in a. Slot it into the example above :

if (/a$/) {

And there is a corresponding character, the caret ^ , which in this context matches at the beginning of the string. Yes, the caret also negates a character class like this [^KCZ]arl but in this case it anchors the match to the beginning of the string.


if (/n/i) {
if (/^n/i) {

The first one is true if the word contains an 'n' anywhere in it. The second specifies that the 'n' must be at the beginning of the string to be matched. Use this anchor where you can, because it makes the whole regex faster, and safer if you know what the first character must be.
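
A quick sketch of both anchors against the same @names list. As before, grep is used only to keep it short, and the variable names are mine.

```perl
use strict;
use warnings;

my @names = qw(Karlson Carleon Karla Carla Karin Carina Needanotherword);

my @ends_in_a     = grep { /a$/ }  @names;   # 'a' must be the last character
my @contains_n    = grep { /n/i }  @names;   # an 'n' or 'N' anywhere
my @starts_with_n = grep { /^n/i } @names;   # 'n' or 'N' must be the first character

print "Ends in a     : @ends_in_a\n";
print "Contains n    : @contains_n\n";
print "Starts with n : @starts_with_n\n";
```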

Negating the regex
If you want to negate the entire regex change =~ to !~ (Remember ! means 'not'.)

if ($_ !~/[KC]arl/) {

Of course, as we are testing $_ this works too:
if (!/[KC]arl/) {
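
Both spellings of the negation should pick out the same names. A minimal sketch to check that, again leaning on grep and some made-up variable names:

```perl
use strict;
use warnings;

my @names = qw(Karlson Carleon Karla Carla Karin Carina Needanotherword);

my @explicit = grep { $_ !~ /[KC]arl/ } @names;   # spelled out with !~
my @implicit = grep { !/[KC]arl/ }      @names;   # $_ is assumed, ! negates the match

print "Explicit : @explicit\n";
print "Implicit : @implicit\n";
```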



Returning the Match
Now things get interesting. What if we want to pull something out of a string ? So far all we have done is test for truth, that is say yea or nay if a string matches, but not return what we found. Run this:
$_='My email address is <Robert@NetCat.co.uk>.';

/(<robert\@netcat.co.uk>)/i;

print "Found it ! $1\n";

Firstly, note the single quotes when $_ is assigned. If there were double quotes, we'd need \@ instead of @ . Remember, double quotes "" allow variable interpolation, so Perl looks for an array called @NetCat which does not exist.
Secondly, look at the parens around the entire regex. If you use parens, a side effect is that the first match is put into a variable called $1 . We'll get to the main effect later. The second match goes into $2 and so on. Also note that the \@ has been escaped, so perl doesn't think it is an array. Remember \ either escapes a special character, or gives a special meaning. Think of it as Superman's telephone box. Imagine Clark Kent walking around with his magic partner Back Slash.

Notice how we specify in the regex case-insensitivity with /i and the regex returns the case-sensitive string - that is, exactly what it found.

Try the regex without parens. Then try this one:

/<(robert)\@netcat.co.uk>/i;

You can put the parens anywhere. More or less. Now, run this :
$_='My email address is <Robert@NetCat.co.uk>.';

/<(robert)\@(netcat.co.uk)>/i;

print "Found it ! $1 at $2\n";

See, you can have more than one ! Look at the above regex. Looks easy now, don't you think ? What about five minutes ago ? It would have looked like a typing mistake ! Well, there are some hairier regex to come, but you'll have a good barber.
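
Here is the same idea as a self-contained sketch that copies $1 and $2 into named variables as soon as the match succeeds, which is a habit worth forming because the next successful match overwrites them. The sample string and the names $user and $domain are assumptions for the example. I have also escaped the dots in the domain, which the regex above gets away without.

```perl
use strict;
use warnings;

# Single quotes, so the @ is not treated as the start of an array.
my $text = 'My email address is <Robert@NetCat.co.uk>.';

my ($user, $domain);
if ($text =~ /<(robert)\@(netcat\.co\.uk)>/i) {
    ($user, $domain) = ($1, $2);   # save the captures straight away
    print "Found it ! $user at $domain\n";
}
```

The regex is all lower case, but thanks to /i it matches anyway and the captures come back exactly as they appear in the string.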

* + -- regexes become line noise
What if we didn't know what the email address was going to be ?

$_='My email address is <webslave@work.com>.';

print "Found it ! :$1:" if /(<.*>)/i;

When you see an if statement like this, read it right to left. The print statement is only executed if code on the right of the expression is true.
We'll discuss this. Firstly, we have the opening parens ( . So everything from ( to ) will be put into $1 if the match is successful. Then the first character of what we are searching for, < . Then we have a dot, or period . . For this regex, we can assume . matches any character at all.

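So we are now matching < followed by any character. The * means 0 or more of the previous character, and the regex finishes by requiring > .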

This is important. Get the basics right and all regex are easy (I read somewhere once). An example best illustrates the point. Slot this regex in instead:

$_='My email address is <webslave@work.com>.';

print "Found it ! :$1:" if /(<*>)/i;


What's happening here ?
The regex starts, logically, at the start of the string. This doesn't mean it starts at 'M', it starts just before M. There is a 'nothing' between the string start and 'M'.

The regex is searching for <* , which is 0 or more < .

The first thing it finds is not < , but the nothing in between the start of the string and the 'M' from 'My email...'. Does this match ?

As the regex is looking for "0 or more" < , we can certainly say that there are 0 < at the start of the string. So the match is, so far, successful. We have dealt with <* .

However, the next item to match is > . Unfortunately, the next item in the string is 'M', from 'My email...'. The match fails at this point. Sure, it matched < without any problem, but the complete match has to work.

The only two characters that can match successfully at this point are < or > . The 'point' being that <* has been matched successfully, and we need either > to complete the match or more of < to continue the '0 or more' match denoted by * .

'M' is neither of them, so the match fails at this point, when it has matched only 0 < .

Quick clarification - the regex cannot successfully match < , then skip on ahead through the string until it matches > . The characters in the string between < > also need to match the regex, and they don't in this case.

All is not lost. Regexes are hardy little beasts and don't give up easily. An attempt is made to match the regex wherever possible. The regex system keeps trying the match at every possible place in the string, working towards the end.

Let's look at the match when it reaches the 'm' in 'work.com'.

Again, we have here 0 < . So the match works as before. After success on <* the next character is analysed - it is a > , so the match is successful.

But, be warned. The match may be successful but your job is not done. Assuming the objective was to return the email address within the angle brackets then that regex is a miserable failure: all it returns is > . Watch for traps of this nature when regexing.

That's * explained. Just to consolidate, a quick look at:

$_='My email address is <webslave@work.com>.';
print "Match 1 worked :$1:" if /(<*)/i;

$_='<My email address is <webslave@work.com>.';
print "Match 2 worked :$1:" if /(<*)/i;

$_='My email address is <webslave@work.com<<<<>.';
print "Match 3 worked :$1:" if /(<*>)/i;


Match 1 is true. It doesn't return anything, but it is true because there are 0 < at the very start of the string.
Match 2 works. After the 0 < at the start of the string, there is 1 < so the regex can match that too.

Match 3 works. After failing on the first < , it jumps to the second. After that, there are plenty more to match right up until the required ending.
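
If you want to watch where in the string each match actually lands, Perl keeps the starting offset of the last successful match in $-[0] (the first element of the built-in array @- ). The sketch below is an assumption-laden illustration: the sample address and the @report array are made up, and the patterns are held in strings so one loop can try them all.

```perl
use strict;
use warnings;

# Sample string, assumed for this example. Single quotes keep @work literal.
my $text = 'My email address is <webslave@work.com>.';

my @report;
foreach my $pattern ('(<*)', '(<*>)', '(<.*>)') {
    if ($text =~ /$pattern/) {
        my $offset = $-[0];   # where in the string the match began
        push @report, "$pattern at $offset got '$1'";
    }
}
print "$_\n" foreach @report;
```

The first pattern succeeds at offset 0 with nothing captured, the second only succeeds right at the closing > , and the third is the one that returns the whole address.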

Glad you followed that. Now, pay even closer attention ! Concentrate fully on the task at hand ! This should be straightforward now:

$_='HTML <I>munging</I> time !.';

/<I>(.*)<\/I>/i;

print "Found it ! $1\n";

Pretty much the same as the above, except the parens are moved so we return what's only inside the tags, not including the tags themselves. Also note how / is escaped like so; \/ otherwise Perl thinks that's the end of the regex.
Now, suppose we change $_ to :

$_='HTML <I>munging</I> time is here <I>again</I> !.';


and run it again. Interesting effect, eh ? This is known as Greedy Matching. What happens is that when Perl finds the initial match, that is <I> , the .* jumps right to the end of the string and works back from there to find a match, so the longest string matches. This is fine unless you want the shortest string. And there is a solution:
/(.*?)<\/I>/i;

Just add a question mark and Perl does stingy matching. No nationalistic jokes. I have Dutch and Scottish friends I don't want to offend.
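
A side-by-side sketch of greedy and stingy matching. The sample string with two pairs of I tags is an assumption for the example, and the match is done in list context so the capture lands directly in a variable instead of $1 .

```perl
use strict;
use warnings;

# Sample HTML with two italic sections, assumed for illustration.
my $html = 'HTML <I>munging</I> time is here <I>again</I> !.';

my ($greedy) = $html =~ /<I>(.*)<\/I>/i;    # .* grabs as much as it can
my ($stingy) = $html =~ /<I>(.*?)<\/I>/i;   # .*? stops at the first </I>

print "Greedy : $greedy\n";
print "Stingy : $stingy\n";
```

The greedy version runs from the first opening tag right through to the last closing tag, taking the tags in the middle along for the ride.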




The Difference Between + and *
You know what * means, namely match 0 or more. If you want to match 1 or more, then use + . The difference is important.
$_='The number is 2200 and the day is Monday';

($star)=/([0-9]*)/;

($plus)=/([0-9]+)/;

print "Star is '$star' and Plus is '$plus'\n";

You'll note that $star has no value. The match was successful though. It managed to match 0 or more characters from 0 to 9 at the very start of the string.

The second regex with $plus worked a little better, because we are matching one or more characters from 0 to 9. Therefore, unless one 0 to 9 is found the match will fail. Once a 0-9 is found, the match continues as long as the next character is 0-9, then it stops.

Now we know this, there is another way to remove an email address from within angle brackets:

$_='My email address is <webslave@work.com> !.';

/<([^>]+)/i;

print "Found it ! $1\n";

This regex matches < . Then the capturing parens start. They have no effect on this regex other than to capture the match. After that, there is a character class, containing one character. As ^ is the first character in the class, it negates the class. That's why we are using a character class with only one character in it, because it can be negated.

So far we have matched < followed by one character that is not > . The + ensures we match as many characters that are not >'s as we can. Here this has the same effect as .*? but is more efficient. It may also suit your purposes, as .*? relies on you knowing what you want to match up to, whereas [^>]+ simply continues matching until it finds something that fails its criteria. Just make sure you understand the difference because it is a crucial part of regexery.
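
The difference shows up when the closing > is missing. In this sketch both sample strings are invented, and the second one deliberately has no closing bracket. The stingy version needs its > to finish the match, the negated class does not.

```perl
use strict;
use warnings;

# Sample strings, assumed for illustration. The second has no closing > .
my $complete = 'My email address is <webslave@work.com> !.';
my $broken   = 'My email address is <webslave@work.com';

my ($class_complete)  = $complete =~ /<([^>]+)/;
my ($stingy_complete) = $complete =~ /<(.*?)>/;
my ($class_broken)    = $broken   =~ /<([^>]+)/;
my ($stingy_broken)   = $broken   =~ /<(.*?)>/;    # fails, so stays undefined

foreach my $found ($class_complete, $stingy_complete, $class_broken, $stingy_broken) {
    print defined $found ? "Found $found\n" : "No match\n";
}
```

Which behaviour you want depends on whether a missing > should count as a failure or not.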



Re-using the match -- \1, $1...
Suppose we didn't know what HTML tag we had to match ? It could be B, I, EM or whatever, and we want everything that is in between. Well, HTML container tags like B and EM have end tags which are the same as the start tag, except for the / . So what we could do is:

find out what is inside < >
search for exactly the same tag, but with the closing /
return whatever is in between.
Can this be done ? Of course. This is perl, all things are possible. Now, remember the side effect of parens. I promise I'll explain the primary effect at some point. If whatever is in (parens) matches, the result is stored in a variable called $1 . So we can use <(.*?)> which will find us whatever is inside the first < > (the ? forces stingy matching).
The result is stored in $1 because we used parens. Next, we need everything up to the closing tag. That's easy : (.*?) matches everything up until the next character or set of characters. And how exactly do we define where to stop ?

We can use $1 even in the same regex it was found in. However, it is not referred to within a regex as $1 , but \1 .

So we want to match the closing tag, that is </ followed by whatever is in $1 and then > , which in perl code is <\/\1> . The / must be escaped because otherwise it would end the regex, and 1 is escaped so it refers to $1 instead of matching the number 1.

Still here ? This is what it looks like:

$_='HTML <I>munging</I> time is here <I>again</I> !.';
/<(.*?)>(.*?)<\/\1>/i;

print "Found it ! $2\n";
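
To show that \1 really does insist on the same tag, here is a sketch with two made-up strings. The first has properly paired tags. The second opens with B but closes with I, so the regex has to move along until it finds a pair that agrees.

```perl
use strict;
use warnings;

# Both sample strings are assumptions for this example.
my $paired     = 'HTML <B>munging</B> time is here <I>again</I> !.';
my $mismatched = '<B>bold</I> and <EM>this</EM>';

my ($tag1, $inside1) = $paired     =~ /<(.*?)>(.*?)<\/\1>/i;
my ($tag2, $inside2) = $mismatched =~ /<(.*?)>(.*?)<\/\1>/i;

print "Paired     : tag $tag1 holds '$inside1'\n";
print "Mismatched : tag $tag2 holds '$inside2'\n";
```

In the second string nothing starting at the B tag can ever find a matching close, so the first successful match is the EM pair.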

If you want to know how to return all the matches above, read on. But before that:

How to Avoid Making Mountains while Escaping Special Characters
You want to match this; http://language.perl.com/faq/ . That's a real (useful) URL by the way. Hint. To match it, you need to do this:

/http:\/\/language\.perl\.com\/faq\//;

which should make the awful metaphor above clearer, if not funnier. The slash, / , is not normally a metacharacter but as it is being used for the regular expression delimiters, it needs to be escaped. We already know that . is special.
Fortunately for our eyes, Perl allows you to pick your delimiter if you prefix it with 'm' as this example shows. We'll use a #:

m#http://language\.perl\.com/faq/#;

Which is a huge improvement, as we change / to # . We can go further with readability by quoting everything:
m#\Qhttp://language.perl.com/faq/\E#;

The \Q escapes everything up until \E or the regex delimiter (so we don't really need the \E above). In this case # will not be escaped, as it delimits the regex.
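
A small sketch of why the escaping matters and not just how it looks. The $lookalike URL is invented: it has an X where the first dot should be. An unescaped . happily matches that X, whereas the \Q version treats the dot as a plain dot.

```perl
use strict;
use warnings;

my $url       = 'http://language.perl.com/faq/';
my $lookalike = 'http://languageXperl.com/faq/';   # made-up near miss

my $slashes = $url =~ /http:\/\/language\.perl\.com\/faq\// ? 1 : 0;   # escaped by hand
my $hashes  = $url =~ m#http://language\.perl\.com/faq/#    ? 1 : 0;   # new delimiter
my $quoted  = $url =~ m#\Qhttp://language.perl.com/faq/\E#  ? 1 : 0;   # \Q does the work

my $loose  = $lookalike =~ m#http://language.perl.com/faq/#       ? 1 : 0;   # . matches X
my $strict = $lookalike =~ m#\Qhttp://language.perl.com/faq/\E#   ? 1 : 0;   # . is literal

print "Real URL  : $slashes $hashes $quoted\n";
print "Lookalike : loose=$loose strict=$strict\n";
```

All three ways of writing the regex match the real URL. Only the unescaped one is fooled by the lookalike.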
Someone once posted a question about this to the Perl-Win32-Users mailing list and I was so intrigued about this apparently undocumented trick I spent the next twenty minutes figuring it out by trial and error, and posted a reply. Next day I found lots of messages telling the poster to read the manual because it was clearly documented. My excuse was I didn't have the docs to hand....moral of the story - RTFM and RTF FAQs !

