Wednesday, August 6, 2008

Substitution and Yet More Regex Power


Basic changes
Suppose you want to replace bits of a string. For example, 'us' with 'them'.
$_='Us ? The bus usually waits for us, unless the driver forgets us.';

print "$_\n";

s/Us/them/; # operates on $_, otherwise you need $foo=~s/Us/them/;

print "$_\n";

What happens here is that the string 'Us' is searched for, and when a match is found it is replaced with the right side of the expression, in this case 'them'. Simple.
You'll notice that only one substitution was made. To match globally use /g which runs through the entire string, changing wherever it can. Try:

s/Us/them/g;


which fails to change the lowercase occurrences. This is because regexes are, by default, case-sensitive. So:
s/us/them/ig;


would be a better bet. Now, everything is changed. A little too much, but one problem at a time. Everything you have learned about regex so far can be used with s/// , like parens, character classes [ ] , greedy and stingy matching and much more. Deleting things is easy too. Just specify nothing as the replacement, like so s/Us//; .
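As a minimal sketch of deletion, here is the same sentence with every 'us' removed, whatever its case. The variable names $text and $deleted are only for illustration; the original examples work on $_ instead.

```perl
use strict;
use warnings;

my $text = 'Us ? The bus usually waits for us, unless the driver forgets us.';

# Copy the string, then delete from the copy.
# An empty replacement removes each match; /i ignores case, /g repeats.
(my $deleted = $text) =~ s/us//ig;

print "$deleted\n";
```

Note how 'bus' and 'usually' lose their 'us' too, which is exactly the over-eager matching the next steps deal with.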
So we can use some of that knowledge to fix this problem. We need to make sure that a space precedes the 'us'. What about:

s/ us/them/g;


A small improvement. The first 'Us' is now no longer changed, but one problem at a time ! We'll first consider the problem of the regex changing 'usually' and other words with 'us' in them.
What we are looking for is a space, then 'us', then a comma, period or space. We know how to specify one of a number of options - the character class.

s/ us[. ,]/them/g;


Another tiny step. Unfortunately, that step wasn't really in the right direction, more on the slippery slope to Poor Programming Practice. Why ? Because we are limiting ourselves. Suppose someone wrote ' send it to us; when we get it'.
You can't think of all the possible permutations. It is often easier, and safer, to simply state what must not follow the match. In this case, it can be anything except a letter. We can define that as a-z. So we can add that to the regex.

s/ us[^a-z]/ them/g;


The caret ^ negates the character class, and a-z represents every letter from a to z inclusive. A space has been added to the substitution part - as the original space was matched, it should be replaced to maintain readability.



\w
What would be more useful is to use a-zA-Z instead. If we weren't using /i we'd need that. As a-zA-Z is such a common construct, Perl provides an easy shorthand:
s/ us[^\w]/ them/g;


The \w construct actually means 'word' - equivalent to a-zA-Z_0-9 . So we'll use that instead.
To negate any construct, simply capitalise it:

s/ us[\W]/ them/g;


and of course we don't need the negating caret now. In fact, we don't even need the character class !
s/ us\W/ them/g;


So far, so good. Matching the first 'us' is going to be difficult though. Fortunately, there is an easy solution. We've seen Perl's definition of a word - \w . Between each word is a boundary. You can match this with \b .
s/\bus\W/ them/ig;


(that's \b followed by 'us', not 'bus' :-)
Now, we require a word boundary before 'us'. As there is a 'nothing' at the start of the string, we have a match. There is a space after the first 'Us', so the match is successful. You might notice an extra space has crept in - that's the space we added earlier. The match doesn't include the space any more - it matches on the word boundary, that is just before the word begins. The space doesn't count.
Did you notice the final period and the comma are replaced ? They are part of the match - it is the \W that matches them. We can't avoid that.


Replacing with what was found
We can however put back that part of the match.
s/\bus(\W)/them\1/ig;


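We start with capturing whatever the \W matches, using parens. Then, we add it to the replacement string. The capture is of course in $1 , but as it is in a regex we refer to it as \1 . (Perl accepts \1 in the replacement for compatibility with sed, but $1 is the preferred form there, and the only one that works once you start using /e.)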
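Here is a small sketch of the same substitution written with $1 in the replacement and with /i so that the leading 'Us' is caught as well. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = 'Us ? The bus usually waits for us, unless the driver forgets us.';

# \b demands a word boundary before 'us'; (\W) captures the character
# that follows, and $1 puts that character back after 'them'.
(my $changed = $text) =~ s/\bus(\W)/them$1/ig;

print "$changed\n";
```

The words 'bus' and 'usually' survive, and the comma and period are restored. The capitalisation of the first word is still wrong, of course.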
The final problem is of course capitalising the replacement string when appropriate, which in old versions of the tutorial I left as an exercise to the reader, having run out of motivation. A reader by the name of Paul Trafford duly solved the problem, and I have just inserted his excellent explanation for the elucidation of all concerned:


# Solution to the us/them problem...
#
# The program works through the text assigning the
# variable $1 to 'U' or 'u' for any words where this
# letter is followed by 's' and then by non 'word'
# characters. The latter is assigned to variable $2.
#
# For each such matching occurrence, $1 is replaced by
# the letter that precedes it in the alphabet using
# operations 'ord' and 'chr' that return the ASCII value
# of a character and the character corresponding to a
# given natural number. After this 'hem' is tacked on
# followed by $2, to retain the shape of the original
# sentence. The '/e' switch is used for evaluation.
#
# NOTES
# 1. This solution will not replace US (short for
# United States) with Them or them.
#
# 2. If a 'magical' decrement operator '--' existed for
# strings then the solution could be simplified for we
# wouldn't need to use the 'chr' and 'ord' operators.


$_='Us ? The bus usually waits for us, unless the driver forgets us.';

print "$_\n";

s/\b([Uu])s(\W)/chr(ord($1)-1).'hem'.$2/eg; # 'hem' is quoted so it still works under use strict

print "$_\n";

An excellent solution, thanks Paul.
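For comparison, here is one possible variation on the same idea. It is not Paul's method, just a sketch that swaps the chr/ord arithmetic for a plain test on the captured letter; the variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = 'Us ? The bus usually waits for us, unless the driver forgets us.';

# /e evaluates the replacement as Perl code. $1 holds 'U' or 'u',
# so we pick the capitalised or lowercase word accordingly and
# then append the non-word character captured in $2.
(my $changed = $text) =~ s/\b([Uu])s(\W)/($1 eq 'U' ? 'Them' : 'them') . $2/eg;

print "$changed\n";
```

Like the original, this leaves 'US' alone because the character class only covers the first letter.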

There are several more constructs. We'll take a quick look at \d which means anything that is a digit, that is 0-9 . First we'll use the negated form, \D , which is anything except 0-9 :

print "Enter a number :";
chop ($input=<STDIN>);

if ($input=~/\D/) {
print "Not a number !!!!\n";
} else {
print 'Your answer is ',$input x 3,"\n";

}

This checks that there are no non-number characters in $input . It's not perfect because it'll choke on decimal points, but it's just an example. Writing your own number-checker is actually quite difficult, but it is an interesting exercise. Try it, and see how accurate yours is.
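If you want a starting point for that exercise, here is one rough sketch. The function name looks_like_decimal is made up, and the rules it enforces (optional minus sign, digits, optional decimal part) are an assumption about what counts as a number; exponents, leading plus signs and the like are deliberately ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: true for strings such as 42, -7 or 3.14.
sub looks_like_decimal {
    my ($input) = @_;
    # ^ and \z pin the pattern to the whole string, so stray
    # characters before or after the number make the match fail.
    return $input =~ /^-?\d+(?:\.\d+)?\z/ ? 1 : 0;
}

for my $candidate ('42', '-7', '3.14', '1.2.3', 'abc', '') {
    my $verdict = looks_like_decimal($candidate) ? 'number' : 'not a number';
    print "'$candidate' is $verdict\n";
}
```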



x
I hope you trusted me and typed the above in exactly as it is shown (or pasted it), because the x is not a mistake, it is a feature. If you were too smart and changed it to a * or something, change it back and see what it does.
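In case you can't run it right now, here is a tiny sketch of the difference between x and * . The variable names are only for illustration.

```perl
use strict;
use warnings;

my $input = 5;

my $repeated = $input x 3;   # x repeats the value as a string
my $product  = $input * 3;   # * multiplies it as a number
my $rule     = '-' x 10;     # handy for drawing separator lines

print "x gives $repeated, * gives $product\n";
print "$rule\n";
```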
Of course, there is another way to do it :

unless ($input=~/\D/) {
print 'Your answer is ',$input x 3,"\n";
} else {
print "Not a number !!!!\n";
}

which reverses the logic with an unless statement.

More Matching
Assume we have:
$_='HTML <i>munging</i> time is here again !.';

and we want to find all the italic words. We know that /g will match globally, so surely this will work :
$_='HTML <i>munging</i> time is here again ! What <i>fun</i> !';

$match=/<i>(.*?)<\/i>/ig;

print "$match\n";

except it returns 1, and there were definitely two matches. The match operator returns true or false, not the number of matches. So you can test it for truth with constructs like if, while and unless. Incidentally, the s/// operator does return the number of substitutions.
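A short sketch of the two return values side by side. The sample string, with two italic words, is an assumption made for the example, as are the variable names.

```perl
use strict;
use warnings;

# Assumed sample data: two words wrapped in <i> tags.
my $html = 'HTML <i>munging</i> time is here again ! What <i>fun</i> !';

# In scalar context the match only says whether it succeeded.
my $matched = ($html =~ /<i>(.*?)<\/i>/i);

# s///g, on the other hand, reports how many substitutions it made.
my $copy    = $html;
my $changed = ($copy =~ s/<i>/<b>/ig);

print "Match returned $matched, substitution returned $changed\n";
```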
To return what is matched, you need to supply a list.

($match) = /<i>(.*?)<\/i>/i;

which handily puts the first match into $match . Note that an = is used (for assignment), as opposed to =~ (to point the regex at a variable other than $_).

The parens force a list context in this case. There is just the one element in the list, but it is still a list. The entire match will be assigned to the list, or whatever is in the parens. Try adding some parens:

$_='HTML <i>munging</i> time is here again ! What <i>fun</i> !';

($word1, $word2) = /<i>(.*?)<\/i>/ig;

print "Word 1 is $word1 and Word 2 is $word2\n";

In the example above notice /g has been added so a global match is done - this means perl carries on matching even after it finds the first match. Of course, you might not know how many matches there will be, so you can just use an array, or any other type of list:
$_='HTML <i>munging</i> time is here again ! What <i>fun</i> !';

@words = /<i>(.*?)<\/i>/ig;

foreach $word (@words) {
print "Found $word\n";
}

and @words will be grown to the appropriate size for the matches. You really can supply what you like to be assigned to:
($word1, @words[2..3], $last) = /<i>(.*?)<\/i>/ig;

you'll need more italics for that last one to work. It was only a demonstration.
There is another trick worth knowing. Because a regex returns true each time it matches, we can test that and do something every time it returns true. The ideal construct is while which means 'do something as long as the condition I'm testing is true'. In this case, we'll print out the match every time it is true.

$_='HTML <i>munging</i> time is here again ! What <i>fun</i> !';

while (/<(.*?)>(.*?)<\/\1>/g) {
print "Found the HTML tag $1 which has $2 inside\n";
}

So the while operator runs the regex, and if it is true, carries out the statements inside the block.
Try running the program above without the /g . Notice how it loops forever ? That's because the expression always evaluates to true. By using the /g we force the match to move on until it eventually fails.

Now we know this, an easy way to find the number of matches is:

$_='HTML <i>munging</i> time is here again ! What <i>fun</i> !';

$found++ while /<i>.*?<\/i>/ig;

print "Found $found matches\n";

You don't need braces in this case as nothing apart from the expression to be evaluated follows the while function.
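There is also a well-known idiom that counts the matches without a loop, by assigning the list of matches to an empty list and then reading that assignment in scalar context. A minimal sketch, with an assumed sample string and made-up variable names:

```perl
use strict;
use warnings;

# Assumed sample data: two words wrapped in <i> tags.
my $html = 'HTML <i>munging</i> time is here again ! What <i>fun</i> !';

# The match runs in list context because of the empty list ();
# assigning that list assignment to a scalar yields the number
# of items the match returned.
my $found = () = $html =~ /<i>.*?<\/i>/ig;

print "Found $found matches\n";
```

Whether this reads better than the while version is a matter of taste; the loop is probably easier to follow the first time you see it.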

Parentheses Again: OR
The real use for them. Precedence. Try this, and yes you can try it at home:
$_='One word sentences ? Eliminate. Avoid clichés like the plague. They are old hat.';

while (/o(rd|ne|ld)/gi) {
print "Matched $1\n";
}

Firstly, notice the subtle introduction of the or operator, in this case | , the pipe. What I really want to explain however, is that this regex matches o followed by rd, ne or ld. Without the parens it would be /ord|ne|ld/ which is definitely not what we want. That matches just plain ord, or ne or ld.
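To see the difference the parens make, here is a small sketch that runs both patterns against two test words. The words and the %result hash are just for illustration.

```perl
use strict;
use warnings;

my %result;

for my $word ('word', 'never') {
    # With parens: an 'o' must come before rd, ne or ld.
    my $grouped   = $word =~ /o(rd|ne|ld)/ ? 'yes' : 'no';
    # Without parens: 'ord', or 'ne', or 'ld' anywhere will do.
    my $ungrouped = $word =~ /ord|ne|ld/   ? 'yes' : 'no';

    $result{$word} = "$grouped/$ungrouped";
    print "$word: grouped=$grouped ungrouped=$ungrouped\n";
}
```

'never' contains 'ne' but no 'o', so only the pattern without parens matches it.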




(?: OR Efficiency)
In the interests of efficiency, consider this:
print "Give me a name :";
chop($_=<STDIN>);

print "Good name\n" if /Pe(tra|ter|nny)/;

The code above functions correctly. If you were wondering what a good name is, Petra, Peter and Penny qualify. The regex is not as efficient as it could be though. Think about what Perl is doing with the regex, that you are just ignoring. Simply throwing away casually. Without consideration as to the effort that has gone into creating it for you. The resources squandered. The little bytes of memory whose sole function in life is to store this information, which will never be used.

What's happening is that because parens are used, perl is creating $1 for your usage and abusage. While this may not seem important, a fair amount of resources go into creating $1, $2 and so on. Not so much the memory used to store them, more the CPU effort involved. So, if you aren't going to use the parens for capturing purposes, why bother capturing the match?

print "Give me a name :";
chop($_=<STDIN>);

print "Good name\n" if /Pe(?:tra|ter|nny)/;

print "The match is :$1:\n";

The second print statement demonstrates that nothing is captured this time. You get the benefits of the parens' precedence-changing capabilities, but without the overhead of the capturing. This benefit is especially worthwhile if you are writing CGI programs which use parens in regex -- with CGI, every little bit of efficiency counts.
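A side effect worth seeing in a sketch: because (?: ) does not capture, it does not use up a capture number either, so the next ordinary pair of parens becomes $1. The name and the $surname variable are invented for the example.

```perl
use strict;
use warnings;

my $name = 'Peter Piper';
my $surname = '';

# (?:tra|ter|nny) groups the alternatives without capturing,
# so the (\w+) that follows is the first capture, $1.
if ($name =~ /Pe(?:tra|ter|nny) (\w+)/) {
    $surname = $1;
    print "Good name, and the surname is $surname\n";
}
```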



Matching specific amounts of...
Finally, take a look at this :

$_='I am sleepy....zzzz....DING ! Wake Up!';

if (/(z{5})/) {
print "Matched $1\n";
} else {
print "Match failed\n";
}

The braces { } specify how many of the preceding character to match. So z{2} matches exactly two 'z's and so on. Change z{5} to z{4} and see how it works. And there's more...
/z{3}/      3 z only
/z{3,}/     At least 3 z
/z{1,3}/    1 to 3 z
/z{4,8}/    4 to 8 z
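Here is a rough sketch that tries each of those quantifiers against a string of four z's. One assumption to flag: the patterns are anchored with \A and \z so they must account for the whole string. Without anchors, /z{3}/ would happily match three of the four z's and report success.

```perl
use strict;
use warnings;

my $zs = 'zzzz';
my %whole;   # pattern => 1 if it matches the entire string

for my $pattern ('z{3}', 'z{3,}', 'z{1,3}', 'z{4,8}') {
    # \A and \z force the quantified z's to cover the whole string.
    $whole{$pattern} = $zs =~ /\A(?:$pattern)\z/ ? 1 : 0;
    print "$pattern ", ($whole{$pattern} ? 'matches' : 'does not match'), " all of '$zs'\n";
}
```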


To any of the above you may suffix a question mark, the effect of which is demonstrated in the following program. Run it a couple of times, inputting 2, 3 and 4:

print "How many letters do you want to match ? ";
chomp($num=<STDIN>);

# we assign and print in one smooth move
print $_="The lowest form of wit is indeed sarcasm, I don't think.\n";

print "Matched \\w{$num,} : $1 \n" if /(\w{$num,})/;

print "Matched \\w{$num,?}: $1 \n" if /(\w{$num,}?)/;

The first match is 'match any word (that's a-zA-Z0-9_) equal to or longer than $num characters, and return it.' So if you enter 4, then 'lowest' is returned. The word 'The' doesn't match.

The second match is exactly the same, but the ? forces a minimal match, so only the part actually matched is returned.

Just to clear this up, amend the program thus:


print "\nMatched \\w{$num,} :";
print "$1 " while /(\w{$num,})/g;

print "\nMatched \\w{$num,?} :";
print "$1 " while /(\w{$num,}?)/g;

Note the addition of /g . Try it without - notice how the match never moves on ?
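If you'd rather see the two side by side without typing a number in, here is a sketch with $num fixed at 4 and the matches collected into arrays. The array names are made up for the example.

```perl
use strict;
use warnings;

my $num  = 4;
my $text = "The lowest form of wit is indeed sarcasm, I don't think.";

# Greedy: each match grabs as many word characters as it can.
my @greedy  = $text =~ /(\w{$num,})/g;
# Minimal: each match stops as soon as it has $num characters.
my @minimal = $text =~ /(\w{$num,}?)/g;

print "Greedy : @greedy\n";
print "Minimal: @minimal\n";
```

The minimal version never returns more than four characters, and the leftovers of each word ('st', 'ed', 'asm', 'k') are too short to match on their own.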




Pre, Post, and Match
And now on the Regex Programme Today, we have guest stars Prematch, Postmatch and Match. All of whom are going to slow our entire programme down, but are useful anyway :
$_='I am sleepy....snore....DING ! Wake Up!';

/snore/; # look, no parens !

print "Postmatch: $'\n";
print "Prematch: $`\n";
print "Match: $&\n";


If you are wondering what the difference between match and using parens is, you should remember that you can move the parens around, but you can't vary what $& and its ilk return. Also, using any of the above three variables does slow your entire program, whereas using parens will just slow the particular regex you use them for. However, once you've used one of the three matches you might as well use them all over the place as you've paid the speed penalty. Use parens where possible.
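As a sketch of the parens-only approach, the pattern below captures the same three pieces that the special variables would give you. The variable names $pre, $match and $post are invented for the example.

```perl
use strict;
use warnings;

my $text = 'I am sleepy....snore....DING ! Wake Up!';

# (.*?) takes as little as possible before 'snore', and the final
# (.*) takes everything after it; /s lets . match newlines too.
my ($pre, $match, $post) = $text =~ /^(.*?)(snore)(.*)$/s;

print "Prematch: $pre\n";
print "Match: $match\n";
print "Postmatch: $post\n";
```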

RHS Expressions


/e
RHS means Right Hand Side. Suppose we have an HTML file, which contains:
<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>

and we wish to double the size of each font so 2 becomes 4 and 4 becomes 8 etc. What about :
$data="<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>";

print "$data\n";

$data=~s/(size=)(\d)/\1\2 * 2/ig;

print "$data\n";

which doesn't really work out. What this does is match size=x, where x is any digit. The first match, size=, goes into $1 and the second match, whatever the digit is, goes into $2 . The second part of the regex simply prints $1 and $2 (referred to as \1 and \2 ), and attempts to multiply $2 by 2. Remember /i means case insensitive matching.
What we need to do is evaluate the right hand side of the regex as an expression - that is not just print out what it says, but actually evaluate it. That means work it through, not blindly treat it as a string. Perl can do this:

$data=~s/(size=)(\d)/$1.($2 * 2)/eig;

A little explanation....the LHS is the same as before. We add /e so Perl evaluates the RHS as an expression. So we need to change \1 into $1 and so on. The parens are there to ensure that $2 * 2 is evaluated, then joined to $1 . And that's it !
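Here is a self-contained sketch of the /e version. The three FONT tags are assumed sample data, and $doubled is a made-up name.

```perl
use strict;
use warnings;

# Assumed sample data: three font tags with single-digit sizes.
my $data = '<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>';

# /e runs the replacement as code: $1 is 'SIZE=' exactly as it
# appeared, and ($2 * 2) is the doubled digit joined on with '.'.
(my $doubled = $data) =~ s/(size=)(\d)/$1 . ($2 * 2)/eig;

print "$data\n$doubled\n";
```

Note that (\d) only matches a single digit, so a size of 10 or more would need \d+ instead.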




/ee
It is even possible to have more than one /e . For example:
$data='The function is <5funcA>';

$funcA='*2+4';

print "$data\n";

$data=~s/<(\d)(\w+)>/($1+2).${$2}/; # first time
# $data=~s/<(\d)(\w+)>/($1+2).${$2}/e; # second time
# $data=~s/<(\d)(\w+)>/($1+2).${$2}/ee; # third time

print "$data\n";


To properly appreciate this you need to run it three times, each time commenting out a different line. Only one regex line should be uncommented when the program is run.

The first time round the regex is a dumb variable interpolation. Perl just searches the string for any variables, finds $1 and $2, and replaces them.

Second time round the expression is evaluated, as opposed to just plain variable-interpolated. This means that $1+2 is evaluated. $1 has a value of 5, plus 2 == 7. The other part of the replacement, ${$2} is evaluated only so far as working out that the variable named $2 should be placed in the string.

Third time round and Perl now makes a second pass through the string, looking for things to do. After the first pass, and just before that second pass the string looks like this; 7*2+4 . Perl evaluates this, and prints the result.

So the more /e 's you add on the end of the regex, the more passes Perl makes through the replacement string trying to evaluate the code.
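To compare the three runs in one go, here is a sketch that applies each variant to its own copy of the string. It assumes the variable is spelled $funcA in both the data and the assignment, and it leaves out use strict on purpose, because ${$2} is a symbolic reference, which strict forbids.

```perl
# No "use strict" here: ${$2} looks up a variable by name.
our $funcA = '*2+4';   # package variable, so the lookup by name can find it

my $data = 'The function is <5funcA>';

# No /e: the replacement is only interpolated as a string.
(my $none  = $data) =~ s/<(\d)(\w+)>/($1+2).${$2}/;
# One /e: the replacement is evaluated once as an expression.
(my $once  = $data) =~ s/<(\d)(\w+)>/($1+2).${$2}/e;
# Two /e: the result of that evaluation is evaluated again.
(my $twice = $data) =~ s/<(\d)(\w+)>/($1+2).${$2}/ee;

print "$none\n$once\n$twice\n";
```

The first line still contains the unevaluated sum, the second contains 7*2+4, and the third contains the final answer.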

This is fairly advanced stuff here, and it is probably not something you will use every day. But knowing it is there is handy.



A Worked Example: Date Change
Imagine you have a list of dates which are in the US format of month, day, year as opposed to the rest of the world's logical notion of day, month, year. We need a regex to transpose the day and month. The dates are:
@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'6/16/1993'
);

The task can be split into steps such as:
Match the first digit, or two digits. Capture this result.
Match the delimiter, which appears to be one of / - .
Match the second two digits, and capture that result
Rebuild the string, but this time reversing the day and month.
That may not be all the steps, but it is certainly enough for a start. Planning regex is important. So, first pass:
@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'6/16/1993'
);

foreach (@dates) {
print;
s#(\d\d)/(\d\d)#$2/$1#;
print " $_\n";
}

Hmm. This hasn't worked for the dates delimited with - . , and the last date hasn't worked either. The first problem is pretty easy; we are just matching / , nothing else. The second problem arises because we are matching two digits. Therefore, 5/15/87 is matched on the 15 and 87, not the 5 and 15. The date 6/16/1993 is matched on the 16 and the 19 of 1993.

We can fix both of those. First, we'll match either 1 or 2 digits. There are a few ways of doing this, such as \d{1,2} which means either 1 or two of the preceding character, or perhaps more easily \d\d? which means match one \d and the other digit is optional, hence the question mark. If we used \d+ then that would match 19988883 which is not a valid date, at least not as far as we are concerned.

Secondly, we'll use a character class for all the possible date delimiters. Here is just the loop with those amendments:

foreach (@dates) {
print;
s#(\d\d?)[/-.](\d\d?)#$2/$1#;
print " $_\n";
}

which fails. Examine the error statement carefully. The key word is 'range'. What range? Well, the range between / and . because - is the range operator within a character class. That means it is a special character, or a metacharacter. And to negate the special meaning of metacharacters we have to use a backslash.

But wait! I don't hear you cry. Surely . is a metacharacter too? It is, but not within a character class so it doesn't need to be escaped.

foreach (@dates) {
print;
s#(\d\d?)[/\-.](\d\d?)#$2/$1#;
print " $_\n";
}

Nearly there. However, we are always replacing the delimiter with / which is messy. That's an easy fix:
foreach (@dates) {
print;
s#(\d\d?)([/\-.])(\d\d?)#$3$2$1#;
print " $_\n";
}

so that fixes that. In case you were wondering, the . dot does not act as '1 of anything' inside a character class. It would defeat the object of the character class if it did. So it doesn't need escaping. There is a further improvement you can make to this regex:
$m='/.-';

foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)#$3$2$1#;
print " $_\n";
}

which is good practice because you are bound to want to change your delimiters at some point, and putting them inside the regex is hardcoding, and we all know that ends in tears. You can also re-use the $m variable elsewhere, which is good practice.

Did you notice the difference between what we assign to $m and what we had before?

/\-.
$m='/.-';

The difference is that the - is no longer escaped. Why not? Logic. Perl knows - is the range operator. Therefore, there must be a character to the immediate left and immediate right of it in order for it to work, for example e-f. When we assign a string to $m, the range operator is the last character and therefore has no character to the right of it, so Perl doesn't interpret it as a range operator. Try this:
$m='/-.';

and watch it fail.
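If you don't want to rely on the order of the characters, one option is to let Perl escape them for you with \Q and \E, which quote any metacharacters in the interpolated variable. This is a sketch of that alternative, not part of the original example; $date and $swapped are made-up names.

```perl
use strict;
use warnings;

my $m    = '/-.';      # the ordering that fails when interpolated bare
my $date = '8-13-96';

# \Q...\E backslashes every non-word character in $m, so the
# class becomes [\/\-\.] and the - is no longer a range operator.
(my $swapped = $date) =~ s#(\d\d?)([\Q$m\E])(\d\d?)#$3$2$1#;

print "$date $swapped\n";
```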

Something else that causes heartache is matching what you don't mean to. Try this:

@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'/16/1993',
'8/1/993',
);

$m='/.-';

foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)#$3$2$1# or print "Invalid date! ";
print " $_\n";
}

The two invalid dates at the end are let through. If you wanted to check the validity of every possible date since the start of the modern calendar then you might be better off with a database rather than a regex, but we can do some basic checking. The important point is that we know the limitations of what we are doing.

What we can do is make sure of two things; that there are three sets of digits separated by our chosen delimiters, and that the last set of digits is either two digits, eg 99, 98, 87, or four digits, eg 1999, 1998, 1987.

How can we do this? Extend the match. After the second digit match we need to match the delimiter again, then either 2 digits or four digits. How about:

$m='/.-';

foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)[$m](\d\d|\d{4})#$3$2$1$2# or print "Invalid date! ";
print " $_\n";
}

which doesn't really work out. The problem is it lets 993 through. This is because \d\d will match on the front of 993. Furthermore, we aren't fixing the year back on to the end result.

The delimiter match is also faulty. We could match / as the first delimiter, and - as the second. So, three problems to fix:

foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)\2(\d\d|\d{4})$#$3$2$1$2$4# or print "Invalid!";
print " $_\n";
}

This is now looking like a serious regex. Changes:
We are re-using the second match, which is the delimiter, further on in the regex. That's what the \2 is. This ensures the second delimiter is the same as the first one, so 5/7-98 gets rejected.
The $ on the end means end of string. Nothing allowed after that. So the regex now has to find either 2 or 4 digits at the end of the string, or it fails.
Added the match of the year ($4) to the rebuild section of the regex.
Regex can be as complex as you need. The code above can be improved still further. We could reject all years that don't begin with either 19 or 20 if they are four-digit years. The other problem with the code so far is that it would reject a perfectly valid date like 02/24/99 if there happened to be other characters after the year. Both can be fixed:
@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'/16/1993',
'8/1/993',
'3/29/1854',
'! 4/23/1972 !',
);

$m='/.-';

foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)\2(\d\d|(?:19|20)\d{2})(?:$|\D)#$3$2$1$2$4# or print "Invalid!";
print " $_\n";
}

We have now got a nested OR, and the inner OR is non-capturing for reasons of efficiency and readability. At the end we alternate between letting the regex match either an end of line or any non-digit, symbolised with \D.

We could go on. It is often very difficult to write a regex that matches anything of even minor complexity with absolute certainty. Think about IP addresses for example. What is important is to build the regex carefully, and understand what it can and cannot do. Catching anything supposedly invalid is a good idea too. Test your regex with all sorts of invalid data, and you'll understand what it can do.
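As one example of going on: the (?:$|\D) at the end of the last regex consumes the non-digit it finds, so that character is lost from the rebuilt string. The sketch below is a variation, not the method used above. It swaps that group for a negative lookahead, (?!\d), which only checks that no digit follows without eating anything. The %result hash and the shortened date list are just for the demonstration.

```perl
use strict;
use warnings;

my @dates = ('01/22/95', '8/1/993', '3/29/1854', '! 4/23/1972 !');
my $m = '/.-';
my %result;

for my $date (@dates) {
    my $fixed = $date;
    # (?!\d) succeeds only if the next character is not a digit,
    # and it leaves that character in place.
    my $ok = $fixed =~ s#(\d\d?)([$m])(\d\d?)\2(\d\d|(?:19|20)\d{2})(?!\d)#$3$2$1$2$4#;
    $result{$date} = $ok ? $fixed : 'Invalid!';
    print "$date  $result{$date}\n";
}
```

The space after 1972 now survives the substitution, while 993 and 1854 are still rejected.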
