Wednesday, August 6, 2008

Split and Join

Splitting
While you are in the regex mood, a quick look at split and join. Destruction is always easier (just ask your car mechanic), so let's start with split.
$_='Piper:PA-28:Archer:OO-ROB:Antwerp';

@details=split /:/, $_;

foreach (@details) {
print "$_\n";
}

Here split is given two arguments. The first is a regex specifying what to split on. The second is the string to split. Actually, I could leave $_ out because, as usual, it is the default if nothing is specified.
The result can be assigned either to a scalar variable or to a list such as an array (or hash, but at this time 'hash' to you means what you think the Dutch do or a silly drinking event spoilt by some running). If it's a scalar variable you get the number of elements the split has splut. Should that be 'the split has splittered' or 'the split has splat'? Hmmm. Probably 'the split has split'. You know what I mean. I think I just generated a Fatal Error in English.dll. Whoops. In any case, splitting to a scalar variable is not always a Good Thing, as we'll see later.

If the assignment is to an array, then as you can see in the above example the array is created with the relevant elements in order. You can also assign to a list of scalars, for example:

$_='Piper:PA-28:Archer:OO-ROB:Antwerp';

($maker,$model,$name,$reg,$location) = split /:/, $_;
(@aircraft[0..1],$aname,@regdetails) = split /:/, $_;

$number=split /:/ ; # not bothering with the $_ at the end, as it is the default

print "Using the first 'split'\n";
print "$reg is a $maker $model $name based in $location\n";
print "There are $number details available on this aircraft\n\n";

print "Using the second 'split'\n";
print "You can find $regdetails[0], an $aircraft[1], $regdetails[1]\n";


This demonstrates that a list can be a list of scalar variables (which is basically what an array is anyway), and that you can easily see how many elements the expression can be split into.
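
As a small sketch of the same idea, the list on the left does not have to name every field. The variable names and the use of strict and warnings below are my own additions, not part of the original example: undef acts as a placeholder for fields to throw away, and a list slice picks fields by position.

```perl
use strict;
use warnings;

my $record = 'Piper:PA-28:Archer:OO-ROB:Antwerp';

# undef on the left-hand side is a placeholder: that field is discarded.
my (undef, $model, undef, $reg) = split /:/, $record;

# A list slice picks fields by position; here the first and the last.
my ($maker, $location) = (split /:/, $record)[0, -1];

# Counting the fields: fill an array, then read it in scalar context.
my @fields = split /:/, $record;
my $count  = @fields;

print "$reg is a $model, built by $maker, kept at $location ($count fields)\n";
```

Going through an array to get the count is one way to keep the list of fields and their number from a single split.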
The example below adds a third parameter to split: the maximum number of elements to return. The last element holds whatever is left of the string, delimiters and all. If you don't want the extra stuff at the end, pop it.

$_='Piper:PA-28:Archer:OO-ROB:Antwerp';

@details=split /:/, $_, 3;

foreach (@details) {
print "$_\n";
}
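
To make the 'extra stuff at the end' visible, here is a minimal sketch using the same record. The variable names are mine. It shows what the third element contains, then two ways of getting rid of it: pop, and a slice of an unlimited split.

```perl
use strict;
use warnings;

my $record = 'Piper:PA-28:Archer:OO-ROB:Antwerp';

# With a limit of 3, splitting stops after two delimiters;
# the third element keeps the rest of the string, colons included.
my @details = split /:/, $record, 3;
print scalar(@details), " elements, last one is '$details[-1]'\n";

# To keep just the first two fields, drop the remainder with pop ...
pop @details;
print "After pop: @details\n";

# ... or take a slice of an unlimited split instead.
my @first_two = (split /:/, $record)[0, 1];
print "Slice: @first_two\n";
```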

In the example below we split on whitespace. Whitespace, in Perl terms, is a space, tab, newline, formfeed or carriage return. Instead of spelling out a space plus \t\n\f\r every time, you can simply use \s, or the negated version \S, which means anything except whitespace. Think of whitespace as anything you know is there, but you can't see.
The whitespace split is specially optimised for speed. I've used spaces, double spaces, a tab and a newline in the list below. Also note the +, which means one or more of the preceding character, so it will split on any run of whitespace. And I think the final split is useful to know: an empty pattern, //, breaks a string into its individual characters. The split function does not return the delimiter, so in this case the whitespace will not be returned.

$_='Piper PA-28 Archer OO-ROB
Antwerp';

@details=split /\s+/, $_;

foreach (@details) {
print "$_\n";
}

@chars=split //, $details[0];

foreach $char (@chars) {
print "$char !\n";
}
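
Two details of split are worth a quick sketch here; the sample string and variable names are my own. First, if the string starts with whitespace, /\s+/ gives you an empty first element, whereas the special single-space form ' ' skips leading whitespace. Second, split only throws the delimiter away when the pattern has no capturing parentheses.

```perl
use strict;
use warnings;

my $line = "  Piper  PA-28\tArcher OO-ROB\nAntwerp";

# /\s+/ treats the leading whitespace as a delimiter too,
# so the first element returned is an empty string.
my @regex_fields = split /\s+/, $line;

# A single-space string is a special case: leading whitespace is skipped,
# then the string is split on runs of whitespace.
my @space_fields = split ' ', $line;

# Capturing parentheses in the pattern make split return the delimiters as well.
my @with_delims = split /(-)/, 'OO-ROB';

print scalar(@regex_fields), " fields with /\\s+/\n";
print scalar(@space_fields), " fields with ' '\n";
print join('|', @with_delims), "\n";
```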



A very FAQ
The following question has come up at least three times in the Perl-Win32-Users mailing list. Can you answer it?
"My data is delimited by |, for example:
name|age|sex|height|
Why doesn't
@array=split /|/, $line;
work ?"

Why indeed. If you don't already know the answer, some simple troubleshooting steps can be applied. First, create a sample program and run it.
$line='name|age|sex|height';

@array=split /|/,$line;

foreach (@array) { print "$_\n" }

The effect is to split the line into individual characters, and the | characters come back along with everything else. As it is the delimiter, | should be discarded, not returned.
At this point you should be thinking 'metacharacter'. A little research (looking at the documentation) will reveal that | is indeed a metacharacter, which means 'or', when inside a regex. So, in effect, the regex /|/ means 'nothing, or nothing'. The split is therefore performed on 'nothings', and there are 'nothings' in between each character. The solution is easy: escape it, as in /\|/.

$line='name|age|sex|height';

@array=split /\|/,$line;

foreach (@array) { print "$_\n" }
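
If the delimiter lives in a variable, you can't type the backslash yourself. A minimal sketch, assuming a hypothetical $delim variable: \Q...\E escapes any metacharacters in what it surrounds. The sample line keeps the trailing | from the original question, which shows one more thing: split drops trailing empty fields unless you give it a negative limit.

```perl
use strict;
use warnings;

my $delim = '|';                      # hypothetical variable holding the delimiter
my $line  = 'name|age|sex|height|';   # note the trailing delimiter

# \Q...\E (like the quotemeta function) escapes regex metacharacters
# in the interpolated variable, so | is matched literally.
my @fields = split /\Q$delim\E/, $line;

# Trailing empty fields are dropped by default; a negative limit keeps them.
my @all_fields = split /\Q$delim\E/, $line, -1;

print scalar(@fields), " fields by default\n";
print scalar(@all_fields), " fields with a limit of -1\n";
```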

So that's the fun stuff, destruction. Now to put it back together again with join.


What Humpty Dumpty needs: Join
$w1="Mission critical ?";
$w2="Internet ready modems !";
$w3="J(insert your cool phrase here)"; # anything prefixed by 'J' is now cool ;-)
$w4="y2k compatible.";
$w5="We know the Web.";
$w6="...the leading product in an emerging market.";

$cool=join ' ', $w1,$w2,$w3,$w4,$w5,$w6;

print $cool;

Join takes a 'glue' string, which is not a regular expression. It can be a scalar variable, however. In this case it is a space. Then it takes a list, which can be a list of scalar variables, an array or whatever, as long as it's a list. And you can see what the result is. You could assign it to an array, but you'd end up with everything in the first element of the array.
The example below adds an array into the list, and demonstrates use of a variable as the delimiter.

$w1="Mission critical ?";
$w2="Internet ready modems !";
$w3="J(insert your cool phrase here)"; # anything prefixed by 'J' is now cool ;-)
$w4="y2k approved, tested and safe !";
$w5="We know the Web.";
$w6="...the leading product in an emerging market.";
@morecool=("networkable","compatible");

$sep=" ";

$cool=join $sep, $w1,$w2,$w3,@morecool,$w4,$w5,$w6;

print $cool;
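
To tie the two halves together, here is a rough sketch of a round trip: split a record, change a field, and join it back up. The variable names and the new location are made up for illustration. It also shows that the glue needs no escaping, and what happens when join's result lands in an array.

```perl
use strict;
use warnings;

my $record = 'Piper:PA-28:Archer:OO-ROB:Antwerp';

# Take the record apart, change one field, and glue it back together.
my @fields = split /:/, $record;
$fields[4] = 'Ostend';                 # illustrative new location
my $updated = join ':', @fields;
print "$updated\n";

# The glue is a plain string, so regex metacharacters need no escaping.
my $piped = join '|', @fields;
print "$piped\n";

# join returns a single string; assigned to an array it fills one element only.
my @oops = join ' ', @fields;
print scalar(@oops), " element in \@oops\n";
```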
