Wednesday, August 6, 2008

Files


Opening
Perl is very good at handling files. Create, in your perl scripts directory c:\scripts, a file called stuff.txt. Copy the following into it :

The Main Perl Newsgroup:comp.lang.perl.misc
The Perl FAQ:http://www.perl.com/faq/
Where to download perl:http://www.activestate.com/

Now, to open and do things with this file. First, we must open the file and assign it to a filehandle. All operations will be done on the file via the filehandle. Earlier, we used <STDIN> as a filehandle - we read from it.
$stuff="c:\scripts\stuff.txt";

open STUFF, $stuff;

while (<STUFF>) {
print "Line number $. is : $_";
}

What this script does is fail. What it should do is open the file defined in $stuff , assign it to the filehandle STUFF and then, while there are still lines left in the file, print the line number $. and the current line.



An unforgivable error
It fails. That's not so bad, everything fails sometimes. What is unforgivable is NOT CHECKING THE ERROR CODE !
This is a better version:

open STUFF, $stuff or die "Cannot open $stuff for read :$!";

If the open operation fails, the or means that the code on the RHS (right hand side) is evaluated. Perl dies. This means it exits the script and tells you the line number at which it died. The reason for the failure is written up in $! , which is set by the failed open rather than by die itself. Just because $! contains useful information doesn't mean to say it is automagically printed, in true perl fashion; you have to put it in the message yourself. Usually you will wish to avail yourself of the information inside as it is of great help when working out why something is not going according to plan. The moral of the chapter is:
Always check your return codes !
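
A minimal sketch of that moral, assuming a filename that does not exist (the name below is made up for the purpose). Instead of dying, it catches the failure so you can see what $! holds. The three-argument open and the lexical filehandle $fh are my choice for this sketch, not something the tutorial has used so far.

```perl
use strict;
use warnings;

# Hypothetical filename, chosen because it should not exist.
my $missing = "no_such_file_for_this_demo.txt";

my $error = '';
if (open my $fh, '<', $missing) {
    close $fh;
} else {
    # $! holds the operating system's reason for the failed open.
    $error = "$!";
    print "Cannot open $missing for read : $error\n";
}
```

The exact wording of the message depends on your operating system, which is why it is worth printing rather than guessing.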


\\ or / in pathnames -- your choice
The problem should now be apparent from the error message. The backslashes, being escape characters, have vanished from the filename. There are two ways to fix this:

Escape the backslashes, like so $stuff="c:\\scripts\\stuff.txt";
Convert backslashes into forward slashes : $stuff="c:/scripts/stuff.txt";
The forward slashes are the preferred option, even under Win32, because you can then port the script direct to Unix or other platforms (assuming you don't use drive letters), and it is less typing. If you wish to use Perl to start external processes then you must use the \\ method, but this variable will be used only in a Perl program, not as a parameter to start an external program. Changing the $stuff variable results in a working script. Always check your return codes !
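
A small sketch of what the different quoting styles actually produce, so you can see the backslashes disappear. The variable names are made up for illustration, and the single-quoted form is an extra option that the list above does not mention.

```perl
use strict;
no warnings;   # with warnings on, Perl complains about the unrecognized escape

my $broken  = "c:\scripts\stuff.txt";     # the backslashes are swallowed
my $escaped = "c:\\scripts\\stuff.txt";   # each doubled backslash yields one
my $forward = "c:/scripts/stuff.txt";     # nothing to escape at all
my $single  = 'c:\scripts\stuff.txt';     # single quotes leave these backslashes alone

print "broken  : $broken\n";
print "escaped : $escaped\n";
print "forward : $forward\n";
print "single  : $single\n";
```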

Reading a file
$stuff="c:/scripts/stuff.txt";

open STUFF, $stuff or die "Cannot open $stuff for read :$!";

while (<STUFF>) {
print "Line $. is : $_";
}

A little more detail on what is happening here. The file is opened for read. You can append and write too. You don't have to use a variable, but I always do because it is then easy to change later on and easy to insert into the or die section. Hardcoding things is not the best way to write a maintainable and flexible program. Just ask the Year 2000 people about code that lived a little longer than the authors imagined :-).
open STUFF, "c:/scripts/stuff.txt" or die "Cannot open stuff.txt for read :$!";

is just as good but more work if you want to change anything.
The line input operator (that's the angle brackets <>) reads from the beginning of the file up until and including the first newline. The read data goes into $_ , and you can do what you want with it there. On the next iteration of the loop data is read from where the last read left off, up to the next newline. And so on until there is no more data. When that happens the condition is false and the loop terminates. That's the default behaviour, but we can change this.

This means that you can open a 200Mb file in perl and run through it without having to load the entire file into memory. 200Mb of memory is quite a bit. If you really want to load the entire 200Mb file into one variable, Perl lets you. Limits are not the Perl Way.

The special variable $. is the current line number, starting at 1.
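
Here is a sketch of the same loop that runs without c:/scripts/stuff.txt existing. It assumes an in-memory filehandle (opening a reference to a string) as a stand-in for the real file, and uses a lexical filehandle $fh rather than STUFF; the @numbers array is only there to record what $. held on each pass.

```perl
use strict;
use warnings;

# Stand-in for stuff.txt so the example runs anywhere.
my $text = "The Main Perl Newsgroup:comp.lang.perl.misc\n"
         . "The Perl FAQ:http://www.perl.com/faq/\n"
         . "Where to download perl:http://www.activestate.com/\n";

open my $fh, '<', \$text or die "Cannot open in-memory file : $!";

my @numbers;
while (<$fh>) {
    push @numbers, $.;           # remember the line number for this pass
    print "Line $. is : $_";     # $_ still ends with its newline
}
close $fh;
```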

As usual, there is a quicker way to do the previous program.

$STUFF="c:/scripts/stuff.txt";

open STUFF or die "Cannot open $STUFF for read :$!";

while (<STUFF>) {
print "Line $. is : $_";
}

This saves a little bit of typing, but does tie your filehandle to the variable name. In fact, that entire program could be compressed further, but that's for later.

If you are really into shortness, try this:

$STUFF="c:/scripts/stuff.txt";

open STUFF or die "Cannot open $STUFF for read :$!";

print "Line $. is : $_" while (<STUFF>);





Writing to a File


A simple write
$out="c:/scripts/out.txt";

open OUT, ">$out" or die "Cannot open $out for write :$!";

for $i (1..10) {
print OUT "$i : The time is now : ",scalar(localtime),"\n";
}

Note the addition of > to the filename. This opens it for writing. If we want to print to the file we now just specify the filehandle name. You print to the filehandle, which is a gateway to the file.

Filehandles don't have to be capitalised, but it is wise. All Perl functions are lowercase, and Perl is case-sensitive. So if you choose uppercase names they are guaranteed not to conflict with current or future function words.

And a neat way to grab the date sneaked in there too. You should be aware that writing to a file overwrites the file. It does not append data! However, you may append:


Appending
$out="c:/scripts/out.txt";

&printfile;

open OUT, ">>$out" or die "Cannot open $out for append :$!";

print OUT 'The time is now : ',scalar(localtime),"\n";

close OUT;

&printfile;

sub printfile {
open IN, $out or die "Cannot open $out for read :$!";
while (<IN>) {
print;
}
close IN;
}

This script demonstrates subroutines again, and how to append to a file, that is write additional data at the end. The close function is introduced here. This, well, closes a filehandle. You don't have to close a filehandle - just leave it open until the script finishes, or the next open command to the same filehandle will close it for you.
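
To see the difference between > and >> side by side, here is a sketch that writes to a scratch file twice and then appends to it twice. It assumes a temporary directory from File::Temp instead of c:/scripts, and the helper count_lines is made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);   # scratch directory, removed on exit
my $out = "$dir/out.txt";

# Hypothetical helper: how many lines does the file hold right now?
sub count_lines {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file for read : $!";
    my @lines = <$in>;
    close $in;
    return scalar @lines;
}

# Opening with > twice: the second open wipes what the first wrote.
for my $pass (1..2) {
    open my $fh, '>', $out or die "Cannot open $out for write : $!";
    print $fh "written on pass $pass\n";
    close $fh;
}
my $after_write = count_lines($out);

# Opening with >> twice: each pass adds a line at the end.
for my $pass (1..2) {
    open my $fh, '>>', $out or die "Cannot open $out for append : $!";
    print $fh "appended on pass $pass\n";
    close $fh;
}
my $after_append = count_lines($out);

print "After two writes : $after_write line(s)\n";
print "After two appends : $after_append line(s)\n";
```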

@ARGV: Command Line Arguments
Perl has a special array called @ARGV . This is the list of arguments passed along with the script name on the command line. Run the following perl script as:

perl myscript.pl hello world how are you


foreach (@ARGV) {
print "$_\n";
}

Another useful way to get parameters into a program -- this time without user input. The relevance to filehandles is as follows. Run the following perl script as:
perl myscript.pl stuff.txt out.txt

while (<>) {
print;
}

Short and sweet ? If you don't specify anything in the angle brackets, whatever is in @ARGV is used instead. And after it finishes with the first file, it will carry on with the next and so on. You'll need to remove non-file elements from @ARGV before you use this.
It can be shorter still:

perl myscript.pl stuff.txt out.txt

print while <>;

Read it right to left. It is possible to shorten it even further !
perl myscript.pl stuff.txt out.txt

print <>;

This takes a little explanation. As you know, many things in Perl, including filehandles, can be evaluated in list or scalar context. The result that is returned depends on the context.
If a filehandle is evaluated in scalar context, it returns the next line of whatever file it is reading from (the first line, if nothing has been read yet). If it is evaluated in list context, it returns a list, the elements of which are all the remaining lines of the files it is reading from.

The print function is a list operator, and therefore evaluates everything it is given in list context. As the filehandle is evaluated in list context, it is given a list !
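
A short sketch of the two contexts, assuming an in-memory filehandle with three made-up lines in place of a real file:

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";
open my $fh, '<', \$text or die "Cannot open in-memory file : $!";

my $first = <$fh>;   # scalar context: just the next line
my @rest  = <$fh>;   # list context: every line that is left

close $fh;

print "Scalar context gave : $first";
print "List context gave ", scalar(@rest), " more lines\n";
```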

Who said short is sweet? Not my girlfriend, but that's another story. The shortest scripts are not usually the easiest to understand, and not even always the quickest. Aside from knowing what you want to achieve with the program from a functional point of view, you should also know whether you are coding for maximum performance, easy maintenance or whatever -- because chances are those goals may be to some extent mutually exclusive.


Modifying a File with $^I
One of the most frequent Perl tasks is to open a file, make some changes and write it back to the original filename. You already have enough knowledge to do this. The steps would be:

Make a backup copy of the file
Open the file for read
Open a new temporary file for write
Go through the read file, and write it and any changes to the temp file
When finished, close both files
Delete the original file
Rename the temp file to the original filename
If you have managed to get this far and assiduously worked through the examples, the above will be child's play. Play if you want, but there is a Better Way.
Make sure you have data in c:\scripts\out.txt then run this:

@ARGV="c:/scripts/out.txt";

$^I=".bk"; # let the magic begin

while (<>) {
tr/A-Z/a-z/; # another new function sneaked in
print; # this goes to the temp filehandle, ARGVOUT,
# not STDOUT as usual, so don't mess with it !
}

So, what's happening? First, we load up @ARGV with the name of a file. It doesn't matter how @ARGV is loaded. We could just as well have passed the filename on the command line.
The $^I is a special variable. You knew that just by looking at it. Its name is the Inplace Edit variable, and when it has a value the effects are:

The name of the file to be in-place edited is taken from the first element of @ARGV. In this case, that is c:/scripts/out.txt. The file is renamed to its existing name plus the value of $^I, ie out.txt.bk.
The file is read as usual by the diamond operator <>, placing a line at a time into $_.
A new filehandle is opened, called ARGVOUT, and no prizes for guessing it is opened on a file called out.txt. The original out.txt is renamed.
The print prints automatically to ARGVOUT, not STDOUT as it would usually.
At the end of the operation you have neatly edited the file and made a backup. If you don't want a backup, assign a null string to $^I but don't go crying on any mailing lists if you lose data.
The usual method of in-place editing would involve just printing everything back where it came from until your regex finds whatever needs changing. You could of course slurp the whole file into memory and play with it there, which could be a lot easier but if you are dealing with files of more than a few megabytes this is probably not a feasible approach.

Now take a look at out.txt . Notice how all capital letters have been transliterated into lowercase. This is the tr operator at work, which is more efficient than regex for changing single characters. But that's only a small part of the tr function's value to the world. More later.

You should also have an out.txt.bk file. And finally, notice the way @ARGV has been created. You don't have to create it from the command line arguments -- it can be treated like an ordinary array, for that is what it is.
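
For comparison, the manual steps listed earlier in this section could be sketched roughly as below. This is only one way to do it, under some assumptions: it works in a temporary directory rather than c:/scripts, seeds out.txt with one made-up line, and uses File::Copy for the backup. The $result variable at the end is just there to show what the file holds afterwards.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # scratch directory, removed on exit
my $file = "$dir/out.txt";
my $temp = "$file.tmp";             # hypothetical name for the temp file

# Create some data to edit.
open my $seed, '>', $file or die "Cannot open $file for write : $!";
print $seed "The Time Is Now\n";
close $seed;

# Make a backup copy of the file.
copy($file, "$file.bk") or die "Cannot copy $file : $!";

# Open the file for read, and a new temporary file for write.
open my $in,  '<', $file or die "Cannot open $file for read : $!";
open my $tmp, '>', $temp or die "Cannot open $temp for write : $!";

# Go through the read file, writing it and any changes to the temp file.
while (<$in>) {
    tr/A-Z/a-z/;
    print $tmp $_;
}

# When finished, close both files.
close $in;
close $tmp or die "Cannot close $temp : $!";

# Delete the original, then rename the temp file to the original name.
unlink $file        or die "Cannot delete $file : $!";
rename $temp, $file or die "Cannot rename $temp : $!";

# Read the result back to see what happened.
open my $check, '<', $file or die "Cannot open $file for read : $!";
my $result = <$check>;
close $check;
print "out.txt now holds : $result";
```

That is a lot of typing next to the $^I version, which is rather the point.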




$/ -- Changing what is read into $_
On a different note, what if your input file doesn't look like this:
Beer
Wine
Pizza
Catfood

which is nicely delimited with a newline each time, but like this:
shorts
t-shirt
blouse

pizza
beer
wine
catfood

Viz
Private Eye
The Independent
Byte

toothpaste
soap
towel

which is delimited by TWO newlines, not one. You don't have to save the above as shop.txt, but if you don't, the examples will be difficult to follow.
Now, if you want each set of items as elements in an array you'll have to do something like this:

$SHOP="shop.txt";
$x=0;

open SHOP or die "Can't open $SHOP for read: $!\n";

while (<SHOP>) {
if (/^\n/) { # does line begin with newline ?
$x++; # if so, increment $x. Rest of if statement not executed.
} else {
$list[$x].=$_; # glue $_ on the end of whatever is in $list[$x], using a .
}
}

foreach (@list) {
print "Items are:\n$_\n\n";
}

which works, but there is a much easier way to do it. You knew I was going to say that.
$SHOP="shop.txt";
$/="\n\n";

open SHOP or die "Can't open $SHOP for read: $!\n";

while (<SHOP>) {
push (@list, $_);
}

foreach (@list) {
print "Items are:\n$_\n\n";
}

The $/ variable is a special variable (it even looks special). It is the Default Input Record Separator. Remember the operation of the angle brackets being to read a file in up until the next newline ? Time to come clean. What the angle brackets actually do is read up until whatever $/ is set to. It is set to a newline by default.
So if we set it to two newlines, as above, then it reads up until it finds two consecutive newlines, then puts the data into $_ . This makes the program a lot shorter and quicker. You can set $/ to just about anything, not just a newline. If you want to hack this list for example:

Tea:Beer:Wine:Pizza:Catfood:Coffee:Chicken:Salmon:Icecream
you could just leave $/ as a newline and slurp it into memory in one go, but imagine the above items are a list of clothes that your girlfriend wants to buy or a list of clothes your boyfriend should have thrown away by now. Either is going to be a really big file, and you might not want to read it all into memory in one go. So set $/=":"; and all will be well. There are also read and seek functions, but they aren't covered here. Those are useful for files where you read in a precise number of bytes.
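
A sketch of the colon-separated case, assuming the list sits in a string read through an in-memory filehandle rather than in a big file. The use of local and chomp is my addition: local puts $/ back to what it was when the block ends, and chomp strips whatever $/ currently is from the end of each record.

```perl
use strict;
use warnings;

my $data = "Tea:Beer:Wine:Pizza:Catfood:Coffee:Chicken:Salmon:Icecream";
open my $fh, '<', \$data or die "Cannot open in-memory file : $!";

my @items;
{
    local $/ = ":";               # records now end at a colon
    while (my $item = <$fh>) {
        chomp $item;              # removes the trailing colon, if there is one
        push @items, $item;
    }
}
close $fh;

print scalar(@items), " items, from $items[0] to $items[-1]\n";
```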
We'll go back to the last example for a moment. It is useful to know how to read just one line (well, up to $/ ) at a time:

$SHOP="shop.txt";
$/="\n\n";

open SHOP or die "Can't open $SHOP for read: $!\n";

$clothes=<SHOP>; # everything up until the first occurrence of $/ into $clothes

$food=<SHOP>; # everything from first occurrence of $/ to the second into $food

print "We need...\n",$clothes,"...and\n",$food;

And now we know that, there is an even quicker way to achieve the aim of the original program :

$SHOP="shop.txt";
$/="\n\n";

open SHOP or die "Can't open $SHOP for read: $!\n";

@list=<SHOP>; # dumps *all* of $SHOP into @list, not just one line.

foreach (@list) {
print "Items are:\n$_\n\n";
}

and you don't need to keep it all :
@list[0..2]=<SHOP>; # still reads everything, but keeps only the first three records

We haven't mentioned list context for a while. Whether the line input operator <> returns a single value or a list depends on the context you use it in. When you supply @xxxxx then this must be a list. If you supply $xxxxx then that's a scalar variable. You can force it into list context by using parens.
The two lines below are provided so you can paste them into the above program. They demonstrate how parens force list context. Remember to replace the foreach with something that prints the variables.

($first, $second) = <SHOP>;
$first, $second = <SHOP>;
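
If you would rather not paste and edit, here is a self-contained sketch of the same comparison. It assumes a cut-down version of the shop data held in a string, and the variable names are made up so that both results can be printed at the end.

```perl
use strict;
use warnings;
no warnings 'void';   # the second assignment deliberately throws a value away

my $text = "shorts\nt-shirt\n\npizza\nbeer\n\nViz\nByte\n\n";
$/ = "\n\n";

open my $fh, '<', \$text or die "Cannot open in-memory file : $!";
my ($list_a, $list_b);
($list_a, $list_b) = <$fh>;      # parens: list context, first two records kept
close $fh;

open $fh, '<', \$text or die "Cannot open in-memory file : $!";
my ($scalar_a, $scalar_b);
$scalar_a, $scalar_b = <$fh>;    # no parens: only $scalar_b is assigned
close $fh;

print "With parens, second is :\n$list_b";
print "Without parens, first is ",
      (defined $scalar_a ? "defined" : "undefined"), "\n";
print "Without parens, second is :\n$scalar_b";
```

Without the parens the comma just separates two expressions, so the first variable is left untouched and the second gets a single record.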




HERE Docs
The problem:
print "This is a long line of text which might be too long to fit on just one line\n";
print "and I was right, it was too long to fit on one line. In fact, it looks like it\n";
print "might very well take up to FOUR, yes FOUR lines to print. That's four print\n";
print "statements, which takes up even more room. But wait! I'm wrong! It will take\n";
print "FIVE lines to print this statement! Or is that six lines? I'm not sure....\n";

The solution:
$var='variable interpolated';

print <<PRT;
This is a long line of text which might be too long to fit on just one line
and I was right, it was too long to fit on one line. In fact, it looks like
it might very well take up to FOUR, yes FOUR lines to print.

That's four print statements, which takes up even more room. But wait! I'm
wrong! It will take FIVE lines to print this statement! Or maybe six lines?
I'm not sure....but anyway, just to prove this can be $var.
PRT

That's called a 'here' document and you don't need to use PRT, you can use whatever you like within reason. You don't need to put in explicit newlines, although if you do they perform as usual. Now you know about here docs you can stop wearing the print function out by calling it every couple of lines. Here docs are not just for printing to files; use them anywhere you'd normally put more than one print statement.
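
One more sketch on here docs, showing that the quotes around the terminator decide whether variables are interpolated. The text above only uses the bare PRT form, which behaves like the double-quoted one; the single-quoted form and the variable names are my additions.

```perl
use strict;
use warnings;

my $var = 'variable interpolated';

# Double quotes (or no quotes) around the terminator: $var is filled in.
my $double = <<"PRT";
Double-quoted here doc: this can be $var.
PRT

# Single quotes around the terminator: the text is taken literally.
my $single = <<'PRT';
Single-quoted here doc: this stays as $var.
PRT

print $double, $single;
```

A here doc is just a string, so it can be assigned to a variable as well as printed.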
