Before Perl became a general purpose programming language, it was
PERL: the Practical Extraction and Report Language. You can find the
evolutionary remains of Perl's humble beginnings hidden away in dark
corners of the language. Formats, for example, are a Perl language
construct with a syntax unlike any other Perl construct and which
generally have functionality that can be emulated with other routines
(notably printf()
). For these and other reasons, most people first
learning Perl seem to skip over information about formats, but if you
write any reasonable number of scripts to produce reports from long
files of data, formats can be a valuable tool.
printf()
statements, but when I
gave the code to Tom Limoncelli, he sent it back to me with all of the
printf()
statements replaced with format code. Darn it,
his version was nicer (but my checkbook was balanced first).
I wanted to make the data file as easy to type as possible, so the format is very simple. The first line of the input file is the starting balance, in pennies (no need to type a decimal point and no floating point arithmetic). Each of the following lines represents a transaction: four tab separated fields giving the check number or transaction code, the date, a description, and the amount (again in pennies). Deposits and other credits to the account are represented as negative values (I seem to put money into my accounts much less frequently than I take it out). Here is a simple program to read this input file and generate a statement of the account:
format STDOUT = @<<<<< @>>>> @<<<<<<<<<<<<<<<<<<< $@######.## $@######.## $code, $date,$descript, $amt, $balance . open(INP, "transactions") || die "Can't read transactions file\n"; chop($penny_balance = <INP>); while (<INP>) { chop; ($code, $date, $descript, $penny_amt) = split(/\t/); $penny_balance -= $penny_amt; $amt = $penny_amt / 100; $balance = $penny_balance / 100; write; } close(INP); format top = . Trans: Date: Description: Amount: Balance: ====== ===== ============ ======= ======== .The first four lines in the example are a format declaration. The first line defines the format's name. When the
write()
function is
called to print a line of formatted data, it uses the format named for
the currently selected file handle. In our example, the program is
sending the report to the standard output. Note that if no format name
is specified, STDOUT
is assumed, but it is always better to name
formats explicitly, even when you are using STDOUT
.
The second line is a picture of how each output line will look. Each
group of characters beginning with an @
is an output field specifier -
everything else is a literal (e.g., the $
signs at the beginning of
the two money fields). Less-than (<) signs mean that the field should
be left justified, and greater-than (>) signs mean right justified;
the pipe symbol (|) specifies centered fields. Numeric fields are
indicated with hash marks (#) and an optional decimal point. The field
width is the number of special characters, INCLUDING the @
sign (in
the example below, the first field is six characters wide, the second
is five, etc.). This enables the picture to resemble a somewhat
abstract but perfectly aligned example of the output.
The picture's third line associates a variable with each field. When
the write()
function is called, the current value of each of the named
variables is printed using the specified format. It is clearer to read
if you to try and line up the variable specifications with their
associated field specifications on the line above.
The last line of a format declaration is always a dot on a line by itself. This terminates the format declaration.
Format declarations can appear anywhere in the program. The example above contains two format declarations: one before the code and one after. This was done to make the point; in your own code, I recommend you group all formats together near the top of the script. If there are multiple formats with the same name in the program, the one defined last will be the one that gets used.
If a format with the special name top is defined in the program, this
format will be printed at the beginning of each page of formatted
output. The special variable $=
defines the number of
lines per page; 60 is the default, but you can assign a smaller number
if you like (for example, when printing to a terminal or small
window). The special variable $-
gives the number of
lines left on the current page. You can force a new page by setting
$-
to 0. However: DO NOT mix print()
and
printf()
statements with write()
or else the
$-
variable will not be decremented correctly.
write()
usually uses the format
named for the file handle that the output is going to, you can use a
different format by assigning the alternate format's name to the
special $~
variable. The trick then, is to keep track of the number of
lines left on the page and emit a special footer format at the bottom
of the page. Here is the program logic for doing this:
format top = Trans: Date: Description: Amount: Balance: ====== ===== ============ ======= ======== . format STDOUT = @<<<<< @>>>> @<<<<<<<<<<<<<<<<<<< $@######.## $@######.## $code, $date,$descript, $amt, $balance format footer = Page @### $% . $footer_depth = 2; open(INP, "transactions") || die "Can't read transactions file\n"; chop($penny_balance = <INP>); while (<INP>) { chop; ($code, $date, $descript, $penny_amt) = split(/\t/); $penny_balance -= $penny_amt; $amt = $penny_amt / 100; $balance = $penny_balance / 100; write; if ($- == $footer_depth) { $~ = "footer"; write; $~ = "STDOUT"; } } close(INP);First we introduce a new footer format and a new global constant,
$footer_depth
, which is the number of lines that the
footer occupies on the page. The footer format in our example uses yet
another special variable, $%
, which gives the current
page number (numbered starting with 1).
Each time we emit a line with write()
, we check
$-
for the number of lines remaining on the page. When we
have exactly $footer_depth
lines left, it is time to
write the page footer. To write the footer, we simply set
$~
to the name of the footer format (footer
,
this example), issue a write()
, and then reset $~
to the usual format (STDOUT
) before getting the
next line from the transaction file. This line will appear on the next
page after the header in the usual fashion.
While this method works very cleanly, when each write()
statement only
outputs a single line, anticipating the end of page when using
multi-line formats can get tricky. Also notice that no footer will be
output on the last page. Additional code would have to be added after
the while()
loop to output additional blank lines and the footer. This
is left as an exercise to the reader.
If you ever want to change header formats for any reason - for example
if you wanted a large header on the first page, but only minimal
headers on the other pages - you can use the special $^
variable. This variable behaves like $~
, but selects the
header format instead. Never set $^
(or $~
for that matter) to a non-existent format because this will cause your
program to exit with a fatal error at run time. If you want a null
header, never define the top format at all, or set $^
to
an empty format.
Second, the format declaration defines multiple lines of output. This also is perfectly legal and each line can have zero, one, or more field declarations in it. The general pattern for multi-line format declarations is one line of field descriptions, followed by a line containing the variables associated with those fields, followed by another line of field descriptions, etc.
The next example shows an interesting use of multi-line formats. For
purposes of this example program, we are assuming a function called
mailparse()
which processes email messages one at a time
from the standard input. For each message, mailparse()
puts all header information in a global associative array,
%header
, indexed by the header tag (e.g.,
From
, To
) and all of the body lines in a
global scalar variable called $body
. The output is shown
below the example. By the way, my editor never sent me that message: I
made it up. Like all writers, I am always early for all
deadlines. Well, that last part was a lie, but I really did make up
the email message.
format message = Date: @<<<<<<<<<<<<<<<<<<<<<<<<< ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $header{`Date'}, $body From: @<<<<<<<<<<<<<<<<<<<<<<<<< ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $header{`From'}, $body To : @<<<<<<<<<<<<<<<<<<<<<<<<< ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $header{`To'}, $body Subj: ^<<<<<<<<<<<<<<<<<<<<<<<<< ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $header{`Subject'}, $body ~~ ^<<<<<<<<<<<<<<<<<<<<<<<<< ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $header{`Subject'}, $body . $~ = "message"; while (<STDIN>) { &mailparse(); write; } Date: Tue, 11 Apr 1995 16:39:06 Hal-- What's the status of your Perl From: [email protected] (Tina M. Darmo article for the upcoming issue of ;login;? To : [email protected] (Hal Pom Rob needs to review the article before Subj: Your ;login: article is giving it to Carolyn for typesetting. *OVERDUE* Please send email soon-- the fate of the universe is at stake. --TinaThere are a number of new constructs in the message format in this example. First are the fields that begin with
^
instead of @
. For these
fields, Perl outputs as much text as will fit in the field and then
removes that text from the string variable. By stacking several ^
fields together using the same long string, you can output that string
as a block of text with a ragged right margin, as shown in the output,
with both the body of the message and the Subj: line. The special $:
variable (last special variable in this column, I promise) is the set
of characters on which Perl can legally break the line; the
default value for $:
is \n -
(newline, space, or hyphen).
The special ~~
marker on the last line means "keep
outputting lines until all variables ($body
and
$header{Subject}
in this case) are exhausted." This is
useful for situations where you are not sure how long your text may
run, but you want to be able to output all of the information. You can
put the ~~
anywhere on the line, but it is best to put it
in a very visible location (the beginning of the line is almost always
best).
printf()
blocks that
would have been much easier to write and much more readable if the
developer had used formats instead. If you need to quickly produce
reports, or output large amounts of tabulated data, formats are an
extremely effective tool.
Reproduced from ;login: Vol. 20 No. 3, June 1995.
Back to Table of Contents
11/27/96ah