return to first page linux journal archive
keywordscontents

Linux Programing Hints

Prototyping Algorithms in Perl

Perl is often considered a scripting language for systems administrators. Jim demonstrates that it is useful to applications and scientific programmers as well--as a prototyping tool.

by Jim Shapiro

If you are like many Linux users you may have heard of Perl, but have been reluctant to learn another language. This was my situation several months ago. A friend suggested I give Perl a try. Since I already knew C, Perl was a snap to learn. I soon found myself doing all sorts of text reformatting using Perl. My friends and coworkers were impressed, but skeptical. Could Perl cut the mustard on big files where a ton of data had to be read, massaged and written?

Their skepticism subsided, however, when I wrote a Perl program (I prefer to call them programs rather than scripts just to separate them from shell scripts) to load over three and a half million lines of US Postal data. Now I actually teach Perl (but I digress).

In this article I would like to suggest a use for Perl which is often overlooked--Perl as a prototyping tool. Most C programmers spend a fair amount of time managing memory, and I am no exception. Memory management is a necessary function, especially if you want to keep your C programs tight--not using more memory than necessary--and well behaved--not crashing with the resulting core dumps. The problem with managing memory yourself is that it can divert attention from the program's purpose, which is typically to get an algorithm running.

With Perl, not only do you get solid memory handling routines, you get them in an interpreted/compiled environment and, thanks to the Free Software Foundation, they will not cost you a penny. In fact, if you have Linux you probably already have Perl as well. Let me illustrate how Perl can be used as a prototyping tool with two examples, a simple Monte Carlo calculation and a more substantial geometric problem.

A Monte Carlo Estimate Of pi

Monte Carlo techniques use probabilistic methods to make estimates. Typically, one or more random numbers are substituted into a function and the resulting value is tested for validity. The program keeps track of both the number of satisfactory tests and the total number of tests. The result is the ratio of satisfactory to total tests and this ratio is monitored as the number of tests is increased.


#!/usr/bin/perl

# Monte Carlo calculation of pi

srand(time | $$);

for($i = 1, $inside = 0.0; $i <= 1000000; $i++)
   {
   $x = rand;
   $y = rand;

   $inside++ if $x * $x + $y * $y < 1.0;

   printf "After %7d points pi ~= %9.7lf\n", $i, 4.0 * $inside / $i if
      ($i % 10000) == 0;
   }
Figure 1. A monte Carlo Calculation of pi


Let us get our feet wet with a short and simple Perl program. Figure 1 is a program which estimates pi using a Monte Carlo calculation. Consider a circle inscribed inside a square of side two centered at the origin. The area of the square is four and, since the circle has a radius of one, its area is pi. The ratio of the area of the circle to that of the square is thus pi / 4, and this is also the chance that a random point within the square is also inside the circle. This program repeats a loop a million times, each time calling Perl's rand function twice, once for x and once for y. The distance of the x, y pair from the origin is calculated and, if it is less than one, it is counted as a successful test. (Technically, this program uses only the parts of the circle and the square in the first quadrant for simplicity, but, by symmetry, the ratio of areas is the same as if the whole of both figures had been used.)

A Linux tip--if you give your Perl programs a unique extension, like .pl, it is easy to make them stand out in ls type listings. Adding the line:

.pl 01;33 # perl programs (yellow)
to the file /etc/DIR_COLORS will make the names of all files with .pl extensions in listings appear in yellow. See the man pages for ls and the file /etc/DIR_COLORS for details.

Note, first of all, how short the Perl program is. Also, note how much it resembles a C program, especially the for loop and the printf function. There are important differences, however. Variables, like $i, $x, etc. are used when needed without a specific declaration. It is not even clear to the casual observer what the types of the variables are. And what is with these if statements after the statements they control? And rand is used like a function, but there are no parentheses--maybe it is a keyword.

All of these differences are features of Perl. Perl keeps track of your variables for you. The variables are really strings internally, but they get converted to doubles when needed such as when the distance of the random point from the origin is calculated and compared to one. You needn't concern yourself with any of these details, however. The if test at the end of a line is a handy equivalent to C's if block (the C style is OK in Perl also) but a lot shorter. You will find yourself using Perl's if style all the time once you get the hang of it. Perl's if has three relatives (unless, while and until) which also can be used before a block or at the end of a series of comma separated statements. Finally, rand really is a function--in this case the parentheses are optional. This is the case with many functions, including the printf at the bottom of the for loop.

If you run the program in Figure 1, you will get an estimate for pi after every 10,000 tests. I call the program m_c.pl and run it by typing its name. The first line is the path to the Perl program on my system. Change this path if yours is different. You can also test a program for syntax errors with a -c command line switch, i.e.

perl -c m_c.pl

Perl will also provide warnings, such as when you assign to a variable and never use the variable again. Use the -w command line switch to turn on warnings. This is a handy way to uncover spelling errors which can easily crop up in an environment without explicit variable declarations. I usually use both tests simultaneously during development, i.e.

perl -cw m_c.pl

Point In Polygon

Next let us consider a more difficult problem. How do we test whether a point is inside a polygon? This problem is not as simple as it may first appear, especially when you take into account the special cases--such as when the point is on the boundary of the polygon. Is that inside, outside, or do you want to count it as a separate case--on the boundary? Let's break the problem down and start by writing a Perl subroutine to test whether a point is on a line.

A not-too-efficient but easy-to-code routine works as follows. If a point is on a line the sum of the distances from the point to each of the line endpoints is the same as the distance between the endpoints, i.e. the length of the line. You might want to reread the previous sentence and make a little sketch to convince yourself that this is indeed the case. We will first need a routine (in Perl they are called subroutines) to find the distance between two points. C has a built-in function, hypot, but the one-liner in Figure 2 is the Perl equivalent. The sub keyword denotes a subroutine and the subroutine's name follows. There is no parameter list in the definition of a subroutine. Those are comments following the subroutine name in Figure 2. We will supply the parameter list when we call the subroutine, as you will see shortly.


sub hypot # (x, y) returns sqrt(x * x + y * y)
{
sqrt $_[0] * $_[0] + $_[1] * $_[1];
}
Figure 2. Perl Subroutine to Find Distance Between Points


When you call a subroutine in Perl all of the parameters are automatically put into an array @_ (The @ denotes a standard array). In this case the array has two elements, the differences in the x and y coordinates of the two points. Note that the elements are referenced by address so that we need to be careful. Any modifications to these variables would change the values outside this subroutine. In Perl you can also create local variables within a subroutine, or inside of any block for that matter. Normally I would do so, but since this subroutine is so short and is likely to be called many times, I avoided the overhead of local variables. Note again, that a Perl function, sqrt, was called without any parentheses. The return value of a subroutine is the last thing calculated (or you can overrride this behavior with a specific return value). Here the Pythagorean result is the last value calculated, so no explicit return is necessary. You will probably find, as I have, that explicit return statements are rarely needed.

Now we need to tell our program what points and areas are. In C we seem to have the advantage of structures at our disposal. We would probably set up something like the typedef and declaration in Figure 3.


typedef struct
   {
   double x,
          y;
   } POINT;

POINT *polygon_p;
Figure 3. Typedef and Declaration


A point has x and y coordinates and a polygon has three or more points. Every time we need a polygon we can malloc the space for it and fill it with points--and free the space when we are done with it. This is where Perl shines.

Perl has neither explicit memory allocation nor structures. The good news is that we do not have to concern ourselves with memory--it is there when we need it. The bad news is that we have to come up with some system like C's structures. This turns out to be trivial. Let us put everything into strings. Perl takes care of strings of any length, relieving us of the pesky problem of counting characters. You do not even have to add one for the terminating null character--Perl does not use a terminator.

Our point will simply be a string containing two doubles joined by a comma; our line will be a string containing two points joined by a colon; and our polygon will be a string containing three or more points joined by colons, like so:

$point = "1.0,2.0";
$line = "0.0,0.0:0.0,1.0";
$polygon = "1.0,2.0:6.0,5.0:0.0,3.1";

These "structures" turn out to be easy to scan visually and very easy to manage, thanks to Perl's join and split functions. Figure 4 is the point_on_line subroutine I developed using the above scheme.


 
$epsilon = 1.0e-10;

sub point_on_line #(point, line) returns 0 or 1
{ # if on - sum of distance to ends should be distance between ends
local($point, $line) = @_;
local($p_x, $p_y) = split(",", $point);
local($l_b, $l_e) = split(":", $line);
local($l_b_x, $l_b_y) = split(",", $l_b);
local($l_e_x, $l_e_y) = split(",", $l_e);

&fabs(&hypot($p_x - $l_b_x, $p_y - $l_b_y) +
      &hypot($p_x - $l_e_x, $p_y - $l_e_y) -
      &hypot($l_e_x - $l_b_x, $l_e_y - $l_b_y)) < $epsilon;
}

sub fabs # (x) returns absolute value of x
{
local($rv) = @_;

$rv = -$rv if $rv < 0.0;
$rv;
}
Figure 4. point-on-line Subroutine

This subroutine is called with two strings, i.e.

&point_on_line($point, $line);
and returns one if the point is on the line, and zero otherwise. Note that in this case I put the calling parameters into local arrays for safety and ease of maintenance--$point and $line are easier to remember next week than $_[0] and $_[1]. The split function's use should be obvious. The point string gets split into two coordinate strings by the comma, and the line string gets broken into two point strings by the colon. Then the points at the beginning and end of the line, $l_b and $l_e, get broken down into their respective coordinates. Finally the coordinates are passed to our hypot function. Note how the & sign is used to prefix a subroutine for the call. Perl has no equivalent to C's fabs so I quickly rolled my own in Figure 4. I used the $epsilon variable to avoid floating point roundoff problems.

We have been building this thing from the bottom up. So far we have a routine to test whether a point is on a line. Our goal is a routine to test whether a point is within a polygon. Let me give you an overview of how we are going to solve the problem, provide you with two more Perl routines (which you should be able to read now), and then show you the external "workhorse" routine that does the final inside/on/outside determination.

Our polygon is a closed figure and every point is either inside, outside, or on the boundary (there is actually a mathematical statement called Jordan's Curve Theorem that proves this!). If we can establish a point that is guaranteed to be outside of our polygon, we can test the "insideness" of any point by connecting this test point to our "outside" point by a straight line and counting the number of times the line crosses the polygon. Clearly if the line never crosses the polygon, the point is outside; if it crosses once the point is inside; and, in general, if the line crosses the polygon's boundary an odd number of times, the point is inside. An even number of crossings means the point is outside. To put it another way, every time the line crosses the polygon it changes from being either inside to outside or vice-versa. Starting from the outside point our first crossing (if any) puts us inside; the next crossing (if any) puts us back outside; etc., odd-inside, even-outside. So, it looks like we will need a routine to test whether two lines intersect.

One way of testing for line intersection is to make sure both endpoints of each line are on opposite sides of the other line. (Reread and make a few pictures, as before.) This leads to our final routine. Now we will test whether, when we walk along a line from the beginning point to the opposite end and then turn to go straight to another point, we turn counterclockwise or clockwise. This discussion and the resulting routines are modeled after Sedgewick's in "Algorithms in C", pages 349-354 (Robert Sedgewick, Addison Wesley, 1990). Let us start with the counterclockwise subroutine, ccw in Figure 5.


sub ccw # (three points) return -1, 0, or 1
{
local(@points) = @_;
local($rv) = 0;
local($dx1, $dx2, $dy1, $dy2, $p0x, $p0y, $p1x, $p1y, $p2x, $p2y);

($p0x, $p0y) = split(",", $points[0]);
($p1x, $p1y) = split(",", $points[1]);
($p2x, $p2y) = split(",", $points[2]);

$dx1 = $p1x - $p0x;
$dy1 = $p1y - $p0y;
$dx2 = $p2x - $p0x;
$dy2 = $p2y - $p0y;

switch:
   {
   $rv =  1, last if $dx1 * $dy2 > $dy1 * $dx2;
   $rv = -1, last if $dx1 * $dy2 < $dy1 * $dx2;
   $rv = -1, last if ($dx1 * $dx2 < 0.0) || ($dy1 * $dy2 < 0.0);
   $rv =  1, last if ($dx1 * $dx1 + $dy1 * $dy1) < ($dx2 * $dx2 + $dy2 * $dy2);
   }

$rv;
}
Figure 5. Counterclockwise (ccw Subroutine)

The ccw routine accepts three points and compares the slope of the line from the second to the third with the slope of the line from the first to the second. It is carefully constructed to handle vertical lines and even the case of collinear points. Note how the input points are collected into a local array, @points, avoiding side effects and making the program easy to maintain and understand. The points each get split into their respective coordinates with the test being carried out in a block labeled switch (for obvious reasons). By now you have probably guessed that Perl allows labels, just like C. There is no specific switch construct in Perl, however. The block in ccw above is Perl's way of building a switch. The last keyword immediately exits the block and sometimes the Perl interpretor is smart enough to convert the block to a C switch statement internally (although not in this case).


sub intersect # (two lines) returns 0 or 1
{
local($l1, $l2) = @_;
local($l1_b, $l1_e) = split(":", $l1);
local($l2_b, $l2_e) = split(":", $l2);

&ccw($l1_b, $l1_e, $l2_b) * &ccw($l1_b, $l1_e, $l2_e) <= 0 &&
&ccw($l2_b, $l2_e, $l1_b) * &ccw($l2_b, $l2_e, $l1_e) <= 0;
}
Figure 6. Intersection Subroutine


Figure 6 is the intersection routine. It is very straightforward. The lines are split into endpoints and the endpoints are tested for "sideness".

We now have all the building blocks to construct our "inside" subroutine. First we connect our test point with an "outside" point via a straight line. Then we walk around the polygon, testing each polygon side in turn, accumulating the intersections of the sides with the test line. If the number of intersections is odd the point is inside, and vice-versa.


$big_float = 1.0e7;

sub lower_left_index # (polygon) returns index of lower left corner
{
local($polygon) = @_;
local($index) = 0;
local(@vertices) = split(":", $polygon);
local($x_min, $y_min) = split(",", $vertices[0]);
local($i, $x, $y);

for($i = 1; $i <= $#vertices; $i++)
   {
   ($x, $y) = split(",", $vertices[$i]);
   if(($y < $y_min) || (($y == $y_min) && ($x < $x_min)))
      {
      $x_min = $x;
      $y_min = $y;
      $index = $i;
      }
   }

$index;
}

sub inside # (point, polygon) returns 0 or 1
{
local($point, $polygon) = @_;
local(@vertices) = split(":", $polygon);
local($index) = &lower_left_index($polygon);
local($last_index) = $index ? $index - 1 : $#vertices;
local($count, $holding_point) = (0, 0);
local($i, $lp, $lt, $vertex, $x, $y, $big_x_point);
local($check_index) = $index; # true if index is not zero

OUTER: for(;;)
   { # one pass loop
   for($i = 0, $vertex = $vertices[$#vertices]; $i <= $#vertices; $i++)
      {
      $lp = join(":", $vertex, $vertices[$i]);
      $vertex = $vertices[$i];

      if(&point_on_line_2($point, $lp))
         {
         $count = 1;
         print "Point on boundary\n" if defined $verbose;
         last OUTER;
         }
      }

   ($x, $y) = split(",", $point);
   $big_x_point = join(",", $big_float, $y);
   $lt = join(":", $point, $big_x_point);

   for($i = 0; $i <= $#vertices; $i++)
      {
      if(&point_on_line_2($vertices[$index], $lt))
         {
         $holding_point = 1;
         }
      else
         {
         if(!$holding_point)
            {
            $lp = join(":", $vertices[$last_index], $vertices[$index]);
            $count++ if &intersect($lp, $lt);
            }
         elsif($holding_point &&
               (&ccw($point, $big_x_point, $vertices[$index]) !=
                &ccw($point, $big_x_point, $vertices[$last_index])))
            {
            $count++;
            }

         $last_index = $index;
         $holding_point = 0;
         }

      $index++;

      if($check_index && ($index == @vertices))
         {
         $check_index = 0;
         $index = 0;
         }
      }

   last;
   } # one pass "loop"

$count & 1;
}
Figure 7. lower-left-index and inside Subroutines


The pair of subroutines in Figure 7 are Perl implementations of functions suggested by Sedgewick in the section "Inclusion in a Polygon" (pages 353-355), although more complete. The lower_left_index subroutine returns the index of the polygon point with the smallest x coordinate among all points with the smallest y coordinate. This is necessary to account for the special cases when a polygon vertex is on the test line. Note how, in the third line within the block, the @vertices array is automatically constructed by splitting the $polygon string with colons. Each element of the @vertices array is a pair of coordinates separated by a comma, one for each vertex. Whenever we need individual x, y pairs the split function is called, as we have seen before. This occurs before the for loop to initialize $x_min and $y_min, and inside the loop to generate a new test pair $x, $y. The upper limit in the loop variable $i is $#vertices, which is the index of the last member of @vertices. Perl automatically keeps one of these variables for each array. The last statement in this routine, $index; simply establishes the return value.

The inside subroutine is our "workhorse" function. It is admittedly fairly complicated, but you should be able to follow the logic if you have read this far. Here is some help. The variable $index is used to walk from vertex to vertex around the polygon starting from the index returned by lower_left_index. If this index is anything other than zero, it will need to be reset after the vertex with the largest index is encountered. The variable $check_index keeps track of both whether this resetting will be necessary and, if so, whether it has been done yet. The variable $last_index is the index of the last vertex that was not on the test line. Generally this is the index of the vertex "behind" $index.

The OUTER label takes advantage of one of Perl's handiest features, the ability to exit effortlessly from nested loops. You can do this in C if you like goto. In the present example the polygon sides are created via the join function using the colon as the separator. The sides are stored in $lp. If the test point is on a polygon edge, there is no need to test further, so the OUTER loop is exited immediately. Note that setting $count to one in this case is equivalent to counting the boundary as inside the polygon, since one is an odd number. It would be trivial to count the boundary as outside or even as a special case by modifying the if block containing the statement: last OUTER;

If the point is not on an edge, a horizontal line from the test point to an outside point with an x coordinate equal to $big_float is constructed and stored in $lt. The remainder of this function tests for either line intersection or "sideness" depending on whether the previous vertex was on the test line, incrementing $count as appropriate. The return value of the inside function is 1, if the point is inside, and 0 if the point is outside.


 
#!/usr/local/bin/perl

while(<DATA>)
   {
   chop;
   $polygon .= $_ . ":";
   }

chop $polygon;

for(;;)
   {
   print "Enter x and y separated by a comma (q to quit): ";
   chop($point = <STDIN>);
   last if $point =~ /^[qQ]/;

   print("No comma!  Try again.\n"), redo if $point !~ /,/;
   $point =~ s/ +//g;
   print "Checking point: $point\n";

   printf "%s\n", &inside($point, $polygon) ? "inside" : "outside";
   }

__END__
5.0,4.0
0.0,0.0
10.0,5.0
0.0,10.0
Figure 8. Driver Routine for Inside Subroutine


A simple driver routine for the inside subroutine is presented in Figure 8. This routine reads its data from the end of the perl program itself. Anything following the line:

__END__
is considered data and is accessed through the (automatically opened) DATA file handle. Two new operators introduced in this driver are "." which concatenates two strings and ".=" which appends one string to another. That is, ".=" is to "." as "+=" is to "+". The chop function removes the last character from each element of its argument list. Note how the line:
chop $polygon;
trims the final colon from the polygon string, so that it is a legitimate polygon. Replace my data with your own if you want to run this driver, but be sure to put a comma between the x and y coordinates.

Summary

You have seen two examples of how Perl can be used for prototyping. I hope that from these examples you have gained a feel for Perl's syntax. More importantly, I hope that you have seen how using Perl can free you from concentrating on programming specific details, like memory allocation. Instead, you can direct your efforts toward getting your algorithm up and running. I have discovered that, in many cases, the Perl prototype was sufficient for my purposes, saving me the time of coding the program in C or C++ at all!

When I do recode a prototyped algorithm from Perl to another language, I have found that it is easy to change gears. The logic is behind me, freeing me to concentrate on C specifics, memory allocation/deallocation, input/output, error reporting, etc.

My suggestion to the reader is to program a simple application in Perl and see for yourself how this very elegant and powerful language works. You may not save any time with the first program or two, but it will not be long before the benefits of Perl appear. If you feel ambitious, try writing a routine to replace my point_on_line. I mentioned earlier that my algorithm for testing whether a point is on a line is not very efficient. Another, more efficient scheme, is to first check whether the point's x coordinate is within the x range of the line and, if so, whether the point's y coordinate satisfies the equation of the line. Vertical lines are special cases.

Among the many algorithms I have prototyped in Perl are LZW data compression (the same as used in the UNIX compress utility), RSA encryption, many matrix operations including eigenvalue/eigenvector determination and a code generator that outputs C code from a database. I even have a little program called "perls" that reads a database of perl programming tips and prints a random tip to the screen. [I can provide this program to The Linux Journal and/or its audience via Internet. Let me know if you are interested.]

[Yes, we are. We would like to put it on our web site, perhaps even in a cgi script.]

Jim Shapiro is a consultant specializing in programming mathematical algorithms. He is presently developing a GIS system for a telecommunications company. When he isn't on his Linux system hacking away in C or Perl he can often be found on the squash courts. Jim is a founding member of LUGOR, the Linux User's Group Of the Rockies.


Perl Resources

Programming Perl by Larry Wall and Randal L. Schwartz, O'Reilly & Associates, Inc., 1992. If you are serious about learning Perl, this is the book to read. It is all here, including some very sophisticated examples. Not recommended for beginners, however.

Learning Perl by Randal L. Schwartz, O'Reilly & Associates, Inc., 1993. A tutorial divided into lesson sized chapters.

Teach Yourself Perl in 21 Days by David Till, SAMS publishing, 1995. My personal favorite. Looks more daunting (841 pages) than it is. I got so excited I read it in seven days. Read this one, then "Programming perl", and you will soon be an expert.

The "man" pages. Not bad if you want to get the flavor of the language, but mine seem dated.