New in “Working with Bits”

You can read the draft of Chapter 16 in O’Reilly Atlas.


Bits and bit vectors in Perl haven’t changed since the first edition, so there’s not much to update in this chapter. I thought that Abigail’s prime number regex might deserve some space, but it turns out that it didn’t.

I also thought that the octal prefix 0o had made it in since it’s proposal back in the v5.15 days. It had some interesting parsing problems, and eventually the proposal was dropped.

When I last worked on this book, I was running a 32-bit perl with v5.8. Now I have a 64-bit v5.18. The output of some of the Devel::Peek examples changed a little, so I updated those.

I added an example of using a bit vector to cache the positions of prime numbers. I can create a big string where each bit represents one number. When it’s time to check if a number is prime, I simply check the right bit.

Finally, at the end of the chapter, I updated the URLs for “Further Reading”. In eight years the URLs have moved around a bit.

Abigail’s prime number checker regex

As I was working on the “Working with Bits” chapter of Mastering Perl, I thought about the prime number checker from Abigail. I’ve known about that for years. Abigail came up with it in 1998, and it’s listed in the CPAN JAPH file. I never bothered to look into how it worked since I figured it was something clever:


% perl -E 'say "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 1234567

His use of 1 made me think there might be something interesting to say about bit vectors, but that turned out not to be so. I could use any character (or sequence) in place of the 1.

Still, I’ll break it down. The program has this structure:

say "Prime" if STRING !~ REGEX

The !~ is a negated match. The condition is true if the string does not match the regular expression. If the string matches, then it’s a composite number so the condition is false. It might make you feel better to negate the condition with unless instead:

say "Prime" unless STRING =~ REGEX

Abigail’s pattern looks a bit more complicated than it really is because of the 1 character being both a literal and a back reference:

^1?$|^(11+?)\1+$

When I see regexes where I don’t immediately see what is going on, I look for some hook in it so I can break it down. In this regex, there’s the alternation, and both sides of the alternation use the beginning of line (^) and end of line anchors ($). I suspect Abigail did that for symmetry and to avoid the parentheses he’d need to solve the precedence problem. Otherwise, it would look like this, spread out a bit:

^ (?:  1? | (11+?) \1+ ) $ 

Assuming that the anchors match, the inside portion is two branches:

1? | (11+?) \1+

It’s one of these:

1?
(11+?) \1+

The first one is easy. It’s zero or one characters, and nothing more. That matches the trivial condition where the input number is 1. If it’s not exactly one character, the regex tries the other side of the alternation:

(11+?) \1+

There are two parts to this: the capture and the back reference. The back reference can match one or more times. So, it’s looking for complete groups, and nothing more, of the same thing. The actual thing we match hardly matter. It comes down being able to break up the string evenly into groups.

For example, I take the number nine. The string for that is a sequence of nine 1‘s:

111111111

Can I break up that string into equal groups with nothing left over? Sure; there are three groups of three:

111 111 111

The number of groups represent one factor, and the length of any one of the groups represents another factor. Since it finds two factors, the number must be composite, so not prime.

I can do it again with seven:

1111111

There isn’t a way to create equally-sized groups. The regex tries the longest string it can for the capture, then checks that it can match the back reference one or more times. It keeps failing:

111111 1
11111 11
1111 111
111 111 1
11 11 11 1

Thus, the pattern doesn’t match 1111111, so the number is prime because there’s not another factor.

That’s the basic idea, but the regex engine gets there in many more steps. I can see how it tries the longest first group, fails, and keeps backtracking. I ran this under Regexp::Debugger.

That’s all nice, but there’s nothing about bit vectors there. Oh well.

Mastering Perl 2e draft in O’Reilly Atlas

O’Reilly Media has been moving to a new publishing platform that’s making things much easier for everyone I think. Mastering Perl, Second Edition is my first experience with this.

Besides moving from Subversion to Git (yay!), my intermediate draft sources are available in real time through O’Reilly Atlas. You can read what I’ve got so far for free, sponsored by OSCON. Mostly what you’ll see there right now is the first edition since I’ve been slow to move over what I was doing in the old process to the new process.

This is your chance to influence what happens in the second edition. The structure and topics mostly stay the same, but everything will be updated.

(Mis)adventures in benchmarking

In Adventures in Benchmarking, Part 1, David Golden ran a benchmark program to compare Path::Tiny‘s slurp against itself several times. He thought he had discovered a problem with the ordering of code in Benchmark, which runs the code in the lexigraphical order of their labels. He created several subroutines that did the same thing:


use v5.10;
use strict;
use warnings;
use Benchmark qw( cmpthese );
use Path::Tiny;
 
my $file = "ascii-large";
 
my $count = -1;
cmpthese(
    $count,
    {
        map { $_ => sub { path($file)->slurp } } ("a" .. "l")
    }
);

His results looked surprising, and I was a bit excited about this because I was working on the “Benchmarking” chapter of Mastering Perl the day that I read this. These numbers would make an excellent story:

l 1599/s   -- -14% -15% -19% -20% -20% -21% -21% -21% -23% -23% -23%
h 1866/s  17%   --  -1%  -6%  -6%  -7%  -8%  -8%  -8% -10% -10% -10%
i 1879/s  17%   1%   --  -5%  -6%  -6%  -8%  -8%  -8%  -9%  -9%  -9%
j 1981/s  24%   6%   5%   --  -1%  -1%  -3%  -3%  -3%  -4%  -4%  -4%
f 1995/s  25%   7%   6%   1%   --  -0%  -2%  -2%  -2%  -3%  -4%  -4%
b 1999/s  25%   7%   6%   1%   0%   --  -2%  -2%  -2%  -3%  -3%  -4%
g 2033/s  27%   9%   8%   3%   2%   2%   --  -0%  -0%  -2%  -2%  -2%
c 2035/s  27%   9%   8%   3%   2%   2%   0%   --  -0%  -2%  -2%  -2%
a 2036/s  27%   9%   8%   3%   2%   2%   0%   0%   --  -1%  -2%  -2%
e 2067/s  29%  11%  10%   4%   4%   3%   2%   2%   2%   --  -0%  -0%
k 2070/s  29%  11%  10%   4%   4%   4%   2%   2%   2%   0%   --  -0%
d 2074/s  30%  11%  10%   5%   4%   4%   2%   2%   2%   0%   0%   --

I tried it myself on my Mac Pro running OS X.8 with a Perl v5.14.2 compiled with the defaults. I didn’t get the same dramatic results:

macpro> perl5.14.2 path-tiny.pl /Volumes/Perl/BackPAN/modules/06perms.txt.gz
     Rate   l   j   g   k   a   b   i   e   d   f   c   h
l 1382/s  -- -1% -1% -1% -1% -1% -1% -1% -1% -2% -3% -3%
j 1395/s  1%  --  0% -0% -0% -0% -0% -0% -0% -1% -2% -2%
g 1395/s  1%  0%  -- -0% -0% -0% -0% -0% -0% -1% -2% -2%
k 1395/s  1%  0%  0%  -- -0% -0% -0% -0% -0% -1% -2% -2%
a 1395/s  1%  0%  0%  0%  -- -0% -0% -0% -0% -1% -2% -2%
b 1395/s  1%  0%  0%  0%  0%  -- -0% -0% -0% -1% -2% -2%
i 1395/s  1%  0%  0%  0%  0%  0%  -- -0% -0% -1% -2% -2%
e 1399/s  1%  0%  0%  0%  0%  0%  0%  -- -0% -1% -2% -2%
d 1400/s  1%  0%  0%  0%  0%  0%  0%  0%  -- -1% -2% -2%
f 1408/s  2%  1%  1%  1%  1%  1%  1%  1%  1%  -- -1% -1%
c 1422/s  3%  2%  2%  2%  2%  2%  2%  2%  2%  1%  -- -0%
h 1422/s  3%  2%  2%  2%  2%  2%  2%  2%  2%  1%  0%  --

macpro> perl5.14.2 path-tiny.pl /Volumes/Perl/BackPAN/modules/06perms.txt.gz 
    Rate   a   b   l   k   i   c   e   d   j   h   f   g
a 1370/s  -- -1% -2% -2% -2% -2% -2% -2% -2% -2% -3% -3%
b 1382/s  1%  -- -1% -1% -1% -1% -1% -1% -1% -1% -2% -2%
l 1395/s  2%  1%  --  0%  0% -0% -0% -0% -0% -0% -1% -1%
k 1395/s  2%  1%  0%  --  0% -0% -0% -0% -0% -0% -1% -1%
i 1395/s  2%  1%  0%  0%  -- -0% -0% -0% -0% -0% -1% -1%
c 1395/s  2%  1%  0%  0%  0%  -- -0% -0% -0% -0% -1% -1%
e 1395/s  2%  1%  0%  0%  0%  0%  -- -0% -0% -0% -1% -1%
d 1395/s  2%  1%  0%  0%  0%  0% -0%  -- -0% -0% -1% -1%
j 1399/s  2%  1%  0%  0%  0%  0%  0%  0%  --  0% -1% -1%
h 1399/s  2%  1%  0%  0%  0%  0%  0%  0%  0%  -- -1% -1%
f 1408/s  3%  2%  1%  1%  1%  1%  1%  1%  1%  1%  -- -0%
g 1408/s  3%  2%  1%  1%  1%  1%  1%  1%  1%  1%  0%  --

More interestingly, my results seem to disprove his assertion that there are internal ordering effects at play. That is, Benchmark::Forking doesn’t matter. I was curious about that in his results too, but with modern filesystems doing what they do, the side effects are likely outside of his program. The ordering of the tests relative to each other is unrepeatable, even when I switch to Benchmark::Forking, which only isolates effects inside the program.

I figured this might not be an issue with my particular operating system and setup, so I also tried it on a FreeBSD machine. I got the same non-results. There’s no consistent ordering of tests relative to each other with Benchmark or Benchmark::Forking, and there’s no consistent range of values.

freebsd> perl5.10.1 path-tiny.pl dp-dump.mysql
    Rate    b    h    j    i    a    l    k    g    f    d    e    c
b 1137/s   --  -7% -11% -11% -11% -11% -12% -12% -12% -12% -13% -13%
h 1218/s   7%   --  -4%  -4%  -5%  -5%  -5%  -5%  -5%  -6%  -6%  -6%
j 1273/s  12%   4%   --   0%  -1%  -1%  -1%  -1%  -1%  -2%  -2%  -2%
i 1273/s  12%   4%   0%   --  -1%  -1%  -1%  -1%  -1%  -2%  -2%  -2%
a 1283/s  13%   5%   1%   1%   --  -0%  -0%  -0%  -0%  -1%  -2%  -2%
l 1283/s  13%   5%   1%   1%   0%   --  -0%  -0%  -0%  -1%  -2%  -2%
k 1286/s  13%   6%   1%   1%   0%   0%   --   0%   0%  -1%  -1%  -1%
g 1286/s  13%   6%   1%   1%   0%   0%   0%   --   0%  -1%  -1%  -1%
f 1286/s  13%   6%   1%   1%   0%   0%   0%   0%   --  -1%  -1%  -1%
d 1295/s  14%   6%   2%   2%   1%   1%   1%   1%   1%   --  -1%  -1%
e 1303/s  15%   7%   2%   2%   2%   2%   1%   1%   1%   1%   --   0%
c 1303/s  15%   7%   2%   2%   2%   2%   1%   1%   1%   1%   0%   --

freebsd> perl5.10.1 path-tiny.pl dp-dump.mysql
    Rate   g   h   f   i   d   j   e   l   k   c   b   a
g 1168/s  -- -1% -2% -2% -4% -4% -5% -5% -5% -6% -7% -8%
h 1184/s  1%  -- -1% -1% -3% -3% -3% -3% -3% -4% -6% -6%
f 1193/s  2%  1%  --  0% -2% -2% -3% -3% -3% -4% -5% -6%
i 1193/s  2%  1%  0%  -- -2% -2% -3% -3% -3% -4% -5% -6%
d 1218/s  4%  3%  2%  2%  --  0% -1% -1% -1% -1% -3% -4%
j 1218/s  4%  3%  2%  2%  0%  -- -1% -1% -1% -1% -3% -4%
e 1227/s  5%  4%  3%  3%  1%  1%  --  0%  0% -1% -2% -3%
l 1227/s  5%  4%  3%  3%  1%  1%  0%  --  0% -1% -2% -3%
k 1227/s  5%  4%  3%  3%  1%  1%  0%  0%  -- -1% -2% -3%
c 1236/s  6%  4%  4%  4%  1%  1%  1%  1%  1%  -- -1% -2%
b 1254/s  7%  6%  5%  5%  3%  3%  2%  2%  2%  1%  -- -1%
a 1264/s  8%  7%  6%  6%  4%  4%  3%  3%  3%  2%  1%  --

I suspect that there are a couple of runs of the target code that take abnormally long and skew the results. These happen at certain times regardless of the code being run. A cache adjusts itself, another program is asking for the I/O channels, a disk has to wake up, or something else. In short, these numbers are useless because they treat every run the same despite what happened.

Dumbbench can handle this situation. It times the code, running it enought time to get to a certain precision by ignoring outliers. The Benchmark::Dumb module is a mostly compatible interface, although I specify a number of maximum iterations and a target precision instead of a number of iterations:

#!/usr/bin/perl
# path-tiny.pl

use Benchmark::Dumb qw( cmpthese );
use Path::Tiny;
 
my $file = $ARGV[0];
 
my $count = 1000.01;
cmpthese(
    $count, {
        map { $_ => sub { path($file)->slurp } } ("a" .. "l")
        }
        );

The ordering of tests is still not consistent, but the spread of results seems more reasonable and repeatable. This is all running in the same Perl process. The results can be a bit wide since the module also reports uncertainties if it can, but the widget thing should give you the raw text:

freebsd> perl5.10.1 path-tiny.pl dp-dump.mysql
           Rate           k           d           g           e           f           c           l           h           i           j          b     a
k 1135.1+-1.5/s          --       -0.6%       -0.6%       -0.7%       -0.7%       -1.0%       -2.1%       -3.2%       -3.7%       -3.9%      -4.5% -4.7%
d 1141.5+-1.6/s  0.56+-0.2%          --        0.0%       -0.1%       -0.1%       -0.5%       -1.6%       -2.6%       -3.1%       -3.4%      -4.0% -4.1%
g 1141.9+-1.6/s   0.6+-0.2%  0.03+-0.2%          --       -0.1%       -0.1%       -0.5%       -1.5%       -2.6%       -3.1%       -3.4%      -4.0% -4.1%
e 1142.9+-1.5/s 0.68+-0.19% 0.12+-0.19% 0.09+-0.19%          --        0.0%       -0.4%       -1.4%       -2.5%       -3.0%       -3.3%      -3.9% -4.0%
f 1142.9+-1.5/s 0.69+-0.19%  0.12+-0.2% 0.09+-0.19%    0+-0.19%          --       -0.4%       -1.4%       -2.5%       -3.0%       -3.3%      -3.9% -4.0%
c 1147.1+-1.4/s 1.06+-0.19% 0.49+-0.19% 0.46+-0.19% 0.37+-0.18% 0.37+-0.18%          --       -1.1%       -2.2%       -2.7%       -2.9%      -3.5% -3.6%
l 1159.6+-1.5/s 2.16+-0.19% 1.59+-0.19% 1.55+-0.19% 1.46+-0.18% 1.46+-0.19% 1.09+-0.18%          --       -1.1%       -1.6%       -1.9%      -2.5% -2.6%
h 1172.4+-1.4/s 3.29+-0.19% 2.71+-0.19% 2.68+-0.19% 2.59+-0.18% 2.58+-0.19% 2.21+-0.18% 1.11+-0.18%          --       -0.5%       -0.8%      -1.4% -1.5%
i 1178.6+-1.2/s 3.83+-0.18% 3.25+-0.18% 3.22+-0.18% 3.13+-0.17% 3.12+-0.17% 2.75+-0.17% 1.64+-0.17% 0.53+-0.16%          --       -0.3%      -0.9% -1.0%
j 1181.6+-1.2/s  4.1+-0.18% 3.52+-0.18% 3.48+-0.18% 3.39+-0.17% 3.39+-0.17% 3.01+-0.17%  1.9+-0.17% 0.79+-0.16% 0.26+-0.15%          --      -0.6% -0.7%
b   1189+-1.7/s 4.75+-0.21% 4.16+-0.21% 4.13+-0.21%  4.04+-0.2%  4.03+-0.2%  3.65+-0.2%  2.54+-0.2% 1.41+-0.19% 0.88+-0.18% 0.62+-0.18%         -- -0.1%
a 1190.5+-1.7/s 4.88+-0.21%  4.3+-0.21% 4.26+-0.21%  4.17+-0.2%  4.17+-0.2%  3.79+-0.2%  2.67+-0.2% 1.54+-0.19% 1.01+-0.18% 0.75+-0.18% 0.13+-0.2%    --

freebsd> perl5.10.1 path-tiny.pl dp-dump.mysql
             Rate    g     l     k     j     i     h     f     a     b     c     d     e
g 1240.26+-0.72/s   -- -0.5% -0.9% -1.0% -1.0% -1.1% -1.4% -1.6% -1.7% -1.7% -1.7% -1.8%
l 1246.11+-0.44/s 0.5%    -- -0.4% -0.5% -0.5% -0.6% -0.9% -1.1% -1.2% -1.2% -1.3% -1.4%
k 1251.06+-0.33/s 0.9%  0.4%    -- -0.1% -0.1% -0.2% -0.5% -0.7% -0.8% -0.8% -0.9% -1.0%
j 1252.18+-0.31/s 1.0%  0.5%  0.1%    --  0.0% -0.1% -0.4% -0.6% -0.7% -0.8% -0.8% -0.9%
i 1252.33+-0.38/s 1.0%  0.5%  0.1%  0.0%    -- -0.1% -0.4% -0.6% -0.7% -0.7% -0.8% -0.9%
h 1253.57+-0.28/s 1.1%  0.6%  0.2%  0.1%  0.1%    -- -0.3% -0.5% -0.6% -0.6% -0.7% -0.8%
f 1257.46+-0.21/s 1.4%  0.9%  0.5%  0.4%  0.4%  0.3%    -- -0.2% -0.3% -0.3% -0.4% -0.5%
a 1260.16+-0.23/s 1.6%  1.1%  0.7%  0.6%  0.6%  0.5%  0.2%    -- -0.1% -0.1% -0.2% -0.3%
b 1261.46+-0.15/s 1.7%  1.2%  0.8%  0.7%  0.7%  0.6%  0.3%  0.1%    --  0.0% -0.1% -0.2%
c 1261.75+-0.15/s 1.7%  1.3%  0.9%  0.8%  0.8%  0.7%  0.3%  0.1%  0.0%    --  0.0% -0.1%
d 1262.23+-0.19/s 1.8%  1.3%  0.9%  0.8%  0.8%  0.7%  0.4%  0.2%  0.1%  0.0%    -- -0.1%
e 1263.52+-0.14/s 1.9%  1.4%  1.0%  0.9%  0.9%  0.8%  0.5%  0.3%  0.2%  0.1%  0.1%    --

freebsd> perl5.10.1 path-tiny.pl dp-dump.mysql
           Rate           f           i           e           h           g           l           k           j           d           c           b     a
f 1186.1+-1.8/s          --       -0.2%       -0.2%       -0.2%       -0.3%       -0.6%       -0.9%       -1.6%       -2.3%       -3.2%       -3.2% -3.4%
i 1188.1+-1.6/s 0.17+-0.21%          --        0.0%       -0.1%       -0.1%       -0.4%       -0.7%       -1.4%       -2.1%       -3.0%       -3.0% -3.3%
e 1188.2+-1.8/s 0.17+-0.21%     0+-0.2%          --       -0.1%       -0.1%       -0.4%       -0.7%       -1.4%       -2.1%       -3.0%       -3.0% -3.3%
h   1189+-1.6/s 0.24+-0.21%  0.07+-0.2%  0.07+-0.2%          --       -0.1%       -0.3%       -0.7%       -1.4%       -2.0%       -3.0%       -3.0% -3.2%
g 1189.8+-1.8/s 0.31+-0.22% 0.14+-0.21% 0.14+-0.22% 0.07+-0.21%          --       -0.3%       -0.6%       -1.3%       -2.0%       -2.9%       -2.9% -3.1%
l 1192.8+-1.8/s 0.57+-0.21%  0.4+-0.21% 0.39+-0.21%  0.32+-0.2% 0.26+-0.22%          --       -0.3%       -1.1%       -1.7%       -2.6%       -2.7% -2.9%
k 1196.8+-1.7/s  0.9+-0.21%  0.73+-0.2% 0.73+-0.21%  0.66+-0.2% 0.59+-0.21% 0.33+-0.21%          --       -0.7%       -1.4%       -2.3%       -2.3% -2.6%
j 1205.5+-1.6/s  1.63+-0.2% 1.46+-0.19%  1.46+-0.2% 1.39+-0.19% 1.32+-0.21%  1.06+-0.2% 0.72+-0.19%          --       -0.7%       -1.6%       -1.6% -1.9%
d 1213.8+-1.4/s 2.34+-0.19% 2.16+-0.18% 2.16+-0.19% 2.09+-0.18%  2.02+-0.2% 1.76+-0.19% 1.42+-0.19% 0.69+-0.18%          --       -0.9%       -0.9% -1.2%
c 1225.3+-1.3/s  3.3+-0.19% 3.13+-0.18% 3.12+-0.19% 3.05+-0.18% 2.98+-0.19% 2.72+-0.19% 2.38+-0.18% 1.64+-0.17% 0.94+-0.16%          --        0.0% -0.2%
b 1225.4+-1.1/s 3.31+-0.18% 3.14+-0.17% 3.13+-0.18% 3.06+-0.17% 2.99+-0.19% 2.73+-0.18% 2.39+-0.17% 1.65+-0.16% 0.95+-0.15% 0.01+-0.14%          -- -0.2%
a 1228.3+-1.3/s 3.56+-0.19% 3.38+-0.18% 3.38+-0.19% 3.31+-0.18% 3.24+-0.19% 2.97+-0.19% 2.63+-0.18% 1.89+-0.17% 1.19+-0.16% 0.25+-0.15% 0.24+-0.14%    --

freebsd> perl5.10.1 path-tiny.pl dp-dump.mysql
             Rate           b           a     h     i     g     e     c     j     f     d     l     k
b   1177.9+-1.2/s          --       -1.1% -2.5% -2.8% -2.8% -2.8% -2.9% -2.9% -3.1% -3.1% -3.1% -3.2%
a   1191.1+-1.1/s 1.12+-0.14%          -- -1.5% -1.7% -1.7% -1.7% -1.8% -1.9% -2.0% -2.0% -2.0% -2.1%
h 1208.64+-0.64/s 2.61+-0.12% 1.48+-0.11%    -- -0.2% -0.2% -0.3% -0.4% -0.4% -0.5% -0.6% -0.6% -0.7%
i 1211.27+-0.69/s 2.83+-0.12%  1.7+-0.11%  0.2%    --  0.0%  0.0% -0.2% -0.2% -0.3% -0.4% -0.4% -0.5%
g  1211.54+-0.5/s 2.86+-0.12%  1.72+-0.1%  0.2%  0.0%    --  0.0% -0.2% -0.2% -0.3% -0.3% -0.4% -0.4%
e 1211.67+-0.55/s 2.87+-0.12% 1.73+-0.11%  0.3%  0.0%  0.0%    -- -0.1% -0.2% -0.3% -0.3% -0.4% -0.4%
c 1213.44+-0.42/s 3.02+-0.11%  1.88+-0.1%  0.4%  0.2%  0.2%  0.1%    --  0.0% -0.1% -0.2% -0.2% -0.3%
j   1213.6+-0.5/s 3.03+-0.12%  1.89+-0.1%  0.4%  0.2%  0.2%  0.2%  0.0%    -- -0.1% -0.2% -0.2% -0.3%
f  1215.2+-0.38/s 3.17+-0.11%  2.03+-0.1%  0.5%  0.3%  0.3%  0.3%  0.1%  0.1%    --  0.0% -0.1% -0.1%
d 1215.57+-0.36/s  3.2+-0.11%        2.1%  0.6%  0.4%  0.3%  0.3%  0.2%  0.2%  0.0%    --  0.0% -0.1%
l 1215.96+-0.36/s 3.23+-0.11%  2.09+-0.1%  0.6%  0.4%  0.4%  0.4%  0.2%  0.2%  0.1%  0.0%    -- -0.1%
k 1216.84+-0.31/s 3.31+-0.11%        2.2%  0.7%  0.5%  0.4%  0.4%  0.3%  0.3%  0.1%  0.1%  0.1%    --

It’s more interesting to use the dumbbench program because I can get the data as a table instead of as a summary. I create a package that I could pass to dumbbench:

package PathTinyTest;

use v5.10;
use strict;
use warnings;
use Benchmark qw( cmpthese );
use Path::Tiny;

my $file = '/Volumes/Perl/BackPAN/modules/06perms.txt.gz';

# special subroutine for dumbbench --package
sub get_subs_to_benchmark { 'a' .. 'l' }

foreach ( get_subs_to_benchmark() ) {
	no strict 'refs';
	*{"$_"} = sub { path($file)->slurp };
	}

1;

I give that package to dumbbench:

% dumbbench --package PathTinyTest --t pathtiny.dat

Once I the table data, I can pass that to R to plot. It’s much easier to use a proper statistical package than fool with Perl modules. This boxplot.r script comes in the Dumbbench distribution:

file <- commandArgs(trailingOnly=T)[1]
print( file )
t <- read.table(
        file, 
        sep="\t",
        header=T,
        fill=T
        )

base <- sub( "(^[^.]+).*", "\\1", file )
image <- paste( base, "png", sep="." )
png( image )

p <- list(
        boxwex = 0.1,
        ylab   = "Times, s"
        )
boxplot( t, pars=p )

I invoke it:

% Rscript --vanilla --slave boxplot.r pathtiny.dat

And I get a plot:

This box plot shows just what I suspected. Most of the interesting data are pretty much the same. Each circle is something far off the mean, and there's one of h that's really far off. The box for h is also very tall while the others are very short. There's a lot of variability in that particular set of runs.

As with all benchmarks, I try again (and again). The box itself represents the middle 50%, or 1 σ, and the whiskers represent 2 σ. Eventhing else is shown as a circle. The black bar is the mean. Steffen gives some gory details about the distribution in The physicist's way out. Imagine his plots as if you are looking down their Y axes and lining them up next to each other.

So, I think David gets it wrong in that particular benchmark. If there was something odd going on, I'd expect the means to be much different from each other. They aren't.

This is the danger I talk about in the "Benchmarking" chapter. The numbers are only good if they let you understand something. Once you think you understand it, you need to come up with a way to poke that particular thing in a way that demonstrates you know why it does what it does. Here we don't know why the numbers and spread come out as they do or how they relate to each other. Getting numbers that you like better doesn't make them more real.

But, as a true scientist, I also can't say that David is wrong. The most I can say is that he hasn't adequately supported his claim about forking. Or, I should say, that his results might only matter for him, which is another thing I stress in the "Benchmarking" chapter. That no one else can reproduce them on different setups might not matter. He isn't trying to prove a new law of the universe. He just needs to show that when he makes that one change, things work out better for him. Everyone else can run his benchmarks to see if it works out better for them.

This is unrelated to his conclusions about Path::Tiny, which probably is faster than the other modules that do similar things.

Benchmarking HTML entity encoding

I benchmark HTML entity encoding with Surveyor::Benchmark::HTMLEntities using the system I describe in Distribute benchmarks. Already Surveyor::App is making things easier for me.

Tokuhirom shipped HTML::Escape, which implements it encoder in XS, as he describes in his blog post about it. It can be a lot faster than the pure Perl HTML::Entities:

% env survey -p Surveyor::Benchmark::HTMLEntities http://www.perl.org/
Fetching http://www.perl.org/
HTML is 14983 bytes
> (418)
< (418)
& (7)
' (103)
Benchmark: timing 10000 iterations of html_entities, html_escape...
html_entities: 13 wallclock secs (13.75 usr +  0.01 sys = 13.76 CPU) @ 726.74/s (n=10000)
html_escape:  1 wallclock secs ( 0.64 usr +  0.00 sys =  0.64 CPU) @ 15625.00/s (n=10000)

The pure Perl fair fight is also much faster in HTML::Escape:

% env PERL_ONLY=1 survey -p Surveyor::Benchmark::HTMLEntities http://www.perl.org/
Fetching http://www.perl.org/
HTML is 14857 bytes
> (416)
< (416)
& (7)
' (103)
Benchmark: timing 10000 iterations of html_entities, html_escape...
html_entities: 14 wallclock secs (13.74 usr +  0.01 sys = 13.75 CPU) @ 727.27/s (n=10000)
html_escape:  7 wallclock secs ( 7.32 usr +  0.01 sys =  7.33 CPU) @ 1364.26/s (n=10000)

There’s a reason for though: HTML::Escape only handles the characters <, >, &, ‘, and “, while HTML::Entities lets me configure the characters to escape and by default also escapes wide characters.

My Surveyor::App made this simple for me. I created the benchmark, but also ran the target code tests to ensure that I’m returning the same thing. Through that I was able to adjust the target code of HTML::Entities to only escape the same things as HTML::Escape. I might have skipped that step otherwise.

And, now knowing this, I updated the Stackoverflow answer for How can I encode a string for HTML?.

Distribute benchmarks

I want to make it easier for people to share benchmarks. It’s not that hard now: you should be able to post the program you wrote, but not many people do that. That might be partly because those programs are ugly, or they are specific to a set-up, or that we don’t have a good way to distribute programs (although we do!).

Last night I created Surveyor::App. I give it a package name and it loads that module, finds all of the subroutines that start with bench_ then benchmarks them.

I started thinking about this after I added a similar feature to Dumbbench. As I messed around with that idea, I worked on a way to also test all of those subroutines at the same time to ensure that they all returned the same thing. I need to ensure that I’m comparing the same thing and the subroutines are doing the same thing.

I took all the boring bits out and reduced my benchmark code to something like this extract of Surveyor::Benchmark::GetDirectoryListing:

package Surveyor::Benchmark::GetDirectoryListing;

sub set_up {
	my( $class, $directory ) = @_;
	unless( defined $directory ) {
		require Cwd;
		$directory = Cwd::cwd();
		}
	die "Directory [$directory] does not exist!\n" unless -e $directory;
	die "[$directory] does not exist!\n" unless -e $directory;
	chdir( $directory ) or die "Could not change to $ARGV[0]: $!\n";
	my @files = glob( '.* *' );
	printf "$directory has %d files\n", scalar @files;
	}

sub tear_down { 1 }

sub bench_opendir {
	opendir my( $dh ), "."; 
	my @f = readdir( $dh );
	}

sub bench_glob {
	my @f = glob(".* *")
	}

__PACKAGE__;

Nothing in that code cares what’s handling the particular benchmark since I’ve moved all of that into controller module called by the survey program:

% survey -p Surveyor::Benchmark::GetDirectoryListing /etc

From there, the normal benchmarking happens for me. If I want to change the backend, the target code shouldn’t have to change.

My intent it that people will write their benchmarks as a module which they can upload to Github or CPAN or wherever, allowing other people to easily run and improve them.

The next step, which I’m not particularly interested in right now, if outputing the run data. I’ve already provided that feature in I added a similar feature to Dumbbench

Playing with Dumbbench

I’ve been playing with Steffen Müller’s Dumbbench module, which applies some statistical analysis to benchmarking. He explains it all in a couple of blog posts:

The benchmarking part is interesting, but the more interesting part is that Dumbbench can make pictures, unlike the boring, old, ASCII-reporting Benchmark.

Steffen used an interface to ROOT, a huge toolkit out of CERN. I couldn’t compile it on my Mac, so I started playing with R to make my plots. Instead of using ROOT, I added a feature to write the data to a table so I could feed that table to Rscript:

$ dumbbench --code='++$i' --code='$i++' --code='$i+=1' --table=code.dat --maxiter 10000
$ Rscript --vanilla --slave r/boxplot.r code.dat

From that I get an impressive picture that looks as good as the one from ROOT:

I don’t want to supply code on the command line, though. Dumbbench has three sources (or, instances). There’s that --code switch and there’s a final command line argument. The module can take subroutines too, so I added a --package switch. I can put my code to benchmark in a file that I can save and edit and reüse.

$ dumbbench --package=SomePackage --table=some_package.dat --maxiter 10000

There are still some things I want to do. I’d love to figure out how R’s read_table can take its data from standard input so I can get rid of the intermediate file and pipe directly to R.

More interesting than that, however, is fancy stuff I might be able to do in that package. Every subroutine should accomplish the same task, and perhaps return the same thing. The package would be able to automatically test each of the subroutines to verify that. I haven’t thought too much about that yet.

The Storable security problem

Recently, people have moved to close, or at least document, a security issue with Storable. This core module serializes and deserializes Perl data structures, and, as in many places in Perl, tries to be more helpful than we really want. In Mastering Perl, I talk about lightweight persistence in Chapter 14; Storable is a big part of that chapter.

There are two major problems. Someone can force Storable to load arbitrary modules, and someone can possibly run unexpected code. Continue reading “The Storable security problem”

Pondering the second edition of Mastering Perl

I’m putting together the proposal for Mastering Perl, 2nd Edition. O’Reilly would like it published by next summer.

I wrote the first edition in 2005 and O’Reilly published it in 2006. Now it’s seven years after I wrote the original. What do I need to update? Many of Perl’s new features have already shown up in Learning Perl and Intermediate Perl, but Mastering Perl has never really been about listing Perl features.

Some things I already have on my mind:

  • Beefing up the logging chapter
  • A better example for the Tie chapter, using NOAA’s Hourly Surface data instead of the made up DNA example
  • Adding JSON to the data persistence chapter
  • Various updates to the Advanced Regex chapter
  • __SUB__ examples
  • Some fancier examples for the modulinos chapter
  • Devel::NYTProf in the benchmarking chapter

I don’t know if I need to add any major topics. What do you think?