Category Archives: Working with Bits

New in “Working with Bits”

You can read the draft of Chapter 16 in O’Reilly Atlas.


Bits and bit vectors in Perl haven’t changed since the first edition, so there’s not much to update in this chapter. I thought that Abigail’s prime number regex might deserve some space, but it turns out that it didn’t.

I also thought that the octal prefix 0o had made it in since it’s proposal back in the v5.15 days. It had some interesting parsing problems, and eventually the proposal was dropped.

When I last worked on this book, I was running a 32-bit perl with v5.8. Now I have a 64-bit v5.18. The output of some of the Devel::Peek examples changed a little, so I updated those.

I added an example of using a bit vector to cache the positions of prime numbers. I can create a big string where each bit represents one number. When it’s time to check if a number is prime, I simply check the right bit.

Finally, at the end of the chapter, I updated the URLs for “Further Reading”. In eight years the URLs have moved around a bit.

Abigail’s prime number checker regex

As I was working on the “Working with Bits” chapter of Mastering Perl, I thought about the prime number checker from Abigail. I’ve known about that for years. Abigail came up with it in 1998, and it’s listed in the CPAN JAPH file. I never bothered to look into how it worked since I figured it was something clever:


% perl -E 'say "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 1234567

His use of 1 made me think there might be something interesting to say about bit vectors, but that turned out not to be so. I could use any character (or sequence) in place of the 1.

Still, I’ll break it down. The program has this structure:

say "Prime" if STRING !~ REGEX

The !~ is a negated match. The condition is true if the string does not match the regular expression. If the string matches, then it’s a composite number so the condition is false. It might make you feel better to negate the condition with unless instead:

say "Prime" unless STRING =~ REGEX

Abigail’s pattern looks a bit more complicated than it really is because of the 1 character being both a literal and a back reference:

^1?$|^(11+?)\1+$

When I see regexes where I don’t immediately see what is going on, I look for some hook in it so I can break it down. In this regex, there’s the alternation, and both sides of the alternation use the beginning of line (^) and end of line anchors ($). I suspect Abigail did that for symmetry and to avoid the parentheses he’d need to solve the precedence problem. Otherwise, it would look like this, spread out a bit:

^ (?:  1? | (11+?) \1+ ) $ 

Assuming that the anchors match, the inside portion is two branches:

1? | (11+?) \1+

It’s one of these:

1?
(11+?) \1+

The first one is easy. It’s zero or one characters, and nothing more. That matches the trivial condition where the input number is 1. If it’s not exactly one character, the regex tries the other side of the alternation:

(11+?) \1+

There are two parts to this: the capture and the back reference. The back reference can match one or more times. So, it’s looking for complete groups, and nothing more, of the same thing. The actual thing we match hardly matter. It comes down being able to break up the string evenly into groups.

For example, I take the number nine. The string for that is a sequence of nine 1‘s:

111111111

Can I break up that string into equal groups with nothing left over? Sure; there are three groups of three:

111 111 111

The number of groups represent one factor, and the length of any one of the groups represents another factor. Since it finds two factors, the number must be composite, so not prime.

I can do it again with seven:

1111111

There isn’t a way to create equally-sized groups. The regex tries the longest string it can for the capture, then checks that it can match the back reference one or more times. It keeps failing:

111111 1
11111 11
1111 111
111 111 1
11 11 11 1

Thus, the pattern doesn’t match 1111111, so the number is prime because there’s not another factor.

That’s the basic idea, but the regex engine gets there in many more steps. I can see how it tries the longest first group, fails, and keeps backtracking. I ran this under Regexp::Debugger.

That’s all nice, but there’s nothing about bit vectors there. Oh well.