Let's set the scene. You have a file full of website URLs (https://wwww.google.com
, etc.), and you want to grab only the ones that use https
, not the ones using http
. Great-- you Control+F, type in "https
", and your file editor picks out only those URLs starting with "https
". But what if you wanted to select only https
pages ending with .com
? Will your file editor search for both of those things simultaneously, or will you be forced to manually look at each each https
site it finds and check for that .com
ending yourself?
The problem can get worse. Suppose you have a file with URLs, some metadata (site title, timestamp from the last update, etc.), and an IP address (8.8.8.8
, 192.168.1.75
, etc.). How would you tell your file editor to select every single IP address? Questions like these are frequent and variable when you're doing data processing, general programming, input parsing, or just finding text on a page.
Since the mid-1960s, the programmer's solution to these questions has been regular expressions. We input a string of characters to specify what kind of data we want to match, and a regular expression parser (provided in most programming languages, code development environments, and some file editors) uses it to select the data we want.
To solve the https
/ http
problem, we could just use the regular expression "^https
"; if we want to get fancy, we can select the rest of the URL after that https
with "^https.*
". What about selecting https
and the .com
ending? We could use the regex "^https.*\.com$
". Even that IP address problem can be solved with a regex like "(\d{1,3}\.?){4}
". It's not an intuitive system, and it's not easy for anyone to read or write-- the joke goes that regular expressions create more problems than they solve:
There are some truly horrifying regular expressions out there, the type that make you want to curl up in a corner and rock yourself, but I'd like to show you my personal favorite. If you have children, hide their eyes. If you're squeamish, you may want to leave.
^(?!.?$|(..+?)\1+$).*$
What does that abomination do? Why, it matches prime numbers, of course!
I'm not joking. It matches a line of text if that line of text consists of one character -- x
, for example -- and if that character is repeated a prime number of times. It matches "xxx
", but doesn't match "xxxx
". It matches seventeen x's in a row, but it doesn't match sixteen. Go copy-paste it into this regular expression tester if you want to poke at it. Here's what it'll look like:
The interesting question is why this regex works, and that's what I want to talk about. There's a bit of meat on these bones, especially for the non-technical crowd. Let's break it apart in chunks.
The first chunk is the negative lookahead and selector, ^(?!
...).*$
. The ^
targets the start of the line, "position 0", before any characters. $
targets the end of the line, which falls right after the last character. .
targets any character (for example, b.ng
matches "bang
and "bing
", and bada b...
matches "bada bing
" and "bada boom
", and *
specifies we should target zero or more of the previous pattern. ^.*$
finds the start of a line and matches every character until the end of the line. That's how we would select everything on a line (or we could just do .*
).
What does that (?!
...)
part do? Let's start with its cousin, (?=
...)
, which "looks ahead" without matching anything, and only continues processing the rest of the regex if it sees the contents of the parentheses. For example, \d(?= dollars)
would match only the 5
in "I have 5 dollars
", and wouldn't match anything in "I have 5 euro
". By contrast, (?!
...)
is the "negative look-ahead", which also "looks ahead" without matching anything, and only continues processing the rest of the regex if it doesn't see the contents of the parentheses-- so \d(?= dollars)
wouldn't match "I have 5 dollars
", but it would match the 5
in "I have 5 euro
".
Let's put it all together with an example. ^(?!orange).*$
starts at the beginning of a line, looks ahead to see if the next few characters spell out "orange
", and if they do, it stops without matching anything. If they don't -- for example, if the next few characters are "apple
" -- it then grabs those characters and then every other character until the end of the line. This pattern selects everything but the lookahead target.
Why is this surrounding our prime-matching regex ^(?!.?$|(..+?)\1+$).*$
? It's a lot easier (and faster!) to match things that aren't prime numbers, so that's what .?$|(..+?)\1+
does: it matches composite (non-prime) numbers. Everything it doesn't match is a prime number, and we use our "everything-but" pattern to match those instead.
Let's look at the rest of our regex.
The |
symbol is used for the operation "or"-- it'll match either the stuff on the left or the stuff on the right. For example, apple|orange
will match either "apple
" or "orange
". In our prime-matching regex, we'll match either .?$
or (..+?)\1+$
. Let's pick apart what each of these sides will match on their own.
Start with .?$
. We know that .
will match any character, and we know that $
matches the end of the line. The only new character is ?
, which is similar to *
, except that it matches 0 or 1 of the previous regex pattern. We could use it to find both "apple
" and "apples
" with apples?
, for example. Now we can see that .?$
matches zero characters (the line is empty) or one character (maybe a single x
). Remember that we're trying to match non-prime things-- 1 isn't prime, and we want to ignore empty lines, so this is perfect. Now we just have to cover the rest of the non-prime numbers!
That's what the second part, (..+?)\1+$
, does. To pick this apart, let's begin with the (
...)\1+
. The parentheses define something called a subgroup. It doesn't contribute to the final match, but we can invisibly capture text with it and use that text as a variable in our regex. We reference subgroups in the order that they were created, right-to-left, as a backslash and then their number; for example, the first one is always \1
. An example might be helpful: (shakira)
won't visibly match any part of "shakirashakira
" because (shakira)
is a subgroup, but (shakira)\1
will match the first "shakira
" and (shakira)\1\1
will match the whole string.
We see that same structure in (..+?)\1+$
. We know that the first bit, ..+?
, is some subgroup. The +
is similar to *
-- it selects one or more of the thing in front of it -- so we can read \1+
as the captured subgroup repeated one or more times. We also know that the $
will match the end of the line. Therefore we're finding a subgroup and seeing if the rest of the string is repetitions of that subgroup. A similar regex would be (aaa)\1+$
matching "aaaaaa
" (the pattern repeated twice) but not "aaaaaaa
" (the pattern repeated twice with one "a
" left over).
So what's the deal with that subgroup pattern, ..+?
? We know that .
matches any character, and the next bit, .+
, will match one or more of any character. Together they match two or more characters.
That ?
means a different thing than we've seen previously. When applied to a pattern (a
, .
, pineapple
, etc.), it'll match 0 or 1 instances of that pattern, like it did in the .?$
part; but when applied to a quantifier (*
, +
, etc.), it makes that quantifier "lazy"-- that is, the quantifier selects the smallest number of characters that it can to meet the criteria. This overwrites the default behavior of quantifiers, which is to be "greedy", selecting as many characters as possible. For example, s.*g
would select swinging
from "swinging
", but we could add the "lazy" quantifier to make s.*?g
and grab only swing
. In the case of ..+?
, it'll try its best to match only 2 characters, then failing that, 3 characters; failing that, 4 characters; and so on until it matches something successfully.
The bright-eyed and bushy-tailed folks have spotted it already, but let's put it all together anyway, step-by-step. This pattern matches an empty line or a singular character on a line:
.?$
This tries to match two characters, then three characters, then four, and so on until it gets a match or exceeds the length of the line:
..+?$
This tries to match the whole line if its length is a multiple of two characters (but not just two characters), then failing that, a multiple of three characters (but not just three), then four, and so on until it gets a match or exceeds the length of the line:
(..+?)\1+$
This tries to match a line with length 0, 1, any multiple of 2, any multiple of 3, any multiple of 4, and so on. In other words, albeit with some inefficiencies, it matches non-prime numbers:
.?$|(..+?)\1+$
This starts at the beginning of the line, checks whether the line is a non-prime number, and if it isn't (i.e. it's a prime number), it matches the entire line:
^(?!.?$|(..+?)\1+$).*$
Hey, you did it! You can now match prime numbers with what appears to be a keysmash used as a swear word in a comic. This wizard knowledge is yours to keep forever. And hey, if you understood any of the examples or sub-steps, you're now better with regular expressions than 99% of the human population. Congratulations! Go make problems for yourself and other people.