Peter M. Aarestad

The personal site of Peter Aarestad.

View on GitHub
26 February 2015

Formal Languages, part 2: Regular Expressions and Regular Languages

by Peter Aarestad

Before I start in on this entry, I need to give credit where credit is due and note that my main reference source for these blog posts on formal languages is my grad school textbook, An Introduction to Formal Languages and Automata, Third Edition by Peter Linz (the 5th edition is linked). If you’re out there, Peter, thanks for writing an awesome intro to this nerdy topic, and congratulations on having a great first name. :)

So let’s talk about regular expressions. Again, we’re not talking about the Perl version with lazy matching and look-ahead and look-behind and all that magic; we’re discussingaustere, formalized regular expressions. The idea is that a regular expression matches a set of strings, and those strings make up aregular language. (This is jumping the gun a bit because we actually define regular languages using finite automata, but we’ll come back to that later.) For formality’s sake, we first note that the empty set ∅ and the empty string λ themselves are regular expressions; the former is, of course, a set of zero sentences, while the latter is a set of one sentence, namely the empty string. In addition, any membera in a given alphabet Σ is also a regular expression of a single sentence one symbol long. We then add on concatenation (r1·r2, or more simply, r1r2), alternation (r1 + r2, which is expressed in Unix and Perl REs as r1 | r2, and means “r1 or r2”) and star-closure (r*, i.e. zero or morers in a row), where the rs are composed of the primitive regular expressions ∅, λ, and all a ∈ Σ. (We almost always ignore the ∅ and λ, however, and just talk about combinations of symbols in the alphabet Σ.) We can also group sub-expressions with parentheses. So this way, we can say things like “a language with an even number of as followed by an odd number of bs”, which would be expressed as:

r = (aa)(bb)b

To break that down, we first see that (aa)* represents any number of aas: 0, 1, 2, etc., and so gives us an even number of as. The second part, (bb)*b, indicates that we again have an even number of bs, but then there is one extra b at the end, giving us an odd number. We can give many examples of regular expressions: “all bit strings (i.e. strings of 0s and 1s) which has at least one pair of consecutive 0s”, “all strings on an alphabet such that the length is divisible by 3”, “all bit strings that, when interpreted as a binary number, is greater than or equal to 39”, and so on. So it seems like we can define quite a few complicated languages using just this simple way of expressing patterns! However, remember the example from part 1:

/^1?$|^(11+?)\1+$/

This isn’t a true regular expression, as we noted - you can’t do things in regular expressions like “lazy matching” or “back-referencing”. All we can say is “this followed by that”, “this or that” and “zero or more thises”. But maybe it’s possible that there is a regular expression that does what the above does, namely match all composite numbers expressed in unary notation. Try if you’d like, but you’ll soon see it’s futile. In another post, I’ll explain the goofily-named pumping lemma which is used to prove that a particular grammar does not represent a regular language, and thus cannot be represented by a regular expression. We’ll then use the pumping lemma to prove that we can’t construct a regular expression for this language.

I think that’s enough for now. Next time, we’ll discuss finite automata and how they’re used to truly define regular languages. As a preview, we’ll see that there are two types: deterministic automata which require that you always know what to do next, and non-deterministic automata which allows “guesses” as to how to proceed. The magic part, as we’ll see, is that they are equivalent! But you’ll have to come back later to see what I’m talking about. :)

tags: