16 February 2015

Formal Languages, Part 1: Basic Concepts

by Peter Aarestad

Last week, there was a link posted on Hacker News that described a regular expression that only matches non-prime numbers (expressed as a string of “1”s, i.e. in so-called “unary” notation):

/^1?$|^(11+?)\1+$/

Commenters noted that this is not a real regular expression. But what does that mean? Anyone with experience in Perl regular expressions can see that this will clearly work just fine; the linked article does a good job of breaking down how this goes about matching (or not matching) a given string of 1s. You see, the history of Unix has sort of muddied the term “regular expression” to the point that it means something broader in practice than it does in theory. Originally, the old Unix command line tool grep was designed with classical “regular expressions” in mind. However, as is well known, classical regular expressions are very limited, and so a separate tool called egrep was written to accept so-called “extended regular expressions”, and eventually egrep’s notion of an extended RE became what we now call a “regular expression”.
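If you'd like to poke at the pattern yourself, here is a minimal sketch in Python rather than Perl. Python's re module accepts the same backreference syntax (the \1), so the pattern carries over unchanged. The helper name is_non_prime is made up for this example and is not from the linked article.

```python
import re

# The pattern from above, written as a Python raw string.
NON_PRIME = re.compile(r"^1?$|^(11+?)\1+$")


def is_non_prime(n):
    """Return True if n written in unary (n copies of '1') matches the pattern."""
    return NON_PRIME.match("1" * n) is not None


# Whatever the pattern rejects should be prime.
primes = [n for n in range(20) if not is_non_prime(n)]
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19]
```

The first alternative catches 0 and 1; the second one tries to split the string into two or more equal chunks of at least two 1s each, which only succeeds for composite numbers.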

So what is a “classical” regular expression? To explore that question, we need to look into the field called “formal languages”. Despite the name, the field has little to do with programming languages or spoken languages, although it is used in theoretical studies of these kinds of language. Formal language study is the study of patterns, in a way. It asks: what is the simplest pattern which can describe this (usually infinite) set of strings? So to understand what a regular expression is, we need to set down some definitions first that will allow us to talk about formal languages in general. Once we do that, we can talk about the family of so-called “regular languages” that regular expressions encompass, and go beyond that to define more powerful languages.

Basic concepts of formal languages

A language is just a set of strings that are constructed from an alphabet. An alphabet is simply a finite set of symbols such as ‘a’, ‘b’, ‘c’, etc. Traditionally, an alphabet is denoted with the symbol Σ. Strings are composed of zero or more symbols from the alphabet. The string with zero symbols, the so-called empty string, is traditionally denoted by λ.

Since strings can contain many consecutive copies of a particular symbol, we can use a shorthand to represent repetition:

aⁿ = aaaaaa… (n times)

For n = 0, we define

a⁰ = λ

When starting with an alphabet Σ, it’s useful to talk about the possible strings it can make. Σ^* is the set of all strings consisting of zero or more symbols from Σ. So say that our alphabet was very simple: it just consisted of {a}. In this case,

Σ^* = { λ, a, aa, aaa, aaaa, … }

or, more succinctly,

Σ^* = { aⁿ : n ≥ 0 }

Σ^* always contains λ; if we want to exclude it, we define

Σ⁺ = Σ^* - { λ }
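Since Σ^* is infinite, we can't list it in full, but we can list a finite slice of it. The following is a rough sketch in Python, assuming each symbol is a single character and using the ordinary empty string "" to stand in for λ. The names strings_up_to and LAMBDA are invented for this illustration.

```python
from itertools import product

LAMBDA = ""  # the empty string stands in for λ


def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for symbols in product(sorted(alphabet), repeat=n):
            yield "".join(symbols)


# String repetition mirrors the a^n shorthand, including a^0 = λ.
assert "a" * 3 == "aaa"
assert "a" * 0 == LAMBDA

sigma = {"a", "b"}
star = list(strings_up_to(sigma, 2))     # a finite slice of Σ^*
plus = [s for s in star if s != LAMBDA]  # the same slice of Σ⁺
print(star)  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(plus)  # ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```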

A language, then, is just a subset of Σ^* - perhaps one we can describe succinctly, perhaps not. A string that is a member of a language is called a sentence. Remember, these are formal, abstract terms, so don’t get too hung up on the everyday meaning of “language” or “sentence” here.

Finally, we have the concept of “grammars”. This isn’t exactly the kind of grammar you’re used to, although it’s quite close! Sentences in natural languages can only be constructed in certain ways; some (like English) are subject-verb-object, some (like Japanese) are subject-object-verb, and still others use different orders. So a basic grammar rule in English is expressed as:

[sentence] ⇒ [subject] [verb] [object]

Subjects and objects can be further defined as agglomerations of nouns and articles and such. In the same way, we define grammars with our languages and alphabets:

G = (V, T, S, P)

G is a grammar described as a combination of three sets and one element:

V is a finite set of variables; it must contain at least one. These are usually denoted with capital letters such as S, A, B, etc.
T is a set of terminal symbols. This is almost always the alphabet Σ that we are already familiar with.
S is a single member of V, referred to as the start variable; by convention, it's usually just the letter S. This variable is assumed to be, naturally, the start of any derivation using our grammar.
P is a set of productions similar to our grammar rule above. Productions can contain any number of variables and/or terminal symbols on either side of the arrow, though we specify that the left side contains at least one symbol, while the right side can be empty (i.e. λ).
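To make the four-part definition a little more concrete, here is one way it might be written down in Python. This is only a sketch under some assumptions: every symbol is a single character, λ is the empty string "", and the names Grammar and check are invented here. The rule that V and T share no symbols is the usual textbook convention rather than something spelled out in the list above, and the tiny grammar at the end is just a placeholder.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    V: frozenset  # variables
    T: frozenset  # terminal symbols
    S: str        # start variable
    P: tuple      # productions as (left side, right side) pairs of strings


def check(g):
    """Raise ValueError if g breaks one of the conditions on a grammar."""
    if not g.V:
        raise ValueError("V needs at least one variable")
    if g.V & g.T:
        # Assumption: the standard convention that V and T are disjoint.
        raise ValueError("V and T must not share symbols")
    if g.S not in g.V:
        raise ValueError("S must be a member of V")
    for lhs, rhs in g.P:
        if not lhs:
            raise ValueError("the left side of a production cannot be empty")
        if not set(lhs + rhs) <= g.V | g.T:
            raise ValueError("productions may only use symbols from V and T")


# A placeholder grammar: one variable, one terminal, two productions.
g = Grammar(frozenset("S"), frozenset("a"), "S", (("S", "aS"), ("S", "")))
check(g)  # passes silently
```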

A grammar thus defines a language: the set of all sentences (strings of terminal symbols) that can be produced from the start variable S by some chain of productions in P. Such a chain is called a derivation. We denote this language as L(G), the language derived from the grammar G.

To wrap up, here’s a simple grammar:

G = ({S}, {a, b}, S, P)

where P is given by the two productions:

S ⇒ aSb
S ⇒ λ

We can derive the sentence “aabb” like so:

S ⇒ aSb ⇒ aaSbb ⇒ aabb
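The same derivation can be replayed mechanically. Here is a small sketch in Python, assuming productions are stored as (left side, right side) string pairs with "" standing in for λ; the helper name apply_production is made up for this example.

```python
# The two productions of the example grammar; "" stands in for λ.
P = [("S", "aSb"), ("S", "")]


def apply_production(form, production):
    """Rewrite the leftmost occurrence of the production's left side."""
    lhs, rhs = production
    if lhs not in form:
        raise ValueError(f"{lhs!r} does not occur in {form!r}")
    return form.replace(lhs, rhs, 1)


form = "S"
steps = [form]
# First rule twice, then the second rule once.
for index in (0, 0, 1):
    form = apply_production(form, P[index])
    steps.append(form)

print(" ⇒ ".join(steps))  # S ⇒ aSb ⇒ aaSbb ⇒ aabb
```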

By applying the first rule 0 or more times, followed by the second rule, it’s pretty easy to see that this grammar defines a fairly simple language:

L(G) = { aⁿbⁿ : n ≥ 0 }
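“Pretty easy to see” is not a proof, but we can at least check the claim by brute force for short sentences. The sketch below does a breadth-first search over everything derivable from S. It relies on an assumption that holds for this grammar but not for grammars in general: no production ever removes a terminal, so any form that already has too many terminals can be thrown away. The function name sentences is invented for this example.

```python
from collections import deque

VARIABLES = {"S"}
PRODUCTIONS = [("S", "aSb"), ("S", "")]  # "" stands in for λ


def sentences(max_len):
    """Collect every sentence of L(G) with at most max_len symbols."""
    found = set()
    seen = {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if not any(v in form for v in VARIABLES):
            found.add(form)  # no variables left, so this is a sentence
            continue
        for lhs, rhs in PRODUCTIONS:
            start = form.find(lhs)
            while start != -1:
                new = form[:start] + rhs + form[start + len(lhs):]
                terminals = sum(1 for c in new if c not in VARIABLES)
                # Prune: terminals never disappear in this grammar.
                if terminals <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = form.find(lhs, start + 1)
    return found


expected = {"a" * n + "b" * n for n in range(4)}
print(sentences(6) == expected)  # True
```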

So now we can talk about formal languages. Next time, we’ll actually get into regular languages, although along the way we will have to detour and talk about “automata”. Robots? Well, not really. More like a robot that defines a language. :)
