Coming to an Elementary School Near You: Regular Expressions

It may not come this year, or next year, or even in the next five years, but I hereby predict that the art of regular expressions, or its future equivalent, will become one of the basic literacy skills taught in elementary or, at the latest, in the early secondary school curriculum. What’s that? You’ve never heard of “regular expressions” or regex? A regex is a pattern that describes some piece of text; more specifically, the term refers to the language used to identify such patterns in a larger body of text. The art of regex is the art of finding things in the ocean of the digital, and very often, manipulating what you have found.

Over the years a set of syntax rules for regular expressions has emerged. These rules are implemented in many programming languages and computer commands with varying degrees of standardization, and much simpler relatives of them are found in almost every decent search engine you have ever used. When you tell a search engine that you want “Walter Benjamin” AND Arcades, you have created, in spirit at least, a very simple pattern of this kind. Usually, however, when we refer to regex, we are referring to variations of the rich and highly compact language of regular expressions that was popularized by the Perl programming language and is now found in most languages that came after it. It can be as simple as ^\t to find a tab located at the beginning of a line, or as complex as \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b to find most email addresses when the match ignores case (Jason once linked a posting here to this old but still handy guide).
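For readers who want to see those two patterns do something, here is a minimal sketch in Python. The sample text and addresses are made up for illustration, and the two flags are my assumptions about how the patterns are meant to be used: the multiline flag so that ^ applies to every line, and the ignore-case flag because the email pattern only lists capital letters.

```python
import re

# Made-up sample text: two of the three lines begin with a tab.
text = "\tIndented line\nNot indented\n\tAnother indented line"

# ^\t finds a tab at the beginning of a line. re.MULTILINE makes ^ match
# at the start of every line, not only at the start of the whole string.
tab_count = len(re.findall(r"^\t", text, flags=re.MULTILINE))

# The email pattern from the paragraph above. It lists only A-Z, so we
# compile it with re.IGNORECASE to match lower-case addresses as well.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Hypothetical addresses, used purely as test data.
sample = "Write to walter@example.org or arcades.project@example.com today."
emails = EMAIL.findall(sample)

print(tab_count)
print(emails)
```

Note that the {2,4} at the end limits the final part of the domain to between two and four letters, so this pattern will miss addresses under longer top-level domains.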

“Ya ya,” I hear some of you saying. “I’ve heard of this, but regular expressions are just something the programmers, geeks, and really heavy duty data mining people do.” To which I would be tempted to reply, “You may return to your sandbox over there in the corner while the kids pass you by.” No, it’s nothing like that. Search technologies will continue to improve and provide simple interfaces for finding information. However, there will also be an ever greater need for the power search. As a result, what is now seen as something of a niche skill will eventually become an important part of basic digital literacy.

I’m not talking about the digital humanities. I’m talking about all of us, or at least, most of us. If you deal day in and day out with a lot of digital texts or have a lot of files to organize on your computer, I’m talking about you. I would like to propose that there are four general categories of tasks or reasons why regular expressions fit naturally in the curriculum for the children of a digital age and would encourage everyone to learn more about them:

1. Finding and manipulating files and database records – For most things, our computers can find files easily, and our databases turn up the records we want. However, there are lots of tasks where having more power can save us lots of time, not just in finding files or records but in manipulating them. For example, just today I used regex to rename lots of files in one go. Let’s say you have a folder of Word files with ugly titles that follow a pattern: [some useless number] WklyMtngNts [date].docx. On a command line that has the Perl-based rename utility, I could type rename 's/[0-9]+ WklyMtngNts/Weekly Meeting Notes/' *.docx and I’ve deleted all the unwanted numbers and made the text more readable, while leaving the date untouched.

2. Finding and replacing text within our files that fits certain patterns – We can search and replace in almost every application. Sometimes, however, you want to find things that are not exactly like your search: variant spellings, a phrase that may or may not include a certain word, all the numbers at the beginning of a line, or all dates written in a certain format, and then replace them with something else, or perhaps merely reorder what you have found. Regex lets you do this quickly and easily. This famous xkcd cartoon has really become the banner for the power of regex in this regard.

3. Using regular expressions in the study of large text corpora – This is where the data mining and more quantitative work of the digital humanists come in. Using regular expressions is a powerful way to find patterns across large collections, either to help you think of your material in new ways and generate new questions (discovery), or more controversially, to carry out analysis on the texts based on the findings.

4. Understanding search or, “Doing the math by hand” – When you plug a few words into a search engine, especially when you are searching a smaller collection that you are familiar with (your email, a pdf, a personal research database), the search engine doesn’t always turn up results you expect it to show. Teaching students from a relatively young age how regular expressions or, more broadly, how database queries work is a great way to get them to build more powerful searches on their own, whatever database they come into contact with.
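The first two tasks in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names, the dates in them, and the sample sentence are invented, the helper names tidy_name and reorder_dates are hypothetical, and the date pattern assumes dates written as month/day/year.

```python
import re

def tidy_name(filename: str) -> str:
    """Drop the leading number and expand the abbreviation; the rest stays as-is."""
    return re.sub(r"[0-9]+ WklyMtngNts", "Weekly Meeting Notes", filename)

def reorder_dates(text: str) -> str:
    """Rewrite dates assumed to be month/day/year as year-month-day."""
    # Each pair of parentheses captures one part of the date; \1, \2 and \3
    # in the replacement refer back to those parts in a new order.
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", r"\3-\1-\2", text)

# Hypothetical file names following the pattern described in item 1.
names = [
    "0042 WklyMtngNts 2012-03-01.docx",
    "17 WklyMtngNts 2012-03-08.docx",
    "budget.docx",  # does not match the pattern, so it is left alone
]
renamed = [tidy_name(n) for n in names]

# Hypothetical sentence for the reordering described in item 2.
reordered = reorder_dates("Met on 3/14/2012 and 11/02/2012.")

print(renamed)
print(reordered)
```

The sketch only computes the new names and does not touch any files. To rename files for real you would pass each old and new name to something like os.rename, ideally after checking the printed list first.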

If you use regex and agree it is a skill that ought to be taught widely, how do you think the case should be best made? What have you found it most useful for?

(Image: Zeef / Sieve, a Creative Commons licensed image from moosterbroek’s photostream)
