Reformatting Confucius with Regular Expressions

It was a fine spring afternoon in 1867. Mr. James Legge was just back from China and had settled back into his home in Clackmannanshire, Scotland. When he turned on his SteamBook Pro to check for mail coming in over the Intertubes, he was excited to see news from his publisher about his new popular edition of the Confucian Analects. Was the book already out, perhaps?

He growled in disappointment when he saw the one-line message: “CHAPTER NUMBERS ON SEPARATE LINES, PLEASE.”

He opened his manuscript (which you can download here) and saw what the publisher meant: the chapter headings were on the same line as the opening text of the chapter. They were formatted as, for example, CHAPTER I or as CHAP. II followed by a period and a space. For example: CHAP. III. The Master said, …

This was a silly request, he thought. Some of these chapters are only a few lines long and putting the headings on separate lines takes up too much space! But what could he do? What the publisher demands, the publisher gets.

How was he to complete this monumental formatting task on almost five hundred chapters before he left for Aberdeen the next morning? He couldn’t just do a search and replace on the word “CHAPTER” or “CHAP.” because that wouldn’t include the roman numeral for the chapter number, or the period that came after it. Fortunately for him, there was an old man in Malacca who had once taught Mr. Legge the ancient and mystical art of Regular Expressions! Using regular expressions, or regex as its practitioners call it, the task was done by the time his tea kettle started to boil.

To see how Mr. Legge was able to carry out this task so quickly, let’s walk through the regular expression he put together for the reformatting. Without using the command line, the easiest way to run a search and replace on a text file using a regular expression is in a text editor that supports regex. There are many text editors which do this. On the Mac some of the most popular include TextMate, BBEdit, and Coda. These all cost money, however, so if you don’t own any of them already, try downloading and installing a free one like TextWrangler.

Open Mr. Legge’s translation, which is a simple text file (plain text files are by far the easiest to work with), and choose “Find” from whatever menu it resides in. Depending on what program you are using, you will see an option in that window labeled something along the lines of “Grep” (the name of a command-line find utility that uses regex) or “RegEx.” When you turn this on, the world of regular expressions is open to you. You enter the regular expressions directly into the “find” and “replace” boxes as you would with a simpler search.

Let’s go over the regular expressions Mr. Legge used to complete his task. My purpose here is not to offer a full beginner’s guide to regex but to demystify it enough so that you might go on to learn more on your own. To find the chapter headings, Mr. Legge used: CHAP(TER)?\.? ([IVXL]+)\. and he then replaced them with: \r Chapter \2 \r\r

Regular expressions use a combination of literal characters and metacharacters to find what you are looking for in a body of text. Literal characters are actual characters you will find in the text you search. For example, “CHAP” is something I’m looking for in the text. Metacharacters are special characters which help identify patterns in the text or perform tasks. Think of them as the grammar of regular expressions. For example, the question mark ? character means, “what preceded me is optional.”

So, looking at the first part of my regular expression, you should now be able to make out that I’m searching for “CHAP” and optionally the additional letters “TER,” which I grouped together inside parentheses to indicate that the whole sequence is optional.

Finally, there is the escape character, which is the backslash \ character. It is used either to tell the regular expression to treat what is normally a metacharacter as a literal character, or to indicate special characters or objects. For example, the period “.” usually means “any character except a carriage return.” In order to indicate that I’m looking for a literal period and that I’m not trying to match any character, I precede it with a \. Putting this all together produces: CHAP(TER)?\.?

In English this means, “Find CHAP (and optionally the characters TER after it) followed by a period (which is also optional, since you won’t find it when chapter is not abbreviated).”
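If you would rather try this first part of the pattern outside a text editor, here is a minimal sketch using Python’s built-in re module. It is only an illustration, not the workflow described in this post; the helper name match_prefix and the sample strings are made up for the example.

```python
import re

# First part of the pattern only: literal CHAP, an optional TER group,
# and an optional literal period (escaped so it is not "any character").
prefix = re.compile(r"CHAP(TER)?\.?")

def match_prefix(text):
    """Return what the pattern matches at the start of text, or None.

    Hypothetical helper, named here only for illustration.
    """
    m = prefix.match(text)
    return m.group(0) if m else None

print(match_prefix("CHAPTER I. The Master said"))  # CHAPTER
print(match_prefix("CHAP. II. The Master said"))   # CHAP.
print(match_prefix("Chapter one"))                 # None
```

The last sample fails because regular expressions are case-sensitive unless told otherwise, so “Chapter” does not match “CHAP”.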

The next part of the regular expression uses something called a character class. A character class indicates a set of characters which you find acceptable in the text you search. In this case, Mr. Legge knows that after CHAP. or CHAPTER there will be a roman numeral, which consists of one or more I, V, X and, though he doesn’t remember if any books had anywhere close to 50 chapters, perhaps an L character. To indicate this in a regular expression we put the possible characters inside brackets [IVXL]. The + character that follows it means, “match one or more of the preceding,” so it will match I, or II, or IV, or XIV without difficulty. Finally, I add one more escaped period character to indicate a literal period, which also appears to follow every chapter number.
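To get a feel for what the character class accepts, here is a small Python sketch. The function name looks_like_numeral and the list of candidates are assumptions made for this example only.

```python
import re

# One or more characters drawn from I, V, X, L, then a literal period.
numeral = re.compile(r"[IVXL]+\.")

def looks_like_numeral(text):
    """True if the whole string is [IVXL]+ followed by a period.

    Hypothetical helper, named here only for illustration.
    """
    return numeral.fullmatch(text) is not None

for candidate in ["I.", "XIV.", "XLVII.", "IIII.", "C.", "."]:
    print(candidate, looks_like_numeral(candidate))
```

Note that “IIII.” passes: the character class only checks which letters appear, not whether they form a valid roman numeral. For a manuscript with well-formed chapter numbers that is good enough.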

The parentheses play two roles. As we saw, in (TER) it grouped these together so that we could add a ? and indicate that it was all optional. I’m using the second set of parentheses in ([IVXL]+) for a different purpose. Every time you put parentheses around something it gets saved, in a numbered box if you like, so you can use it later. I can later refer to it by its number. In this case, I want to save the contents of the second pair of parentheses since they contain the chapter number, which will become available to me anywhere I put a \2.
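The numbered boxes can be inspected directly in Python, where each pair of parentheses becomes a numbered group on the match object. This is a sketch for illustration; the two sample lines are modeled on the headings described above rather than copied from the manuscript.

```python
import re

# The full search pattern: group 1 is (TER), group 2 is the roman numeral.
heading = re.compile(r"CHAP(TER)?\.? ([IVXL]+)\.")

abbreviated = heading.search("CHAP. III. The Master said, ...")
print(abbreviated.group(0))  # CHAP. III.   (everything that matched)
print(abbreviated.group(1))  # None         (TER was absent, so box 1 is empty)
print(abbreviated.group(2))  # III          (the chapter number)

spelled_out = heading.search("CHAPTER I. The Master said, ...")
print(spelled_out.group(1))  # TER
print(spelled_out.group(2))  # I
```

This is why the replacement uses \2 rather than \1: the first box holds “TER” (or nothing), and the second holds the numeral.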

So to recap, the whole regular expression to find the chapter headings throughout the manuscript is:

CHAP(TER)?\.? ([IVXL]+)\.

which in English means:

Find CHAP (and optionally the characters TER after it) followed by a period (which is also optional, since you won’t find it when chapter is not abbreviated), followed by a space, and then a roman numeral containing one or more I, V, X, or L characters, followed by a period. Save that roman numeral so I can use it later.

In the “Replace” box I should indicate what I wish to replace all the matched text with:

\r Chapter \2 \r\r

\r is a special character which means “carriage return.” Then notice that I have put \2. The \2 contains, as I mentioned above, whatever was found within the second set of parentheses of my search term, in this case, each chapter number. If you run this regex on Mr. Legge’s text file, you should get nice chapter number headings, each with at least one empty line before it and two line breaks after it.
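For readers who prefer a script to a text editor, the same search and replace can be sketched in Python with re.sub. Two assumptions to flag: I use \n (newline) where the editor replacement uses \r, since that is the usual line break in Python strings, and the sample line is invented for the example rather than taken from the manuscript.

```python
import re

heading = re.compile(r"CHAP(TER)?\.? ([IVXL]+)\.")

# Same shape as the editor replacement, with \n standing in for \r (assumption).
# re.sub turns \n into a newline and \2 into the saved chapter number.
replacement = r"\n Chapter \2 \n\n"

sample = "CHAP. III. The Master said, ..."
result = heading.sub(replacement, sample)
print(result)
```

The space that followed the heading’s period is not part of the match, so it stays at the start of the chapter text. If that matters for your file, you could add a space to the end of the search pattern.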

As seen in this example, a regex is often incredibly helpful when there is a consistent pattern to be found across one or more large text files which you want to replace with something else. It would have been a long night for Mr. Legge if he had to manually format all those chapter headings. To learn more about regex, which I argued last week will likely become part of basic digital literacy in the future, you can read more here. Once you are up and running, you may find this handy cheat sheet useful.

Next week I’ll see if we can use regular expressions to identify the sections in the sagas of the Heimskringla which mention women.

Photo by Jake von Slatt at the Creative Commons licensed SteamPunk Workshop
