Many of the students and scholars I know who have picked up technical skills in the world of the command line (see Lincoln’s introduction and a series of posts here at Profhacker), or who have tried their hand at programming, came to what they know through tinkering. Some new way they want to analyze their sources, discover interesting patterns, organize their stuff, or automate their tasks supplies the justification they need to carve out some time to learn by playing. Tinkering leads to googling, and googling leads into a world of obscure documentation, endless forum posts, and tutorials usually aimed at a very different audience with different needs. This adds significantly to the time it takes to figure things out.
One of the earliest and most consistent exceptions to this, in my own learning, has been the tutorials of William Turkel, especially his blog entries and his important work on the Programming Historian project. They not only introduce some really powerful utilities and code snippets, but apply them immediately to the kinds of tasks that historians, and the humanities more broadly, might find useful.
2013 offered a particularly rich harvest of tutorial material by Turkel on his blog, especially contributing to what he calls a "workflow for digital research." Most of these help you obtain, clean, and analyze textual sources. As with most things technological, there are many different ways to perform most of the tasks listed below, but I found that these postings give great practical examples of some of the core techniques of using the command line for manipulating texts. I’d like to highlight just a few of them and suggest why you might want to give them a try.
Almost all of Turkel’s tutorials this year work from the command line. If you use a Mac with OS X, you already have access to a lot of command-line utilities, and many others can be found and installed using Homebrew or the respective websites for the tools you want. This is not always the case, however, and for the "permuted term index" utility mentioned in one of the text analysis postings below I wasn’t able to find a way to get it for OS X (tips welcome). A solution to this problem, and also for Windows users, is to set up a virtual machine that runs a Linux distribution like Debian. Turkel’s posting goes through the whole process step by step and will get you up and running. Also see Lincoln’s posting here at Profhacker.
A virtual machine is also very handy as a self-contained sandbox when you want to tinker. The free VirtualBox software used here is very easy to use, and if you participate in the ArchiveTeam Warrior program, you probably already know how it works. For those working with security-sensitive materials, you can also easily keep a virtual machine and its files encrypted.
This is a great intro to some of the most useful command-line utilities for very basic text analysis. Working with an example from Project Gutenberg, the tutorial uses "wget" to download the file, shows you how to use "head" and "tail" to quickly see the beginning and end of large files, "sed" to "crop" a header or footer, "wc" to get basic text statistics, "grep" to search the text for things you are interested in, and "tr" to clean a text and prepare it for analysis by removing punctuation, capital letters, and so on, and then the "sort" and "uniq" commands (covered in earlier Profhacker posts here and here) to get word frequencies.
This posting on pattern matching taught me some great tips on how to use the "grep" command when you have a handwritten document with difficult-to-read words of which you can only make out a few letters. It also shows you how to color matched patterns with "egrep" and how to use "fgrep" to isolate words in a text that are not found in the dictionary. This is handy when you are looking for unusual terms, proper nouns, or potential mistakes in optical character recognition. The posting also shows how to use the "ptx" (permuted term index) command, which I had never heard of, to quickly create a concordance from a text.
This posting is more advanced and requires some scripting. Turkel often used the Python programming language in his earlier postings, but in all of these postings he uses Bash scripts, which are really just little sequences of regular commands you can issue on the command line (in the Bash shell), with some added flow control and logic to handle repetition and the like.
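For a sense of what such a script looks like, here is a toy example (the filenames and contents are invented): ordinary commands wrapped in a loop, which is most of what these scripts amount to.

```shell
#!/bin/bash
# A toy Bash script: regular command-line commands plus a "for" loop.
printf 'one two three\n' > a.txt
printf 'four five\n'     > b.txt

# Report a word count for each file in turn.
for f in a.txt b.txt; do
  echo "$f has $(wc -w < "$f") words"
done
```

Saved to a file and made executable with `chmod +x`, this runs like any other command.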
In this posting Turkel uses "wget" to download a batch of files, the "split" command to split a large file into smaller ones, and a simple web indexing package called "swish-e" to build an index of your sources and make searching them easier.
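The "split" step can be sketched on its own (the indexing with swish-e is best left to the posting itself). The sample file below is generated for illustration; in the posting, "wget" downloads the real material.

```shell
# Generate a sample "large" file of 10 numbered lines.
seq 1 10 > big.txt

# Split it into pieces of 4 lines each, named part-aa, part-ab, ...
split -l 4 big.txt part-

# Each piece is now small enough to inspect or index separately.
wc -l part-*
```

Ten lines split four at a time yields three pieces, the last holding the two leftover lines.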
Building on the last posting, we now work with the Java-based Stanford Natural Language Processing software, and Turkel shows us how to find a list of potential people, places, and organizations in our text source.
This posting shows how you can use the free Tesseract OCR software on the command line, using an example of some typed correspondence from the early 20th century. Another great section in this posting shows how to do "fuzzy match" searches of a text using tre-agrep (I had trouble getting this to work on OS X, so try it in the VirtualBox Linux install instead).
We have talked a bit about working with PDFs on the command line here before. See, for example, Lincoln’s post on fixing PDFs using pdftk. This post by Turkel offers an introduction to a broader range of command-line utilities for PDFs, including "xpdf", "pdftk", "pdfimages", and "pdftotext" for extracting text and images from PDFs, and the creation of new PDFs with ImageMagick’s "convert" tool.
Are there other great tutorials teaching useful but challenging techniques and tools that you have come across in the past year? Consider sharing them in the comments.