Ever hear how they talk about this being the world of “big data”? Ever notice how those fresh young grad students in the social science and humanities departments are playing around with huge collections of digital stuff just sitting on their hard drives?
Where did they get it all from? Did they just press “Download Data” on some web site? Or get passed a USB drive with a ton of files on it? Or did they sit on some cool database and painstakingly copy and paste text, download PDFs page by page, or manually save images they came across?
Maybe. But some of them are apprentices in the Dark Arts. They can animate their computers to work like a floating broom on the internet, sweeping up information that is useful to them.
This sorcery I speak of is web scraping - the automated extraction of targeted content from websites. To many webmasters, digital archivists, and content providers, it is a wicked practice worthy only of rogues. To others, it is a powerful research tool to easily and quickly assemble a collection of material to study for patterns or sift through offline for items of value. As with all powerful tools, it must be handled with care.
Good scrapers require programming skills to develop, and are carefully tailored to the target. They don’t just “grab” files, but filter and “scrape” out the desired information. The more respectful ones work slowly, so as not to overwhelm a server, and selectively, downloading only what is needed. The most devious ones emulate the behavior of human browsing habits in order to conceal their identity when they scurry about behind paywalls. The least subtle are the applications which simply download everything on a whole domain.
Curl
The most basic tool in a web scraper’s toolbox does not require any programming skills and can be found on the command line (see Lincoln’s introduction to the command line here) of every Mac OS X and Linux machine. I believe you can install it for Windows as well, but Ms. Google will have to help you with that. Curl is basically a command that goes and grabs something from the internet for you.
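In its most basic form, it looks something like this (the address here is just a made-up placeholder):

curl "http://example.com/page.html" -o page.html

That fetches whatever lives at that address and saves it as page.html in the folder you ran the command from.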
Curl (and the popular alternative wget) is particularly handy when you want to save a range of things from the internet whose URLs follow a consecutive number sequence. For example, let’s say I want to download a bunch of statistics about wages in Norway. On this page you can find 131 tables of data from the Norwegian government: NOS D 362: Wage Statistics 2005.
Now, I could open each of those 131 tables and save them one at a time, but I noticed that their URLs follow a simple sequence. The first table file is called tab-001.html, the next tab-002.html, and the third tab-003.html.
This is the kind of thing that curl is perfect for. Open up your terminal and in a single command we can grab all the tables and save them offline at once:
curl "http://www.ssb.no/english/subjects/06/05/nos_lonn_en/nos_d362_en/tab/tab-[001-131].html" -o "#1.html"
Notice that I put the range of numbers, including leading zeroes, inside of brackets: [001-131]. Curl will grab every file in that range. The -o flag tells curl to save the data from each URL into a file. The file name that follows it includes #1, a placeholder that gets replaced by whatever number was inside the brackets for that particular file. The first file will be saved as 001.html, the second as 002.html, and so on.
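Curl can even handle more than one range at a time, and each set of brackets gets its own placeholder (#1, #2, and so on). So an invented URL scheme with a year folder and a numbered table could be grabbed like this:

curl "http://example.com/[2003-2005]/tab-[001-131].html" -o "#1-tab-#2.html"

Here #1 stands in for the year and #2 for the table number, so the saved files come out as 2003-tab-001.html, 2003-tab-002.html, and so on, covering every combination of the two ranges.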
This doesn’t have to be an HTML file. Let’s say I want to download a few dozen photos of forklifts that I found thanks to a quick Google search. Noticing that the site uses the same kind of sequential numbering, I can curl the photos in one command:
curl "http://forklift-photos.com.s3.amazonaws.com/[12-48].jpg" -o "#1.jpg"
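The pattern doesn’t even have to be numeric. Curl also accepts a comma-separated list inside curly braces, which is handy when the files follow a naming pattern rather than a number sequence (again, the URL is invented just to show the syntax):

curl "http://example.com/reports/report-{jan,feb,mar}.pdf" -o "report-#1.pdf"

That saves report-jan.pdf, report-feb.pdf, and report-mar.pdf.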
If you want to slow your download down a bit, and thus be less of a burden on the server you are downloading from, you can set a rate limit. For example, to download at a maximum of 10 kilobytes per second, you could take the forklift command from above and add one small flag:
curl --limit-rate 10k "http://forklift-photos.com.s3.amazonaws.com/[12-48].jpg" -o "#1.jpg"
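Applied to the wage tables from earlier, the whole, politely throttled download would look like this:

curl --limit-rate 10k "http://www.ssb.no/english/subjects/06/05/nos_lonn_en/nos_d362_en/tab/tab-[001-131].html" -o "#1.html"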
For more on the world of web scraping, I’d recommend learning a language like Ruby or Python, learning how to use the web “parsers” available for your language of choice that make it easy to extract pieces out of the websites you grab, and getting help from the steady flow of questions and answers on web scraping at Stack Overflow.
Image: Creative Commons licensed broom from Flickr user shikigami2011.