Why Not Spare a Little Bandwidth for the Archive Team?

Do you remember SETI@home? It was the first really widely loved example of a crowdsourced distributed computing project: you install a cool-looking screensaver, and in the background your computer crunches data on behalf of the noble cause of finding aliens in space. There are now many projects which take advantage of large networks of home computers to carry out tasks. The use of distributed computing for the “mining” in the virtual currency Bitcoin is another recent example from the news.

The distributed computing project that is perhaps closest to my heart these days is the Archive Team Warrior project, with Jason Scott as its spokesman, which helps archive the public content of large web services before they are buried in their digital graves. Their first great coup came in 2010, when they released a torrent file to download GeoCities, where a good chunk of the internet resided in the early days.

I first found out about their activities while working on the Digital Archive of Japan’s 2011 Disasters, which included a large-scale web archiving element carried out in cooperation with the Internet Archive. It is hard to appreciate how much of the open web goes down in just a few months, and watching this process unfold in close-up after the 2011 disasters in Japan made me realize how monumental the challenge will be for historians in the future to capture the quieter corners of the net, especially the small-scale projects and local communities that constitute a particularly unique and rich heritage.

It is tempting to dismiss this kind of archiving, both because some of the sites Archive Team preserves, like GeoCities, were probably home to the ugliest web pages and most ridiculous content of their time on the open web, and because we might well argue that not everything should or needs to be saved (one should also mention the evolving debates over proposed and existing rights to be forgotten). There is an antiquarian attraction to this kind of activity that might be most tantalizing to the obsessive collector. Even so, my work as project manager for the digital archive also showed me that not only is there huge value for historians in web archiving, but that most of us are completely powerless when the services to which we entrust content for public sharing (important that we distinguish this from our confidential and private content!) decide to shut down. This is increasingly the case as the web evolves from static web pages that can be easily captured to content that is delivered dynamically through complex platforms. There is no turning back once a company has decided to close its doors, take down its service, and delete all of its content.

For simple websites or small groups of websites, working with the Internet Archive, using Archive-It, or running your own web scraper are all options. However, large services are worth handling in a more methodical way. When a major service with a lot of public-facing content announces that it is going down, as happened with GeoCities, MobileMe, .Mac, Posterous, Yahoo Blogs, and Google Reader (its historical feed data, directory, and statistics), coders that are part of the Archive Team develop scripts to facilitate the crowdsourced scraping of its public content. These scripts also throttle downloads to limit overloading the dying service.
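To get a feel for what throttling means in practice, here is a minimal sketch of a rate-limited fetcher in Python. This is my own illustration, not Archive Team’s actual code; the class name and the one-second default delay are assumptions chosen for the example.

```python
import time
import urllib.request


class ThrottledFetcher:
    """Fetch URLs one at a time, pausing between requests so that a
    struggling or dying service is not overloaded. A simplified
    stand-in for the pacing that real scraping scripts perform."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self._last_request = 0.0  # monotonic timestamp of previous request

    def _throttle(self):
        # Sleep until at least `delay` seconds have passed since the
        # previous request, then record the current time.
        wait = self.delay - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()

    def fetch(self, url):
        self._throttle()
        with urllib.request.urlopen(url) as resp:
            return resp.read()
```

A real Warrior task does far more (retries, checkpointing, uploading results to a central tracker), but the core courtesy is the same: space out your requests so the archiving effort does not hasten the service’s demise.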

The way we can get involved is to install VirtualBox, mentioned in the first tutorial by William Turkel I introduced last week, and use it to run the “Archive Team Warrior” (a custom installation of Debian Linux). This allows the Archive Team to distribute web scraping tasks to your computer while it is running. To control its behavior, a simple local web interface (accessible at http://localhost:8001/ when the Archive Team Warrior is running) lets you choose the scraping project you want to be a part of and shows you exactly when and how much it is downloading and processing. At the time of writing this posting, it seems the web services currently active in the Archive Team Warrior software have already been shut down, but there are plenty of services on the “Deathwatch” that the Archive Team is keeping an eye on in the coming years.
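If you want to check from a script whether the Warrior’s control panel is reachable, a few lines of Python suffice. This helper is my own sketch, not part of the Warrior’s tooling; it assumes only the default port 8001 mentioned above.

```python
import urllib.error
import urllib.request


def warrior_is_running(url="http://localhost:8001/", timeout=2):
    """Return True if the Warrior's local web interface answers at `url`,
    False if the connection fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

With the Warrior running in VirtualBox, `warrior_is_running()` should return True; otherwise it returns False rather than raising, which makes it easy to drop into a startup script.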

Have you ever participated in a distributed computing project? Have you ever lost access to content when a company went down before you could save anything? Please share in the comments.
