ProfHacker: Why Not Spare a Little Bandwidth for the Archive Team?

Teaching, tech, and productivity.

By  Konrad M. Lawson
February 3, 2014

Do you remember SETI@home? It was the first really widely loved example of a crowdsourced distributed computing project. You install a cool-looking screensaver, and in the background your computer crunches data on behalf of the noble cause of finding aliens in space. There are now many projects that take advantage of large networks of home computers to carry out tasks. The distributed “mining” of the virtual currency Bitcoin is another recent example from the news.

The distributed computing project that is perhaps closest to my heart these days is the Archive Team Warrior project, with Jason Scott as its spokesman, which helps archive the public content of large web services before they are buried in their digital graves. Archive Team's first great coup came in 2010, when it released a torrent file for downloading an archive of GeoCities, where a good chunk of the internet resided in the early days.

I first found out about their activities while working on the Digital Archive of Japan’s 2011 Disasters, which included a large-scale web archiving element carried out in cooperation with the Internet Archive. It is hard to appreciate how much of the open web disappears in just a few months. Watching this process unfold up close after the 2011 disasters in Japan made me realize how monumental the challenge will be for future historians to capture some of the quieter corners of the net. These corners, especially small-scale projects and local communities, constitute a unique and rich heritage.

It is tempting to dismiss this kind of archiving, both because some of the services Archive Team preserves, like GeoCities, were probably home to most of the ugliest web pages and most ridiculous content on the open web of their time, and because we might well argue that not everything should or needs to be saved (one should also mention the evolving debates on proposed and existing rights to be forgotten). There is an antiquarian attraction to this kind of activity that might be most tantalizing to the obsessive collector. Even so, my work as project manager for Jdarchive.org showed me not only that there is huge value for historians in web archiving, but also that most of us are completely powerless when the services to which we entrust content for public sharing shut down (it is important that we distinguish this from our confidential and private content!). This is increasingly the case as the web evolves from static web pages that can be easily captured to content that is delivered dynamically through complex platforms. There is no turning back once a company has decided to close its doors and take down and delete all of its content.

For simple websites or small groups of websites, working with the Internet Archive, using Archive-It, or running your own web scraper are all options. However, large services are worth handling in a more methodical way. When a major service with a lot of public-facing content announces that it is going down, as GeoCities, MobileMe, .Mac, Posterous, Yahoo Blogs, and Google Reader (its historical feed data, directory, and statistics) did, coders who are part of the Archive Team develop scripts to facilitate the crowdsourced scraping of its public content. The scraping software also throttles downloads of the material to avoid overloading the dying service.
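To make the idea of throttling concrete, here is a minimal sketch in Python of what "limiting how fast you download" can look like. It is not Archive Team's code: the `Throttle` class, the `fetch_all` helper, and the two-second interval are all hypothetical names and values chosen for illustration. The only idea taken from the post is that a polite scraper waits between requests.

```python
import time
from typing import Callable, Iterable, List, Optional


class Throttle:
    """Enforce a minimum delay between successive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request, if any

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    throttle: Throttle,
) -> List[str]:
    """Call fetch(url) for each URL, pausing between calls as the throttle requires."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

In use, you would pass your own download function as `fetch` and something like `Throttle(2.0)` to leave at least two seconds between requests. The clock and sleep functions are parameters only so the pacing can be checked without actually waiting.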

The way we can get involved is to install VirtualBox, mentioned in the first tutorial by William Turkel that I introduced last week, and use it to install the “Archive Team Warrior” (a custom installation of Debian Linux). This allows the Archive Team to distribute web scraping tasks to your computer while it is running. To control its behavior, a simple local web interface is made available to you (accessible at http://localhost:8001/ when Archive Team Warrior is running), where you can choose the scraping project you want to be a part of and see exactly when and how much it is downloading and processing. At the time of writing, it seems the web services currently active in the Archive Team Warrior software have already been shut down, but there are plenty of services on the “Deathwatch” that the Archive Team will be keeping an eye on in the coming years.
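If you want to confirm from a script that the Warrior's local web interface is up, a small check along the following lines should do. This is only a sketch: the function name `warrior_is_running` and the three-second timeout are my own choices, and the code assumes nothing about what the interface returns beyond it answering an ordinary HTTP request at the address given above.

```python
import urllib.error
import urllib.request

# Address of the local web interface as given in the post.
WARRIOR_URL = "http://localhost:8001/"


def warrior_is_running(url: str = WARRIOR_URL, timeout: float = 3.0) -> bool:
    """Return True if something answers HTTP requests at the given URL.

    Hypothetical helper: it only checks that a web server responds
    successfully, not that the responder is really the Warrior.
    """
    # The interface is local, so skip any system-wide proxy settings.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    if warrior_is_running():
        print("The web interface is answering at", WARRIOR_URL)
    else:
        print("Nothing is answering at", WARRIOR_URL)
```

If the check fails, the virtual machine is probably not running yet, or it is still starting up.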

Have you ever participated in a distributed computing project? Have you had the experience of losing access to content before you could save anything when a company went down? Please share in the comments.
