The 'Archive Team' Rescues User Content From Doomed Sites

What happens when your favorite Web host decides to go out of business and ice the content from thousands of users like you? Does all of that data just disappear, never to be seen again? It can happen, easily.

Under current law, a cloud or hosting site has a pretty much unlimited right to decide whether the content that people put on its pages remains available or vanishes. And a site that chooses to delete content is under no obligation to preserve the data or even to give the data's contributors advance notice of when the purging will occur.

A typical hosting site's relevant terms of service make clear the host's claimed freedom from liability and its implied right to act unilaterally: "Your use of the [hosting site's] service is at your sole risk. [The hosting site] is not responsible for any and all files and data residing on your account on our servers. [The hosting site] does not maintain backup copies of customers web sites or e-mail. [The hosting site] cannot guarantee that the contents of a web site will never be deleted or corrupted, or that a backup of a web site will always be available. You agree to take full and sole responsibility for any and all files and data transferred to our servers and to maintain all appropriate backups of any and all files and data stored on any [hosting site] server to which you have an account on."

The freedom of Web hosts to eradicate information bothers Jason Scott a lot, and he says it's why he formed the Archive Team, a "loose, rogue band of data preservation activists." The Archive Team looks for hosting sites that are about to go down--like Apple's MobileMe right now--and then makes a furious, coordinated effort to rescue the data before it disappears into the ether.

About 250 people have been part of the Archive Team since its inception in 2009. Some members methodically download content from the target site, sometimes writing original scripts to do so. Others donate or locate available servers on which to store the rescued data.

We Trust No One

The Archive Team's efforts have a "hacker" vibe. "Archive Team's unofficial slogan is 'we trust no one,'" Scott says. "What this means is that the group wants to rescue as much endangered Web content as possible, then host or mirror it on as many Web servers as possible."

And they're not always on their best behavior in going about their work. From Scott's website: "To do this, they have been rude, crude and far outside the spectrum of polite requests to save digital history, and have used a variety of techniques to retrieve and extract data that might have otherwise been unreachable."

Most significantly, the Archive Team has gotten results. For instance, after Yahoo announced that it was shutting GeoCities down in 2009, the Archive Team rescued a terabyte of data from the hosting service, made a torrent file of the data, and put it up on Pirate Bay. Now scores of Pirate Bay users host the data.

The GeoCities Mission

The GeoCities project may be the Team's biggest caper. GeoCities opened for business way back in 1995, and invited people to make their own Web pages, which it hosted. When Yahoo bought GeoCities for billions of dollars in January 1999, GeoCities was the second or third most visited site on the Web, depending on whom you asked. When Yahoo shuttered the U.S. branch of GeoCities on October 26, 2009 (GeoCities' Japanese branch remains open, according to Wikipedia), the site was still the 218th most visited destination on the Internet. Obviously a lot of people were storing and tending their content at the site.

But Yahoo at the time was losing its taste for the user-generated content business, so it quietly announced that it was pulling the plug. Enter the Archive Team. Altogether it took 100 people roughly six months to download all of the site's content, but they got it done.

Initially, the Yahoo servers permitted Archive Team members to download only about 12 megabits of content per hour, Scott says. So the team checked to see whether the site was holding Google's bots to that same limit--and found that it wasn't. "So we all changed all our agents' names to 'not-the-googlebot,' and then they could get data out as fast as they wanted," Scott says.
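
For readers curious about what "changing an agent's name" involves, here is a minimal sketch in Python. It is not the Archive Team's actual script, which the article does not show. It only illustrates the general idea: a download tool identifies itself to a server through the User-Agent request header, and that header can be set to any string. The URL and the helper name build_request are made up for illustration, and the sketch builds the request without sending it.

```python
from urllib.request import Request

# Hypothetical values for illustration only. The URL is a placeholder;
# the agent string is the name Scott quotes.
EXAMPLE_URL = "http://example.com/some/page.html"
AGENT_NAME = "not-the-googlebot"


def build_request(url: str, agent: str = AGENT_NAME) -> Request:
    """Return a request object that announces itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": agent})


request = build_request(EXAMPLE_URL)

# urllib normalizes header capitalization, so look the header up case-insensitively.
sent_headers = {name.lower(): value for name, value in request.header_items()}
print(sent_headers["user-agent"])
```

Whether a server treats one agent name differently from another is entirely up to that server's configuration; the sketch shows only how a client states its name.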

Saving Digital History

GeoCities serves as a useful case study of why saving digital content from extinction is important. Scott says that the Internet has a "digital history" that should be stored for future study, and not simply discarded. "Yahoo found a way to destroy the most amount of history in the shortest amount of time," Scott said at the time.

Scott says there are practical reasons for preserving the data: GeoCities pages contained an enormous amount of user data, some of which could be valuable. Scott gives as an example a GeoCities user who was also an expert on a certain kind of '90s computer, and who stored all of his knowledge about that computer at his GeoCities site. That knowledge might not exist anywhere else, in which case it would simply have ceased to exist if the GeoCities sites had not been saved.

There's a personal aspect to preservation, too. Take the case of a mother who used her GeoCities page to create an online shrine to her son who was born in 1981 and died in 1983. How do you replace an online memorial when an unsympathetic host unceremoniously zaps it from the server? Details of the child's life and the parent's memories not recorded elsewhere would have been permanently erased.

Successes

GeoCities was just the start for the Archive Team. Scott and his people began getting word of other sites that were on the verge of closing and were threatening to take user data with them.

  • When Yahoo Video announced that it would be shutting down, the team captured and archived all 10TB of video on the site.
  • After the Italian Web host Splinder said that it would go dark last November, the team came in and rescued 1.3 million user accounts.
  • When Friendster decided in 2011 to shut down and delete user data, the team mobilized quickly and managed to capture about 20 million user accounts, amounting to about 14TB of data.
  • When FortuneCity said it would pull the plug on April 30 of this year, Scott and company sprang into action; they've downloaded roughly 1TB of data from the site.

The list goes on.

The Archive Team has sometimes succeeded in promoting preservation simply by getting people to think and talk about the harm that data deletion can cause. Case in point: Google's decision to shut down Google Video put thousands of user-uploaded videos in danger of deletion. But when the Archive Team began preparing for a massive download of the site's video holdings, Google caught wind of it; after reviewing the situation internally, the company reconsidered its plan and eventually reversed the decision to delete the video content. Instead, Google left the videos on its servers and gave the service's users the opportunity to convert their videos into YouTube videos.

"The idea of completely uncontrolled nontransparent hosting of user content really needs to come to an end," Scott says. "But until then we're duping stuff because the conversation otherwise ends."

Scott works for the Internet Archive, which archives movies, music, and software, and is also home to the Wayback Machine, which maintains an archive of Web pages going back to the earliest days of the Internet. Scott has been working to increase the size of the Internet Archive's shareware collection, which he says now includes roughly 1000 CDs of shareware.

The Internet Archive didn't have enough manpower to handle the kind of speculative data acquisitions that Scott wanted to do, so he pursues that activity as a side project. But Scott tells me that the Internet Archive is among the many organizations that have donated server space to host or mirror some of the data that Scott recovers from doomed websites.

Scott lives north of New York City, and splits his time between New York and San Francisco, where his employer Archive.org is located.
