Library of Congress Preserves Blogs

Blogs are being started and abandoned at volcanic rates. Nonetheless, bloggers are creating a massive chronicle of daily life, filled with stories and, of course, rants.

It's a potentially important record of our time for future generations -- one that the Library of Congress is interested in preserving. But as with other forms of digital data, the Washington-based library can't hope -- and, really, doesn't want -- to save all of the content being published in blogs, according to Laura Campbell, associate librarian for strategic initiatives.

Campbell, who received the 2007 EMC Information Leadership Award at this week's Computerworld Honors Program ceremony, also is director of the National Digital Library Program. Through that and other programs, the Library of Congress is working to collect and preserve so-called "born digital" data that originates on the Web and to digitize other information for online access, particularly in educational settings.

The library currently is managing about 295TB of digital data, not all of it collected from the Internet. Campbell, who is in charge of developing strategies for preserving the electronic data, said the library and its partners, including 12 other national libraries, are deliberately selective about what they choose to capture and store. "I don't think you would want to save most of what's produced," she said.

Personal blogs are part of the digital data mix in the library's collection. Campbell described that as a continuation of earlier data-collection efforts predating the Internet era. "We have the story of the common person at any time in history," she said, adding that the library also is collecting podcasts and information posted on social networking Web sites.

But there are limits to how much online data can be archived. "We are doing a sampling and go get blogs on certain subject areas that we have chosen and selected," Campbell said. "It won't be everything by any stretch."

Campbell said the library, which recently launched its own blog, has worked with its partners to develop software tools that can help automate the process of collecting material from the Internet.

As the collection work proceeds, improvements in the process are ongoing as well. "We're learning by doing," said Campbell, who described the current approach used by library workers as an iterative process of continuing assessments and adjustments.

This story, "Library of Congress Preserves Blogs," was originally published by Computerworld.
