While text-based search services such as Google and Microsoft's Bing now come close to consistently serving up what users seek, video search services remain inexact at best, said video archiving experts who spoke on a panel at last week’s WWW2010 conference.
Yet the panelists agreed that video search techniques must improve dramatically if people are to make use of the growing amount of video footage now stored on the Internet and elsewhere.
“If the material is searchable, it will be useful to the public,” said video archiving consultant Jackie Ubois, who moderated the panel during the conference held in Raleigh, North Carolina.
Hans Westerhof, director of the Images for the Future program for the Netherlands Institute for Sound and Vision, explained the urgency for developing better video search.
In 2005, the Institute started a program to digitize its vast video archive. About 280,000 hours of video and audio footage, including movies, television shows and news footage, will be digitized. About 100,000 hours of footage have already been converted, taking up 3 petabytes of storage space, and the archive is expected to grow to 14 petabytes by 2015.
The problem the Institute faces with all this video footage is making it easy to find. Many of the older source reels of film had little if any metadata, or descriptive data. Reels of old television programs, for instance, had just the barest amount of information, such as the title of the program and the date it was shown. No information was included about the content of the program.
“For the material to be useful, we need metadata,” he said. The act of creating metadata should be automated wherever possible. “Traditional cataloging does not work at this scale,” he said.
Right now, the Institute for Sound and Vision is looking at automated ways of extracting data from the video, using tools such as speech and image recognition.
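Such a pipeline might combine the outputs of several automated analyzers into a single searchable metadata record per clip. The following is only a minimal sketch of that idea in Python; the `transcribe` and `detect_objects` functions are hypothetical stand-ins for real speech-recognition and image-recognition components, not anything the Institute has described:

```python
# Sketch: merge outputs of automated analyzers into one metadata record.
# `transcribe` and `detect_objects` are hypothetical stand-ins for real
# speech-recognition and image-recognition tools.

def transcribe(clip):
    # Stand-in: a real system would run speech recognition on the audio track.
    return clip.get("audio_text", "")

def detect_objects(clip):
    # Stand-in: a real system would run object detection on sampled frames.
    return clip.get("frame_objects", [])

def extract_metadata(clip):
    """Build a searchable metadata record from automated analysis."""
    return {
        "title": clip.get("title", "untitled"),
        "transcript": transcribe(clip),
        # Deduplicate and sort detected objects for stable indexing.
        "objects": sorted(set(detect_objects(clip))),
    }

clip = {
    "title": "Evening news broadcast",
    "audio_text": "good evening here is the news",
    "frame_objects": ["desk", "presenter", "desk"],
}
print(extract_metadata(clip)["objects"])  # → ['desk', 'presenter']
```

The point of the sketch is that each analyzer contributes one field of the record, so new recognition tools can be added without restructuring the catalog.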
But for a variety of reasons, developing tools for automatically cataloging video is much harder than developing the tools used to tag text content.
Video, unlike text, can only be reduced to individual pixels, which offer no information about the video as a whole, said Paul Over, the project leader for a National Institute of Standards and Technology program to stimulate development of better video search. A block of text, on the other hand, can be reduced to a series of words whose definitions are known, and which can be analyzed to summarize the whole document.
Video has “no correlate to the word,” he said, making video harder to catalog.
“Video is not easy. It is hard to extract the structure,” said Marko Grobelnik, the program manager for the VideoLectures.net service on online lectures. “We still struggle with basic problems like object recognition.”
Jamie Davidson, product manager for search and algorithmic discovery at Google’s YouTube, noted that Google is also trying new algorithms to bring some context to the videos uploaded to the site.
For instance, software can determine if the video is a common event, such as a music concert, in order to help identify the content. It can make note of where the video was uploaded from, to allow users to narrow their searches to specific geographic regions.
But YouTube still faces challenges of search and classification, especially given the whimsical nature of many of its videos. He showed a brief clip of a prairie dog turning dramatically around before the camera, with accompanying music. The title of this clip is “dramatic chipmunk,” which would be hard for someone seeking out the video to guess.
People may also look for videos for a wide variety of reasons, Over explained. For instance, a casual Web surfer may be looking for an entertaining clip. An intelligence analyst may be looking for background information and may not care why the video was taken. A documentarian or news organization may be looking for stock footage of a particular time and place. It would be hard to tag a video in such a way that it could be found by all these users.
As an example, Over showed a brief clip of a woman running across a plaza, frightening off a bunch of pigeons, before slipping and falling on the wet bricks.
“How might you tag this so it would be reusable?” he asked, before running down a list of obvious descriptions: “Woman, pigeons, plaza, daytime, outdoors, falling.” In fact, the person who uploaded the video tagged it only with the phrase “stupid sister.”
The tag “is very personal, and means something to that person, but it is not useful” for reuse by others, he said.
The NIST program sets out a series of challenges each year for improving the state of automated video search, using actual video footage as a test of success. The idea is to develop algorithms that can tag footage as well as humans do, pinpointing people, objects, locations and even specific events within the video.
One approach has been to build up a set of what he calls “recognizers,” that is, objects or events that algorithms can identify. The software could ask, “Does this shot contain a classroom? Does it contain a chair? Is there singing going on?” and apply the specific tags. The more recognizers, the more robust the software would be in capturing valuable attributes of the footage.
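That approach can be pictured as running each shot through a bank of yes/no classifiers and keeping the tags that fire. A minimal sketch, with trivial keyword checks standing in for the trained recognizers a real system would use:

```python
# Sketch: a bank of "recognizers" that each answer one yes/no question
# about a shot and contribute a tag when they fire. The keyword checks
# below are trivial stand-ins for trained visual/audio classifiers.

def tag_shot(shot_description, recognizers):
    """Apply every recognizer to the shot; collect the tags that match."""
    return [tag for tag, matches in recognizers.items()
            if matches(shot_description)]

recognizers = {
    "classroom": lambda s: "classroom" in s,
    "chair":     lambda s: "chair" in s,
    "singing":   lambda s: "singing" in s or "sings" in s,
}

shot = "a teacher sings to students seated on chairs in a classroom"
print(tag_shot(shot, recognizers))  # → ['classroom', 'chair', 'singing']
```

Adding a new attribute is just a matter of adding another entry to the bank, which matches the observation that more recognizers make the tagging more robust.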
Progress is being made: In previous years, the program used nightly news broadcasts and airport video surveillance footage as the data set to test new video search systems and technologies. This year, however, the program will use video footage from the Internet Archive, which offers a greater diversity of material.
“As particular approaches, or algorithms, get incorporated into different systems, they get tested on different data and [to] prove they work again and again,” Over said.
While such tools have come a long way in the past few years, much work remains to bring them to commercial usability, the panel agreed. Though the tools may eventually prove useful, “I don’t see us using them in the near future,” said the Institute for Sound and Vision’s Westerhof.