Diffbot is trying to reorganize all the data on the Web so it can be put to better use.
The service “converts the existing Web into a structured database-like representation that can essentially be used for all sorts of intelligent applications,” said Mike Tung, Diffbot CEO.
On Thursday, Diffbot said it had received $500,000 in funding from Bloomberg Beta, the investment arm of the Bloomberg media company. Andy Bechtolsheim, a founder of Sun Microsystems and the first major investor in Google, is also a backer. Diffbot says it already has paying customers for the service, which is being used by Microsoft’s Bing, Adobe, Salesforce.com, and eBay.
The service creates an object for each Web page it finds. An object provides structure to a set of related data so that it can be programmatically reused, along with other similar objects, by a query engine or an external application. The software has been copying all the pages it finds on the Web and reorganizing them into objects.
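To make the idea of a page object concrete, here is a minimal sketch in Python. It is not Diffbot's data model: the `PageObject` class, its field names, the `query` helper and the sample records are all hypothetical, chosen only to show how structured records can be filtered by a program in a way raw pages cannot.

```python
from dataclasses import dataclass, field


@dataclass
class PageObject:
    """Hypothetical structured record extracted from one Web page."""
    url: str
    page_type: str  # e.g. "product" or "article"
    title: str
    fields: dict = field(default_factory=dict)  # type-specific attributes


def query(objects, page_type, **criteria):
    """Return objects of the given type whose fields match every criterion."""
    return [
        obj for obj in objects
        if obj.page_type == page_type
        and all(obj.fields.get(key) == value for key, value in criteria.items())
    ]


# Made-up sample data, for illustration only.
objects = [
    PageObject("https://example.com/shoes/1", "product", "Trail Runner",
               {"brand": "Acme", "price": 89.0}),
    PageObject("https://example.com/shoes/2", "product", "Road Racer",
               {"brand": "Zenith", "price": 120.0}),
    PageObject("https://example.com/news/1", "article", "Shoe sales rise",
               {"author": "J. Doe"}),
]

acme_products = query(objects, "product", brand="Acme")
print([obj.title for obj in acme_products])
```

Because every record of a given type shares the same shape, a query engine or an outside application can treat a collection of them much like rows in a database table.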
Perhaps the best-known example of this object-based approach is Google’s Knowledge Graph, a Semantic Web project. If a search is done on a particular keyword, such as the name “Johnny Depp,” Google will return, along with a standard list of Web pages, a box containing basic information on the actor, such as birth date and height. That box of information is a rendering of the “Johnny Depp” Knowledge Graph object built by Google.
Diffbot, which is based in Palo Alto, California, and was founded in 2008, claims its own collection of objects is superior to Google’s.
The 14-person company says it has created an entirely automated system for accurately creating objects. Google’s approach is at least partly manual, requiring individuals to edit objects after they have been created, confirmed a Google spokesman.
Google’s Knowledge Graph is larger than Diffbot’s, containing roughly a billion objects, while Diffbot’s global index of the Web now includes 600 million objects. But Google doesn’t yet offer a Knowledge Graph API for third-party commercial use, though it is working on one.
Diffbot is based on the idea that businesses could use such a collection of organized information for their own purposes. Nike, for instance, could deploy the service to build a profile of other shoe companies and their offerings, Tung suggested. Diffbot offers a set of APIs (application programming interfaces) that third-party applications can use to query the massive object set.
The company has developed a set of AI algorithms, some of which it is in the process of patenting, that can identify the context and subject of Web pages. One novel AI algorithm relies on computer vision, which is not a widely used technique for indexing Web pages, Tung acknowledged. The layout and design of Web pages can provide important clues to help better define objects. “The layout is the signal that helps us determine what kind of page it is,” Tung said. An e-commerce site has an entirely different structure than a news site, for instance.
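The article does not describe how Diffbot's algorithms work, so the following is only a toy illustration of the general idea that layout can signal page type. The `LayoutFeatures` fields, the thresholds and the scoring rules are all assumptions invented for this sketch; a real system would presumably learn such signals from rendered pages rather than rely on hand-written rules.

```python
from dataclasses import dataclass


@dataclass
class LayoutFeatures:
    """Hypothetical visual measurements taken from a rendered page."""
    price_like_blocks: int     # text blocks that look like prices
    image_area_ratio: float    # share of the viewport covered by images
    longest_text_column: int   # characters in the longest text column
    has_buy_button: bool


def classify_page(f: LayoutFeatures) -> str:
    """Guess the page type from layout cues using hand-written rules."""
    product_score = 0
    if f.has_buy_button:
        product_score += 2
    if f.price_like_blocks > 0:
        product_score += 1
    if f.image_area_ratio > 0.3:  # arbitrary threshold for this sketch
        product_score += 1

    article_score = 0
    if f.longest_text_column > 2000:  # arbitrary threshold for this sketch
        article_score += 2
    if f.price_like_blocks == 0:
        article_score += 1

    if product_score == article_score:
        return "unknown"
    return "product" if product_score > article_score else "article"


# Made-up measurements for two imaginary pages.
shop_page = LayoutFeatures(price_like_blocks=3, image_area_ratio=0.45,
                           longest_text_column=400, has_buy_button=True)
news_page = LayoutFeatures(price_like_blocks=0, image_area_ratio=0.15,
                           longest_text_column=6000, has_buy_button=False)

print(classify_page(shop_page), classify_page(news_page))
```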
Diffbot is one of a number of companies building such “knowledge graphs,” through various sets of technologies, said Dave Schubmehl, an IDC research director who covers content analytics, discovery and cognitive systems. Such technology could be of potential value to any business that relies on understanding large amounts of external data, he said via email.
Another company working in this field is IBM, Schubmehl wrote. Last year, IBM purchased two companies to add similar capabilities to its Watson cognitive computing service. One was AlchemyAPI, which builds taxonomies of data assets, and the other was Blekko, which developed software for indexing Web sites.
Some organizations use other technologies to organize and synthesize large sets of otherwise unstructured information, according to Schubmehl. Neo4j and Oracle both offer graph databases, which are well-suited for identifying the connections across large collections of data. Others rely on Semantic Web technologies, such as the Sesame Java framework, which is used for working with data in the structured RDF (Resource Description Framework) format.
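RDF models data as subject-predicate-object statements, known as triples, and graph databases answer questions by following such connections. The sketch below imitates that idea with a plain Python set. It does not use Sesame (a Java library) or any real RDF toolkit; the `ex:` names and the `match` helper are hypothetical, and the facts encoded are the ones reported in this article.

```python
# Each triple is (subject, predicate, object), mirroring an RDF statement.
# The "ex:" prefix stands in for a made-up namespace.
triples = {
    ("ex:Diffbot", "rdf:type", "ex:Company"),
    ("ex:Diffbot", "ex:basedIn", "ex:PaloAlto"),
    ("ex:Diffbot", "ex:foundedIn", "2008"),
    ("ex:IBM", "rdf:type", "ex:Company"),
    ("ex:IBM", "ex:acquired", "ex:AlchemyAPI"),
    ("ex:IBM", "ex:acquired", "ex:Blekko"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which entities are companies, and what did IBM acquire?
companies = [s for s, _, _ in match(triples, p="rdf:type", o="ex:Company")]
ibm_acquisitions = [o for _, _, o in match(triples, s="ex:IBM", p="ex:acquired")]
print(companies, ibm_acquisitions)
```

A production graph database or triple store adds indexing, a query language and persistence on top of this basic pattern-matching idea.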