Business Software

A Deep Dive into Amazon Web Services

Wading into Web Information Services

Amazon's Web Information Services are essentially query interfaces into extensive databases generated by a mixture of Web crawlers and Web traffic monitors. Data-mining organization can tap into the crawler-produced data to sift through information that is as wide-ranging as the Web itself. The utility of Web traffic data is self-evident to any company or individual interested in user visitation trends to their sites -- as well as to related or competing sites.

AlexaWeb Search. Amazon's Alexa Web Search is the result of partnering between Amazon and Alexa, and it lets you query the information gathered by Alexa's Web crawler bots. The quantity of information available is difficult to gauge; Alexa has been crawling the Web for over a decade, and the Internet is in nonstop growth. Alexa's site says that, while its bots are working constantly, it takes about two months for a complete cycle through the Internet.

When Alexa adds a new Web site document to its database, it indexes about 50 attributes associated with that document. Attributes include the document's language, its Open Document Category, various parsed components of the URL, geographic location of the hosting server, and more. Also available is the document's text, the first 20KB of which is text-indexed. All this is available for searching.

Naturally, searches on such a large database can take time. The Alexa Web Search service is architected so that when you issue a search, the service returns a request ID. You use this ID to track the status of your search's progress. When the search is complete, results are stored in a (possibly gigantic) text file. The text file can be downloaded and "mined" locally.

Alexa Web Information Service (AWIS). The Alexa Web Information Service lets you dip into traffic data gathered by various Alexa tools deployed about the Internet. You can query information for a specific URL, such as site contact information, traffic statistics (going back five years), and more. You can also discover how many links are on a given page, how many URLs are embedded in JavaScript, or the more interesting statistic of how may other sites link to the target ("inward-pointing" links). You can also use AWIS to fetch a thumbnail image of a Web page, useful for displaying pop-ups in response to a cursor hovering over links.

The accuracy of Alexa's data is unclear. The Alexa Web site states that the "traffic data are based on the set of toolbars that use Alexa's data, which may not be representative of the global internet population." Meanwhile, an Amazon Web services representative informed me that Amazon "aggregate[s] data from multiple sources to give you a better indication of Web site popularity." In any case, the ability to scour the text content of whole swaths of the Internet makes the Alexa Web service a profitable vein for Web data spelunkers.

Ready for the Big Time?

Amazon's Web Services are at once exciting and troubling. The infrastructure services adopt a sort of "mercenary" model of hardware and software horsepower; in theory, you can employ as large an army of computing power as your pocketbook can withstand. All the services offer universal availability -- if your network connection can reach Amazon, it can reach AWS. These are two powerful isotopes for fueling large-scale, on-demand, software services.

On the other hand, however, some of the important components are still in beta. SimpleDB, in fact, was in limited beta and not accepting new users at the time of this writing. The description of "beta" is off-putting, as it implies an architecture whose foundation has not yet solidified. And this implication became hard reality when, in June, Amazon's S3 suffered a temporary power outage that affected such high-profile users as the New York Times, whose archives were crippled.

Furthermore, the long-term security of the entire AWS remains to be seen. We can only take Amazon's word that its systems guarantee isolation of one user's applications from another's. Put simply, AWS is only going to work if its users' trust in it is complete. A security breach of any sort would likely be a mortal wound.

Programmers and architects of distributed systems will find the infrastructure pages on the AWS site to be nothing short of a playground. You can spend hours perusing the documentation, tutorials, examples, and references to community-supplied tools and libraries.

The "cloud" services -- EC2, S3, SQS, and SimpleDB -- are certainly compelling. Real applications are being built atop these virtual technologies. Examples can be found at the Amazon Web Services Elastic Compute Cloud resources page.

Some of the AWS components are of questionable utility. In particular, Mechanical Turk seems to create a built-in incentive to cause tasks to be priced below what they otherwise would. However, even the Turk might be a case of a technology ahead of its time. As the ability to conduct business over the Net continues to improve, perhaps Mechanical Turk will also.

Whether the notion of Amazon's "rentable infrastructure" catches on is unknown. Its failure (should it fail) will not be for lack of information and tools. I will be eagerly prowling the AWS Web site and AWS-relevant blogs to see what creations arise from the enticing techno-tinker-toy set that AWS represents.

Subscribe to the Daily Downloads Newsletter

Comments