Investigating Infrastructure Services
The infrastructure services are composed of the Elastic Computing Cloud (EC2); Simple Storage Service (S3), a persistent storage system; the Simple Database (SimpleDB), which implements a remotely accessible database; and Amazon's Simple Queuing Service (SQS), a message queue service and the agent for binding distributed applications formed from the combination of EC2, S3, and SimpleDB.
These services provide virtually limitless compute, storage, and communication facilities. They're ideally suited for what might be called "intermittent" applications: those that require substantial compute or storage facilities on an irregular basis (for example, an application that wakes up Friday evening to process data gathered during the week). An application that requires worldwide connectivity -- say, a system that processes graphics files and makes the results available to clients across the Internet -- can also make good use of infrastructure services. Finally, these services act as excellent proof-of-concept laboratories for large-scale distributed applications. A development house seeking to demonstrate the feasibility of a proposed enterprise-wide application can implement a prototype using the infrastructure services, and avoid hardware costs that, if the prototype is deemed unworkable, would be a net loss.
Elastic Computing Cloud (EC2). Imagine a vast room filled with server systems, all networked together. Sitting at your single workstation, you create a virtual machine image that defines a 1.2GHz processor running Linux with 1.7GB of RAM and a 160GB hard disk, pre-loaded with software you have crafted specifically to number-crunch a large matrix of mined data. You deploy this image to an outside service, which manages those servers. At some future point, a boatload of matrices arrives from your data-mining operations. You instruct the service to instantiate 50 of your virtual machines, and turn each loose on one of the data matrices. Within a few seconds, 50 of those 1.2GHz processors are active and chomping on your data. They finish, deposit their results at a pre-specified storage site, and disappear.
That's EC2 in a nutshell. It's nothing less than a boundless collection of virtual computers that a user can call into existence to perform some processing task. "Boundless," however, does not mean "infinite"; rather, there is no specific upper limit -- other than your wallet. Amazon's documentation states that you can commission "hundreds, or even thousands" of virtual machines simultaneously.
[How does EC2 stack up against its rivals? Read the Test Center's hands-on review "Cloud versus cloud: A guided tour of Amazon, Google, AppNexus, and GoGrid."]
Because systems in EC2 are virtual, Amazon provides a range of hardware capabilities. At the low end, you can call for a 1.26GHz Opteron-class machine with 1.8GB of RAM. At the high end (at the time of this writing), you can request a 64-bit multicore system with 15GB of RAM. These specifications are approximations. Virtual machines that you instantiate are rated in EC2 Computer Units (ECUs), which Amazon defines as being equivalent to a 1.0GHz to 1.2GHz 2007 Opteron processor. (The company suggests you do your own benchmarking to determine which instance is best for your particular application.)
An Amazon Machine Image (AMI) consists of an operating system and whatever applications you want pre-loaded when the virtual machine is started. Currently, only Linux is available as an EC2 instance's OS, though this is hardly a limitation. There are quite a few distributions in Amazon's catalog of prebuilt AMIs. Perusing the list, I found ready-to-use AMIs for Ubuntu, OpenSolaris, Centos, Fedora, and many others -- all told, more than 100 AMIs ready to go. You can build your own AMI using a free Amazon-provided SDK, but the process is lengthy. It is far easier to select a prebuilt AMI from the catalog, and customize it as necessary. Even so, many available AMIs include software for specific applications; you may well find one that already has much of what you need.
Simple Storage Service (S3). Amazon's Simple Storage Service (S3) is effectively a large disk drive in the ether. Strictly speaking, that's 90 percent of everything you need to know about it. It has no directories and no file names -- just a big place where you can store and fetch unstructured data in gobs as small as 1 byte or as big as 5GB.
What I call a "gob," S3 calls an "object," and in place of "directory," S3 says "bucket." So when you store a 200KB JPEG on S3, you're putting a 200KB object in a bucket. A given AWS account can own up to 100 buckets. A bucket can hold an unlimited number of gobs, and it can be configured to reside either in the United States or Europe. Presumably, this provides users a comforting feeling of locality, because buckets are available anywhere on the Internet that Amazon is accessible. Cost differences between the two are tiny; a bucket in Europe will run you something like two-thousandths of a cent more per 1,000 requests than in the United States.
Digging a bit deeper, you can think of an object as a three-in-one entity: key, value, and metadata. The key is the object's name, value is its content, and metadata is an array of key/value pairs carrying information about the object. (Access permissions are also associated with an object, but are treated as separate from object storage.) An object's name can be between 3 and 255 characters, and the only constraint that Amazon places on names is that they not confuse URL parsing. Thus, an object with a name of "192.168.12.12" is a bad idea.
Whereas the architecture of S3 is effectively a flat file system, S3's APIs permit a clever programmer to build apparent subdirectories within a bucket. The hierarchies have to be encoded in the object names, which is less than ideal; however, it's an artifact that code could simply mask. So, if you want one directory of animals and another of vegetables, you might have object keys such as "animal-cat", "animal-dog," "vegetable-beet," and "vegetable-carrot." Using the prefix parameter of the List operation, you can restrict retrieved object keys to only animals or only vegetables. More complicated data structures should be kept in Amazon's Simple Database.
Amazon Simple Database Service (SimpleDB). While Amazon S3 is designed for large, unstructured blocks of data, SimpleDB is built for complex, structured data. As with the other services, the name says it all. SimpleDB implements a database that sits behind a lightweight, easily mastered query language that nonetheless supports most of the database operations (searching, fetching, inserting, and deleting) you'll likely need. In keeping SimpleDB simple, Amazon has followed the principle that the best APIs are those with minimal entry points: I count seven for SimpleDB.
A SimpleDB database is not exactly like a relational database of the Oracle or MySQL sort. (Amazon's documentation points out that, if you do need a full-blown relational database, you are free to run a MySQL server on an AMI in the elastic compute cloud.) A SimpleDB database (a "domain" in SimpleDB parlance) is composed of items, and items are composed of attributes. An attribute is a name/value pair. At a minimum, an item must have an ItemName attribute, which serves as the item's unique identifier. When you issue a query, the result is a collection of ItemName values -- to fetch the actual content of the item (the attributes), you perform a Get operation using those values as input.
As simple as it is, SimpleDB packs surprising capabilities. A SimpleDB database can grow up to 10GB and house up to 250 million attributes. You can define up to 256 attributes for a given item, and there is no requirement that all the items in a domain have the same attributes. In addition, a given attribute can have multiple values, so a customer database could store multiple aliases in a single CustomerName attribute.
Finally, SimpleDB is designed to support "real-time" (fast turnaround) queries. To ensure quick query response, all attributes are indexed automatically as items are placed in the database. Also, Amazon's documentation indicates that a query should take no more than 5 seconds to complete; otherwise, it will likely time out. Amazon does this to ensure that a query receives a quick response, even if a query is malformed to the degree that it would hamper the calling application.
Amazon Simple Queue Service (SQS). Amazon SQS is a message queuing service in the vein of JMS or MQSeries -- only simpler. SQS's most impressive characteristic is its ubiquity. A blurb from Amazon's documentation reads: "Any computer on the Internet can add or read messages without any installed software or special firewall configurations." The most likely participants in SQS message transactions are, of course, instantiated AMIs in the EC2.
As with other Amazon Web services, SQS earns its name: Messages are text-only, and must be less than 8KB in length. You can build a working queue with only four functions: CreateQueue, SendMessage, ReceiveMessage, and DeleteMessage. (There are other convenience functions; ListQueues, for example, will list an account's existing queues.)
SQS queues are designed primarily to support workflows among distributed computer systems, and as such, concurrency management and fail-over are implicit. When a client reads a message from a queue, that message is not deleted; it is simply locked in such a fashion that it becomes invisible to other clients. In that way, if the message represents a specific task to perform as part of a workflow, two clients cannot read the same message and, thereby, duplicate effort. However, if the message is not deleted before a specified timeout, the lock is released. The intent, then, is for the original reader of the message to delete it when the specified work is complete. If the original reader is unable to complete the work (perhaps on account of a system crash), the timeout expires, the message "reappears" in the queue, and a different client can read the message and undertake the specified work.