Tuesday, August 30, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (Part 1)

Luiz André Barroso and Urs Hölzle, Google Inc.
The emergence of popular Internet services and the availability of high-speed connectivity have accelerated a trend towards server-side, or cloud, computing. This paradigm is driven primarily by ease of management, ubiquity of access, faster application development, and the economies of scale of a shared environment, which together allow many application services to run at a low cost per user. The trend towards server-side computing has created a new class of computing system that the authors call warehouse-scale computers (WSCs). These machines operate at a massive scale of tens of thousands of servers and power the services offered by companies like Google, Microsoft, Amazon, and Yahoo. They differ from traditional datacenters in that they are generally owned by a single organization, use a relatively homogeneous hardware and software platform, and share a common systems management layer.

In the first chapter, the authors highlight the tradeoffs and architectural decisions Google made while building its fleet of WSCs: storage (GFS instead of NAS), networking fabric (spending more on the interconnect by building "fat tree" Clos networks rather than investing in high-end switches), the storage hierarchy (exploiting L1/L2 caches, organizing DRAM and disk into a local/rack/datacenter hierarchy, and exposing that hierarchy to application designers), and power usage (taking steps towards power-proportional clusters).
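As a rough illustration of that hierarchy, here is a minimal Python sketch; the latency and capacity figures are my own order-of-magnitude assumptions for illustration, not the paper's exact tables:

```python
# A minimal sketch of the local/rack/datacenter storage hierarchy the chapter
# exposes to application designers. All figures below are my own
# order-of-magnitude assumptions, not the paper's tables.

HIERARCHY = [
    # (level, medium, approx. access latency in microseconds, approx. capacity)
    ("local",      "DRAM",      0.1, "tens of GB"),
    ("rack",       "DRAM",    100.0, "~1 TB aggregated across the rack"),
    ("datacenter", "DRAM",    500.0, "tens of TB aggregated across the cluster"),
    ("local",      "disk",  10000.0, "a few TB"),
    ("rack",       "disk",  11000.0, "hundreds of TB aggregated across the rack"),
    ("datacenter", "disk",  12000.0, "several PB aggregated across the cluster"),
]

if __name__ == "__main__":
    for level, medium, latency_us, capacity in HIERARCHY:
        print(f"{level:>10} {medium}: ~{latency_us:>8.1f} us, {capacity}")
    # The authors' key observation: DRAM on another machine in the rack is
    # still roughly 100x faster to reach than the local disk, so the
    # cluster's aggregate DRAM can serve as one large cache tier.
```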

The second chapter gives a sneak preview of a variety of Google workloads: index generation, search, and MapReduce-based similarity detection in scholarly articles. Given the nature of these workloads, the entire software stack is designed for high performance and availability (through replication, sharding, and load balancing; a toy sketch follows the list below), sometimes at the cost of strong consistency. Overall, the stack is divided into three layers:
  1. Platform-level Software: the firmware and kernel running on individual machines.
  2. Cluster-level Software: the distributed systems software that manages resources and services at the cluster level (e.g., MapReduce, BigTable, Dryad).
  3. Application-level Software: the software that implements a particular service, such as Gmail or Search.
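Here is the promised toy sketch of the replication/sharding/load-balancing pattern in Python; the hash-based shard placement and the least-loaded routing policy are my illustrative assumptions, not the specific scheme the paper describes:

```python
# A toy sketch of sharding with replication and load-balanced reads.
# Placement and routing policies here are illustrative assumptions.
import hashlib

NUM_SHARDS = 8
REPLICAS_PER_SHARD = 3
SERVERS = [f"server-{i}" for i in range(12)]

def shard_of(key: str) -> int:
    """Map a key to a shard with a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_of(shard: int) -> list:
    """Place each shard's replicas on consecutive servers (toy placement)."""
    return [SERVERS[(shard + r) % len(SERVERS)] for r in range(REPLICAS_PER_SHARD)]

outstanding = {s: 0 for s in SERVERS}  # in-flight requests per server

def route_read(key: str) -> str:
    """Send the read to the least-loaded replica of the key's shard."""
    target = min(replicas_of(shard_of(key)), key=lambda s: outstanding[s])
    outstanding[target] += 1
    return target

print(route_read("some query"))  # e.g. "server-3"
```

Real cluster-level software layers failure detection and re-replication on top of this, but the routing core is the same idea: any of a shard's replicas can serve a read, so load spreads and single-machine failures are masked.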
Finally, the authors highlight the importance of debugging tools in datacenters and briefly compare black-box monitoring tools (like WAP5 and Sherlock) with instrumentation-based tracing schemes (like X-Trace). Because the latter are more accurate, Google developed Dapper, a lightweight annotation-based tracing tool.
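To make the annotation-based approach concrete, here is a minimal Python sketch in the spirit of Dapper; the exact field names and the explicit context threading are my assumptions (the real Dapper propagates trace state transparently inside Google's RPC libraries):

```python
# A minimal sketch of annotation-based distributed tracing, Dapper-style.
# Field names and explicit context passing are illustrative assumptions.
import random
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: int            # shared by every span of one request
    span_id: int             # unique to this unit of work
    parent_id: Optional[int] # links the span into the request tree
    name: str
    annotations: list = field(default_factory=list)

    def annotate(self, message: str) -> None:
        """Timestamped annotation, as instrumented application code would add."""
        self.annotations.append((time.time(), message))

def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a span, inheriting the trace id from the parent if there is one."""
    trace_id = parent.trace_id if parent else random.getrandbits(64)
    parent_id = parent.span_id if parent else None
    return Span(trace_id, random.getrandbits(64), parent_id, name)

# A frontend request fans out to a backend; both spans share one trace id,
# so a collector can later stitch them back into a single request tree.
root = start_span("frontend.Search")
child = start_span("backend.IndexLookup", parent=root)
child.annotate("cache miss, reading from disk")
```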
