Tuesday, August 30, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (Part 1)

Luiz André Barroso and Urs Hölzle, Google Inc. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.
The emergence of popular Internet services and the availability of high-speed connectivity have accelerated a trend towards server-side, or cloud, computing. This paradigm is driven primarily by ease of management, ubiquity of access, faster application development, and economies of scale in a shared environment, which together allow many application services to run at a low cost per user. The trend towards server-side computing has created a new class of computing systems that the authors call warehouse-scale computers (WSCs). These warehouse-scale computers operate at a massive scale of tens of thousands of machines and power the services offered by companies such as Google, Microsoft, Amazon, and Yahoo. However, they differ from traditional datacenters in that they are generally owned by a single organization, use a relatively homogeneous hardware and software platform, and share a common systems-management layer.

In the first chapter, the authors highlight the tradeoffs and architectural decisions Google made while setting up its fleet of WSCs, in terms of storage (GFS instead of NAS), the networking fabric (spending more on the interconnect by building "fat-tree" Clos networks as opposed to investing in high-end switches), the storage hierarchy (exploiting L1/L2 caches and a local/rack/datacenter hierarchy of DRAM and disk that is exposed to application designers), and power usage (taking steps towards power-proportional clusters).
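To make the power-proportionality point concrete, here is a minimal sketch (not from the book) of the common linear power model, in which a server draws a large fraction of its peak power even when idle; the wattage and idle fraction below are illustrative assumptions, not the book's figures.

```python
# Sketch of the energy-proportionality argument behind power-proportional
# clusters, assuming a linear power model: P(u) = P_idle + (P_peak - P_idle) * u.
# The peak wattage and 50% idle fraction are illustrative, not from the book.

PEAK_WATTS = 500.0     # hypothetical peak power draw of one server
IDLE_FRACTION = 0.5    # hypothetical: server draws 50% of peak power while idle

def power_draw(utilization: float) -> float:
    """Power (watts) drawn at a given utilization in [0, 1]."""
    idle_watts = IDLE_FRACTION * PEAK_WATTS
    return idle_watts + (PEAK_WATTS - idle_watts) * utilization

def relative_efficiency(utilization: float) -> float:
    """Useful work per watt, relative to a perfectly power-proportional server."""
    if utilization == 0:
        return 0.0
    return utilization * PEAK_WATTS / power_draw(utilization)

if __name__ == "__main__":
    # WSC servers typically run at low-to-moderate utilization, which is
    # exactly where a non-proportional server is least efficient.
    for u in (0.1, 0.3, 0.5, 1.0):
        print(f"utilization={u:.0%}  power={power_draw(u):.0f} W  "
              f"relative efficiency={relative_efficiency(u):.2f}")
```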

The second chapter gives a sneak peek at a variety of Google's workloads: index generation, search, and MapReduce-based similarity detection in scholarly articles. Because of the nature of these workloads, the entire software stack is designed for high performance and availability (through replication, sharding, and load balancing), sometimes at the cost of strong consistency. Overall, the stack is divided into three layers:
  1. Platform-level Software: The firmware/kernel running on individual machines.
  2. Cluster-level Software: The distributed systems software that manages resources and services at the cluster level (e.g., MapReduce, BigTable, Dryad); see the sketch after this list.
  3. Application-level Software: The software that implements a particular service, such as Gmail or Search.
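To give a feel for the cluster-level layer, below is a minimal, single-process sketch of the MapReduce programming model (a word count). It only mimics the map/shuffle/reduce phases; the real cluster-level software adds distribution, scheduling, and fault tolerance.

```python
# Minimal single-process sketch of the MapReduce programming model (word count).
# It imitates the map -> shuffle -> reduce phases only; real cluster-level
# software distributes this work across machines and handles failures.
from collections import defaultdict

def map_phase(document: str):
    """Emit (word, 1) pairs for every word in the document."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts for one word."""
    return key, sum(values)

if __name__ == "__main__":
    documents = ["the datacenter as a computer", "the cloud as a computer"]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)  # e.g. {'the': 2, 'datacenter': 1, 'as': 2, 'a': 2, ...}
```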
Finally, the authors highlight the importance of debugging tools in datacenters and briefly compare black-box monitoring tools (like WAP5 and Sherlock) with instrumentation-based tracing schemes (like X-Trace). Because the latter offer greater accuracy, Google developed Dapper, a lightweight annotation-based tracing tool.
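The chapter does not spell out Dapper's interfaces, so the snippet below is only a toy illustration of what annotation-based tracing looks like: a hypothetical `traced` decorator tags each call with a trace id and records a timed span. It is not Dapper's API.

```python
# Toy illustration of annotation-based (instrumentation) tracing in the spirit
# of Dapper. The `traced` decorator and module-level trace id are hypothetical;
# a real tracer propagates trace/span ids across RPCs and collects spans
# out-of-band instead of printing them.
import functools
import time
import uuid

_current_trace_id = None  # stands in for per-request context propagation

def traced(func):
    """Record a timed span, tagged with the current trace id, for each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _current_trace_id
        if _current_trace_id is None:          # root of a new trace
            _current_trace_id = uuid.uuid4().hex[:8]
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"trace={_current_trace_id} span={func.__name__} "
                  f"took {elapsed_ms:.2f} ms")
    return wrapper

@traced
def handle_query(q):
    return fetch_results(q)

@traced
def fetch_results(q):
    time.sleep(0.01)  # stand-in for a backend call
    return [q.upper()]

if __name__ == "__main__":
    handle_query("warehouse scale computing")
```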

Above the Clouds: A Berkeley View of Cloud Computing

Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia. Above the Clouds: A Berkeley View of Cloud Computing.
The authors identify three properties that give cloud computing its appeal: short-term, pay-as-you-go usage (which allows resources to be scaled up or down with demand), no up-front cost, and the (illusion of) infinite capacity on demand. Given the elasticity and flexibility that cloud computing provides, almost any application that combines computation, storage, and/or communication fits naturally into this paradigm.

A key decision for any application provider is whether to rent resources from the cloud on a pay-per-use basis or incur the one-time cost of setting up an entire datacenter. The paper lays out the reasoning that should go into this decision, weighing long-term economics against the risk posed by unpredictable load spikes.
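Roughly, the paper frames this as comparing profit per user-hour in the cloud against profit in an owned datacenter whose cost must be amortized over its average utilization. The sketch below encodes that comparison; all dollar figures and the utilization value are made up for illustration.

```python
# Sketch of the cloud-vs-own-datacenter tradeoff. It compares profit when
# paying per user-hour in the cloud against profit when amortizing a fixed
# datacenter that is only partially utilized. All dollar figures and the
# utilization value are illustrative, not from the paper.

def cloud_profit(user_hours, revenue_per_hour, cloud_cost_per_hour):
    """Profit when resources are rented per user-hour from the cloud."""
    return user_hours * (revenue_per_hour - cloud_cost_per_hour)

def datacenter_profit(user_hours, revenue_per_hour, dc_cost_per_hour, utilization):
    """Profit when owning a datacenter: its cost is paid whether or not
    capacity is used, so the effective cost scales with 1 / utilization."""
    return user_hours * (revenue_per_hour - dc_cost_per_hour / utilization)

if __name__ == "__main__":
    user_hours = 10_000
    revenue = 1.00      # $ earned per served user-hour (hypothetical)
    cloud_cost = 0.40   # $ paid to the cloud per user-hour (hypothetical)
    dc_cost = 0.25      # $ amortized datacenter cost per user-hour at full capacity
    utilization = 0.4   # average utilization of the owned datacenter

    print("cloud profit:      $", cloud_profit(user_hours, revenue, cloud_cost))
    print("datacenter profit: $", datacenter_profit(user_hours, revenue, dc_cost, utilization))
    # At 40% utilization the owned datacenter's effective cost is 0.25 / 0.4 =
    # $0.625 per user-hour, so the cloud comes out ahead in this example.
```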

Finally, the paper highlights some key obstacles to the growth of cloud computing and pairs each of them with opportunities that may result:
  1. Availability of the Service: Guarding against large-scale failures and DDoS attacks.
  2. Data Lock-In: The difficulty of extracting data and programs from one service provider in order to move to another.
  3. Data Confidentiality: Making sure that data remains secure (and auditable) in a shared, third-party environment.
  4. Data Transfer Bottlenecks: Transferring user data to the datacenter economically.
  5. Performance Unpredictability: Achieving disk I/O performance isolation even though resources are shared.
  6. Scalable Storage: Making sure that the store remains scalable, durable, and highly available even as the size of the data increases.
  7. Distributed Debugging: Identifying bugs in large-scale distributed systems.
  8. Scaling Quickly: Scaling resources up and down rapidly in response to load while conserving money and resources.
  9. Reputation Fate Sharing: One tenant's bad behavior (e.g., spamming) tarnishing the reputation of others sharing the same infrastructure.
  10. Software Licensing: Adapting proprietary software licenses to pay-as-you-go usage.