Wednesday, September 7, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (Part 2)

Luiz André Barroso and Urs Hölzle, Google Inc. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Chapters 3, 4, 7.
Server hardware, the networking fabric, and the storage hierarchy are the three key building blocks of Warehouse-Scale Computers (WSCs). In this set of chapters, the authors explained the rationale behind using clusters of low-end servers (or higher-end personal computers) as the computational fabric for WSCs. Based on an analysis of the TPC-C benchmark and the assumption of a uniformly distributed global store, they showed that, on one hand, the performance edge of a cluster built from high-end servers deteriorates quickly as the cluster size increases (due to communication costs), and on the other, using low-end embedded/PC-class components incurs considerably higher costs for software development and additional networking fabric. Further, very small servers may lead to low utilization (because small 'bins' cannot fit multiple tasks) and make even embarrassingly parallel programs less efficient, since they occasionally need to access global information. Hence, building clusters out of low-end servers is a good middle ground.
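To make the "shrinking edge" argument concrete, here is a minimal back-of-the-envelope sketch (mine, not from the book) of average data-access latency under the uniformly distributed global store assumption: a node holds 1/N of the data locally and fetches the rest over the cluster network. The latency numbers are illustrative placeholders, not measured values.

```python
# Sketch: why a high-end server's fast local fabric matters less as the
# cluster grows. Assumes 1/num_nodes of accesses are local; all latencies
# below (in microseconds) are assumed for illustration.

def avg_access_time_us(num_nodes, local_us, network_us):
    """Average access latency when only 1/num_nodes of accesses stay local."""
    local_fraction = 1.0 / num_nodes
    return local_fraction * local_us + (1.0 - local_fraction) * network_us

for num_nodes in (1, 4, 16, 64, 256, 1024):
    high_end = avg_access_time_us(num_nodes, local_us=1.0, network_us=100.0)
    low_end = avg_access_time_us(num_nodes, local_us=10.0, network_us=100.0)
    print(f"{num_nodes:5d} nodes: high-end advantage = {low_end / high_end:.2f}x")
```

With one node the high-end machine is 10x faster in this toy model, but by a few hundred nodes the advantage has almost entirely evaporated because network latency dominates both configurations.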


The authors further presented an architectural view of the datacenter in terms of its power and cooling requirements. A key aspect highlighted here is the redundancy inherent in this design. Since the bulk of construction costs is proportional to the amount of power delivered and the amount of heat dissipated, Google is experimenting with a variety of solutions such as water-based free cooling (wherein ambient air is drawn past a flow of water, lowering its temperature), glycol-based radiators (using glycol instead of water to prevent freezing in cold climates), and in-rack cooling (adding an air-to-water heat exchanger at the back of the rack so that hot air exiting the rack is cooled immediately). An interesting trend to look out for is the emergence of container-based datacenters, which integrate heat exchange and power distribution into the container itself, achieving extremely high energy efficiency.
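The usual way to quantify that energy efficiency is Power Usage Effectiveness (PUE), the ratio of total facility power to power delivered to the IT equipment. A small sketch with assumed overhead numbers (not figures from the book) shows what these cooling designs are trying to drive toward 1.0:

```python
# Sketch of PUE = total facility power / IT equipment power.
# All power figures (kW) below are assumed for illustration only.

def pue(it_kw, cooling_kw, power_distribution_kw):
    """Power Usage Effectiveness for a facility."""
    return (it_kw + cooling_kw + power_distribution_kw) / it_kw

# Conventional raised-floor datacenter with heavy chiller overhead (assumed).
print(f"Conventional facility: PUE ~ {pue(1000, 700, 300):.2f}")   # ~2.00
# Container/free-cooling design with much smaller cooling overhead (assumed).
print(f"Container facility   : PUE ~ {pue(1000, 100, 80):.2f}")    # ~1.18
```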


Finally, the authors highlighted how failures and repairs are dealt with. One key aspect of WSCs is their sheer size. Even with a stellar mean time between failures (MTBF) of, say, 30 years (roughly 10,000 days), a 10,000-node cluster will experience about one failure per day. No amount of hardware reliability can completely obviate the need for an overlying fault-tolerant software infrastructure. The only solution is to design the software layer above the hardware with failure as the default case. The authors then describe the System Health Infrastructure at Google and present interesting data about disk and DRAM reliability and the frequency of machine reboots.
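A quick sanity check of that arithmetic, using the numbers quoted above (30-year MTBF, 10,000 machines, failures treated as independent):

```python
# Failure-rate arithmetic for a large cluster: even excellent per-machine
# reliability still means failures are an everyday event at scale.

mtbf_days = 30 * 365        # ~10,950 days per machine
cluster_size = 10_000       # machines

failures_per_day = cluster_size / mtbf_days
print(f"Expected failures per day: {failures_per_day:.2f}")   # ~0.9

# Probability that at least one machine fails on any given day,
# assuming independent failures at rate 1/mtbf_days per machine.
p_any_failure = 1 - (1 - 1 / mtbf_days) ** cluster_size
print(f"P(at least one failure today): {p_any_failure:.2f}")  # ~0.60
```

So on a typical day something is broken somewhere, which is exactly why the software must treat failure as the normal case rather than the exception.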


Comments/Critiques
The cost model behind the rationale of using low-end servers as the building blocks of a datacenter makes two simplistic assumptions which need not hold for all WSC workloads:
  1. Looking into the details of the TPC-C benchmark (http://www.tpc.org/tpcc/detail.asp), it appears to be a specialized OLTP benchmark which (in the HP ProLiant case) ran on Oracle Database 11g. The price/performance ratio of a strictly ACID-compliant database may not be comparable to that of loosely POSIX-compliant MapReduce/HDFS frameworks and the other interactive query systems that run on these WSCs.
  2. The analysis is based on the assumption of a uniformly distributed global store, which again is highly workload-dependent. There are a variety of workloads (MapReduce being one of them) in which accesses are not uniform across the store: some data blocks/data nodes are more popular than others (see the sketch after this list). Other workloads (like Pregel) can work on graphs that can be partitioned easily.
While a one-size-fits-all philosophy is certainly economical (for Google, Microsoft, Amazon, etc.), I believe that low-end servers are not always the answer for every query workload and access pattern in the smaller clusters operated by a variety of other companies.
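As an illustration of point 2, here is a minimal sketch (my own assumptions, not data from the book) of how a Zipf-like popularity skew undermines the uniform-store model: if the hottest blocks can be cached or replicated locally, a node goes over the network far less often than the uniform assumption predicts.

```python
# Sketch: under Zipf-skewed block popularity, caching a tiny fraction of
# blocks serves a large fraction of accesses. Block count and skew exponent
# are assumed for illustration.

def zipf_weights(num_blocks, s=1.0):
    """Unnormalized Zipf popularity weights for ranks 1..num_blocks."""
    return [1.0 / (rank ** s) for rank in range(1, num_blocks + 1)]

num_blocks = 100_000
weights = zipf_weights(num_blocks)
total = sum(weights)

for cached_fraction in (0.001, 0.01, 0.1):
    cached = int(num_blocks * cached_fraction)
    hit_ratio = sum(weights[:cached]) / total
    print(f"Caching the hottest {cached_fraction:.1%} of blocks "
          f"serves {hit_ratio:.0%} of accesses locally")
```

In this toy model, caching just 1% of blocks already serves roughly 60% of accesses locally, so conclusions drawn from a uniformly accessed store can be quite pessimistic for skewed workloads.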

Another aspect that was missing is an analysis of the failure rate of network components, which are equally important (if not more so). A faulty router that computes wrong header checksums with high probability can cause far more serious disruptions than a faulty node. Such failures are in general even harder to identify, and it would have been interesting to know how frequently they occur.
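A rough blast-radius comparison makes the point; the topology numbers below are assumptions of mine, not figures from the book:

```python
# Sketch: a faulty server affects only itself, while a faulty aggregation
# switch affects every machine whose traffic crosses it. Rack and switch
# fan-out figures are assumed for illustration.

servers_per_rack = 40                  # assumed
racks_per_aggregation_switch = 20      # assumed
cluster_size = 10_000

faulty_server_impact = 1 / cluster_size
faulty_switch_impact = (servers_per_rack * racks_per_aggregation_switch) / cluster_size

print(f"Faulty server      : {faulty_server_impact:.2%} of machines affected")
print(f"Faulty agg. switch : {faulty_switch_impact:.2%} of machines affected")
```

Under these assumptions a single bad switch touches hundreds of times more machines than a single bad server, and it may surface only as elevated retransmits or subtle corruption rather than a clean crash.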
