Monday, September 26, 2011

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

J. Baker, C. Bond, J.C. Corbett, JJ Furman, A. Khorlin, J. Larson, J-M Léon, Y. Li, A. Lloyd, V. Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR 2011 
This paper describes Megastore, a storage system developed and used at Google that combines NoSQL-style scalability with high availability and fairly strong ACID semantics. In my view, the prime focus of the paper is to strike a balance between three inherent tradeoffs: scalability, wide-area replication and ACID semantics. While Brewer's CAP theorem tells us we can get only two out of Consistency, Availability and Partition Tolerance, Megastore successfully shows that we can build real-world systems that strike a good balance between these three properties. The following points very briefly highlight the key contributions of the paper in each of the aforementioned areas:

  1. Scalability: Partitions data into entity groups (each of which can be thought of as a small database) and stores them in Bigtable.
  2. ACID Transactions: Maintains a write-ahead log per entity group to record all writes (see the sketch after this list). Across entity groups, it either performs two-phase commits or passes messages through asynchronous queues.
  3. Wide-Area Replication: It uses Paxos for replicating data (which is kind of cool) and tweaks it for low latency: on average, a write incurs one WAN RTT and reads are served locally, i.e. zero WAN RTTs.
 It also presents a schema language that gives applications full control over the query plan (indices, joins etc.) as well as the placement of data.
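
To make item 2 a bit more concrete, here is a minimal, hypothetical sketch of a per-entity-group write-ahead log with optimistic concurrency. The class and method names are my own, not Megastore's actual API, and the Paxos-based replication of the log is only hinted at in a comment.

```python
# Hypothetical sketch of per-entity-group transactions via a write-ahead log.
# Names (EntityGroup, Transaction, etc.) are illustrative, not Megastore's API.

class ConflictError(Exception):
    pass

class EntityGroup:
    """One entity group: a small database with its own write-ahead log."""
    def __init__(self):
        self.log = []     # write-ahead log: committed mutation batches
        self.data = {}    # current key -> value state

    def last_committed_position(self):
        return len(self.log)

    def commit(self, read_position, mutations):
        # Optimistic concurrency: abort if another transaction committed
        # to this entity group after we took our read snapshot.
        if read_position != self.last_committed_position():
            raise ConflictError("entity group advanced; retry the transaction")
        self.log.append(mutations)    # 1. append to the log (the commit point)
        self.data.update(mutations)   # 2. apply mutations to the store
        # In Megastore, the log append itself would first be agreed on via
        # Paxos across the wide-area replicas.

class Transaction:
    def __init__(self, group):
        self.group = group
        self.read_position = group.last_committed_position()
        self.writes = {}

    def get(self, key):
        return self.writes.get(key, self.group.data.get(key))

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        self.group.commit(self.read_position, self.writes)

# Usage: a transaction confined to a single entity group.
inbox = EntityGroup()
txn = Transaction(inbox)
txn.put("msg:1", "hello")
txn.commit()
```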

Comments/Critiques:

As I highlighted in class, the success of Megastore depends on its applicability to a wide variety of workloads that can be cleanly divided into entity groups without incurring too much real-time cross-group communication overhead. While this works well for emails, blogs, maps etc., it is unclear to me whether partitioning data into entity groups would still be as trivial for (say) complex social graphs. Recent developments like Facebook Ticker or publishing real-time Twitter feeds don't fit too well into the Megastore entity-group model.

Secondly, it is unclear to me whether Megastore made the right tradeoff by choosing consistency over performance. While the paper does paint a fairly rosy picture of query latencies, a write latency of 100-400 ms and a read latency of 10 ms are not negligible, and hiding them requires considerable effort on the part of developers.
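
One common way developers hide such write latency is to overlap the slow write with other request work. A minimal sketch of that pattern; the megastore_write call here is a stand-in I made up, not a real client API:

```python
import asyncio

async def megastore_write(key, value):
    # Stand-in for a Megastore-style write that costs 100-400 ms of WAN latency.
    await asyncio.sleep(0.25)
    return "ok"

async def render_rest_of_page():
    # Work that does not depend on the write's outcome.
    await asyncio.sleep(0.05)
    return "<html>...</html>"

async def handle_request():
    # Kick off the slow write, keep rendering, then join before responding.
    write_task = asyncio.create_task(megastore_write("msg:1", "hello"))
    page = await render_rest_of_page()
    status = await write_task
    return page, status

print(asyncio.run(handle_request()))
```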

Last but not least, while building Megastore on their existing Bigtable-based infrastructure makes sense for Google in terms of manageability, scalability and performance, it is a bit unclear to me whether that is indeed the right way to go. What if Google had used MySQL servers for each set of entity groups? A more general question: are NoSQL datastores the right building blocks for data-parallel computing, or would a collection of RDBMSs do the job equally well?

Monday, September 12, 2011

Solid State Devices (SSDs): Performance Modeling and Analysis of Flash-based Storage Devices

H. Howie Huang, Shan Li, Alex Szalay, Andreas Terzis. Performance Modeling and Analysis of Flash-based Storage Devices. MSST 2011.

This paper models and analyzes the performance of flash-based storage devices (SSDs) using a linear regression tree. The authors compare and analyze three SSDs: the Intel X-25M, the OCZ Apex and a Samsung drive, using a couple of transactional and web-search workloads.
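As a rough illustration of what a regression-tree performance model looks like, here is a sketch using scikit-learn's plain regression tree (not the linear-model-at-the-leaves variant the paper uses); the features and numbers are entirely made up:

```python
# Toy illustration of a regression-tree performance model for an SSD.
# Features and training data are synthetic, not the paper's measurements.
from sklearn.tree import DecisionTreeRegressor

# Each sample: [request_size_kb, read_fraction, queue_depth, random_fraction]
X = [
    [4,   1.0, 1,  1.0],
    [4,   0.0, 1,  1.0],
    [128, 1.0, 32, 0.0],
    [128, 0.0, 32, 0.0],
    [4,   0.5, 8,  0.5],
]
# Target: observed throughput in MB/s (made-up numbers).
y = [25, 8, 250, 180, 40]

model = DecisionTreeRegressor(max_depth=3)
model.fit(X, y)

# Predict throughput for an unseen workload mix.
print(model.predict([[64, 0.7, 16, 0.2]]))
```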

Comments/Critiques:
  1. Future of SSDs: I think flash storage will definitely have a place in future WSCs. However, over the next 10 years, I am unsure whether it will be used as a caching layer between DRAM and disks or will replace disks themselves (much as disks replaced magnetic tapes). An interesting trend to watch will be how the price/GB of DRAM, SSDs and disks varies over time.
  2. MTBF of SSDs: An interesting analysis would be to find the mean time between failures of SSDs, which in turn would affect how widely they are adopted in datacenters.
  3. Fast Reads, Slower Writes: Luiz Barroso mentioned in his talk that poor write performance is one of the key challenges in using SSDs as a storage device. It would be interesting to see whether they find a place as the infrastructure of choice for read-heavy workloads.
  4. Rethinking the Infrastructural Software Stack: If SSDs do replace disks, this would call for a redesign of the underlying file system itself. GFS/HDFS, for instance, currently depend heavily on sequential writes as opposed to random writes, while SSDs are better suited to the latter (see the crude micro-benchmark below).
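
A crude way to see the gap item 4 refers to is to time 4 KiB writes issued sequentially versus at shuffled offsets. This is only an illustrative micro-benchmark of my own, and the results will vary wildly with the device, file system and page cache:

```python
# Crude sequential-vs-random write micro-benchmark (illustrative only).
import os, random, time

SIZE = 64 * 1024 * 1024     # 64 MiB file
BLOCK = 4096                # 4 KiB writes
block = b"x" * BLOCK

def sequential_write(path):
    with open(path, "wb") as f:
        for _ in range(SIZE // BLOCK):
            f.write(block)
        f.flush(); os.fsync(f.fileno())

def random_write(path):
    offsets = [i * BLOCK for i in range(SIZE // BLOCK)]
    random.shuffle(offsets)
    with open(path, "r+b") as f:
        for off in offsets:
            f.seek(off)
            f.write(block)
        f.flush(); os.fsync(f.fileno())

start = time.time(); sequential_write("bench.dat")
print("sequential: %.2fs" % (time.time() - start))
start = time.time(); random_write("bench.dat")
print("random:     %.2fs" % (time.time() - start))
os.remove("bench.dat")
```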


Multicore CPUs: Amdahl's Law in the Multicore Era

Mark D. Hill, Michael R. Marty. Amdahl's Law in the Multicore Era. IEEE Computer, 2008.
This paper analyzes the tradeoff between adding more cores to a single chip and improving single-threaded (sequential) performance. It proposes three alternative multicore chip designs (symmetric, asymmetric and dynamic) and discusses their tradeoffs, implications and impact on future research in chip architecture and software development. In general, asymmetric and dynamic multicores beat symmetric ones because they can handle both the parallel and the serial components of a program well: serial computation runs on the bigger core(s) while the smaller cores execute the parallelizable components.
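
The paper's speedup models are simple enough to play with directly. Below is my rendering of them in code, under the paper's baseline assumption that a core built from r base-core equivalents (BCEs) delivers perf(r) = sqrt(r):

```python
# Hill-Marty speedup models for a chip with n base-core equivalents (BCEs),
# where one "big" core is built from r BCEs and perf(r) = sqrt(r).
# f is the parallelizable fraction of the program.
from math import sqrt

def perf(r):
    return sqrt(r)

def speedup_symmetric(f, n, r):
    # n/r identical cores, each of performance perf(r).
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def speedup_asymmetric(f, n, r):
    # One big core of r BCEs plus (n - r) single-BCE cores.
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def speedup_dynamic(f, n, r):
    # Big core for the serial phase; all n BCEs usable in the parallel phase.
    return 1.0 / ((1 - f) / perf(r) + f / n)

# Example: 256 BCEs, a 16-BCE big core, 95% parallelizable code.
for name, fn in [("symmetric", speedup_symmetric),
                 ("asymmetric", speedup_asymmetric),
                 ("dynamic", speedup_dynamic)]:
    print(name, round(fn(0.95, 256, 16), 1))
```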

Several points came to my mind while reading this article:

  1. Hierarchical clusters (or clusters of clusters): With 1000s of cores on a chip (as predicted by folks from Intel and Berkeley), we may end up with what could be termed a cluster of clusters of cores, with a 2-level infrastructural software stack. This may lead to a trend towards multicore operating systems for better resource arbitration, scheduling and isolation within a multicore chip.
  2. Trends towards GPU/CPUs as Asymmetric Multicores: A combination of a GPU and a CPU can in some sense be thought of as an asymmetric multicore. It would be interesting to see how well the software stack can leverage the underlying hardware.
  3. Workload-Dependent Clusters: It almost feels like there should be a trend towards workload-dependent clusters. A one-size-fits-all philosophy may not hold for workloads that inherently differ in their serial and parallelizable components.
  4. Latency Variation: If asymmetric and dynamic multicores become the building block of future clusters, the homogeneity arguments will no longer hold. This may increase the variance (and decrease the predictability) of program running times. In this light, the multicore software stack will have to be rewritten to embrace heterogeneity.
A question we must try to answer is whether we will see clusters built from these 100+ core chips within 10 years. Barroso/Hölzle's arguments in favor of using commodity servers were primarily based on economies of scale, so the answer will, I believe, depend on how quickly these multicore processors are embraced by the mainstream consumer market.

Wednesday, September 7, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (Part 2)

Luiz André Barroso and Urs Hölzle, Google Inc. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Chapters 3, 4, 7.
Server hardware, the networking fabric and the storage hierarchy are the three key building blocks of Warehouse-Scale Computers (WSCs). In this series of chapters, the authors explain the rationale behind using clusters of low-end servers (or higher-end personal computers) as the computational fabric for WSCs. Based on an analysis of the TPC-C benchmark and the assumption of a uniformly distributed global store, they show that, on one hand, the performance edge of a cluster built from high-end servers deteriorates quickly as the cluster size increases (due to communication costs), while on the other hand, using low-end embedded/PC-class components incurs considerably higher costs for software development and additional networking fabric. Further, smaller servers may lead to low utilization (because small 'bins' cannot fit multiple tasks) and make even embarrassingly parallel programs intrinsically less efficient, because they may still need to consult global information. Hence, clusters of low-end servers are a good middle ground.
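
To get some intuition for the "performance edge erodes with cluster size" argument, here is a toy model of my own with made-up latency numbers; it is not the chapters' TPC-C-based analysis, just the same qualitative effect:

```python
# Toy model of why a high-end SMP node's performance edge shrinks as the
# cluster grows. All latency numbers are made up for illustration.

def avg_access_us(nodes, local_us=0.1, remote_us=100.0):
    """Average latency of an access to a store spread uniformly over the
    cluster: a 1/nodes fraction stays local, the rest crosses the network."""
    local_fraction = 1.0 / nodes
    return local_fraction * local_us + (1.0 - local_fraction) * remote_us

HIGH_END_CORES, LOW_END_CORES = 128, 4   # cores per server

for total_cores in (128, 1024, 8192):
    high = avg_access_us(max(1, total_cores // HIGH_END_CORES))
    low = avg_access_us(total_cores // LOW_END_CORES)
    # The ratio (high-end advantage) collapses as the cluster gets bigger.
    print("total cores %5d: high-end edge = %.1fx" % (total_cores, low / high))
```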


The authors then present an architectural view of the datacenter in terms of its power and cooling requirements. A key aspect highlighted is the redundancy inherent in this design. Since a bulk of the construction cost is proportional to the amount of power delivered and the amount of heat dissipated, Google is experimenting with a variety of solutions such as water-based free cooling (wherein ambient air is drawn past a flow of water, lowering its temperature), glycol-based radiators (using glycol instead of water to prevent freezing in cold climates) and in-rack cooling (adding an air-to-water heat exchanger at the back of the rack so that hot air exiting the rack is cooled immediately). An interesting trend to watch will be the emergence of container-based datacenters, which integrate heat exchange and power distribution into the container itself and achieve very high energy efficiency.


Finally, the authors highlight how failures and repairs are dealt with. One key aspect of WSCs is their sheer size: even with a stellar per-machine mean time between failures (MTBF) of, say, 30 years (~10,000 days), a 10,000-node cluster will experience about one failure per day. No amount of hardware reliability can obviate the need for fault-tolerant infrastructural software on top; the only real solution is to design that software layer with failure as the default case. The chapters then describe the System Health Infrastructure at Google and present interesting data about disk and DRAM reliability and the frequency of machine reboots.
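
The arithmetic behind that one-failure-per-day figure, spelled out:

```python
# Back-of-the-envelope: expected machine failures per day in a large cluster.
machines = 10000
mtbf_days = 10000          # ~30-year mean time between failures per machine
failures_per_day = machines / mtbf_days
print(failures_per_day)    # -> 1.0
```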


Comments/Critiques
The cost model behind the rationale of using low-end servers as the building blocks of the datacenter makes two simplistic assumptions which need not hold for all WSC workloads:
  1. Looking into the details of the TPC-C benchmark (http://www.tpc.org/tpcc/detail.asp), it appears to be a specialized OLTP benchmark which (in HP ProLiant's case) ran on Oracle Database 11g. The price/performance ratio of a strictly ACID-compliant database may not be comparable to that of loosely POSIX-compliant MapReduce/HDFS frameworks and the other interactive query systems that run on these WSCs.
  2. The analysis is based on the assumption of a uniformly distributed global store, which again is highly workload-dependent. There are a variety of workloads (MapReduce being one of them) in which accesses are not uniform across the store: some data blocks/data nodes are more popular than others. Other workloads (like Pregel) can potentially work on graphs that can be easily partitioned.
While a one-size-fits-all philosophy is definitely economical (for Google, Microsoft, Amazon etc.), I believe that low-end servers are not always the answer for all possible query workloads and access patterns in the smaller clusters operated by a variety of other companies.

Another aspect that was missing is an analysis of the failure rate of network components, which are equally important (if not more so). A faulty router that computes wrong header checksums with high probability can potentially cause far more serious disruptions than a faulty node. Such failures are in general even harder to identify, and it would have been interesting to know how frequently they occur.