Monday, November 7, 2011

VL2: A Scalable and Flexible Data Center Network

A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," ACM SIGCOMM 2009, (August 2009).

This paper presented VL2 (Virtual Layer 2), a data center network architecture with the following key features:

  1. Flat addressing to allow service instances to be placed anywhere in the network.
  2. Valiant load balancing (VLB) to spread traffic across randomized paths.
  3. End-system-based address resolution.
The authors established high utilization and agility as the foremost goals in their design. Current datacenters suffer from over-subscription of resources, and nothing prevents a traffic flood from one service from affecting others. Moreover, IP addresses and VLANs are used to achieve scale, which makes things like VM migration extremely complex and introduces broadcasts.

The authors did an extensive analysis of the traffic patterns in datacenters. However, they found that there is no particular pattern in datacenter traffic that could be exploited in the design of the routing protocol. So, like PortLand, they decided to exploit the topology instead. They assume a Clos network: a rooted tree in which the links between the intermediate switches and the aggregation switches form a bipartite graph. The underlying idea of VL2 addressing and routing is to use two kinds of addresses: application-specific IP addresses (AAs) and location-specific IP addresses (LAs). The VL2 agent at each server traps packets from the host and encapsulates each packet with the LA of the destination's ToR switch; the agent learns the AA-to-LA mapping from a directory lookup service. Further, Valiant load balancing (VLB) spreads flows over randomized paths.
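To make the encapsulation flow concrete, here is a minimal Python sketch, assuming hypothetical names (DirectoryService, Vl2Agent) and an in-memory directory; the real VL2 agent sits in the server's networking stack and the directory system is a replicated, distributed lookup service:

```python
import random

class DirectoryService:
    """Toy stand-in for VL2's directory system: maps an application
    address (AA) to the locator address (LA) of the destination's ToR switch."""
    def __init__(self, aa_to_la):
        self.aa_to_la = dict(aa_to_la)

    def lookup(self, aa):
        return self.aa_to_la[aa]

class Vl2Agent:
    """Toy VL2 agent: traps an outgoing packet addressed to an AA,
    resolves the destination ToR's LA, and encapsulates the packet.
    VLB is approximated by bouncing each flow off a randomly chosen
    intermediate switch."""
    def __init__(self, directory, intermediate_las):
        self.directory = directory
        self.intermediate_las = intermediate_las

    def send(self, src_aa, dst_aa, payload):
        dst_tor_la = self.directory.lookup(dst_aa)        # AA -> LA of the destination ToR
        bounce_la = random.choice(self.intermediate_las)  # VLB: pick a random intermediate switch
        # Outer headers live in the LA space and wrap the original AA-addressed packet.
        return {"outer_dst": bounce_la,
                "inner": {"outer_dst": dst_tor_la,
                          "packet": {"src": src_aa, "dst": dst_aa, "data": payload}}}

directory = DirectoryService({"10.0.0.5": "20.1.1.1"})
agent = Vl2Agent(directory, intermediate_las=["30.0.0.1", "30.0.0.2"])
print(agent.send("10.0.0.9", "10.0.0.5", b"hello"))
```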

Critique
Overall, I think the main strength of the paper lies in the fact that all the observations are supported by large-scale experiments and data. This paper is also based on a Clos topology, which again raises the issue of how apt it is to design a generic protocol around a given topology. Another issue with VLB is path stretch; however, I was not really convinced when the authors said that this is not much of a problem due to the environment (propagation delay is small) and the underlying redundant topology. It would be really interesting to discuss the aspects of PortLand and VL2 in class and the various design decisions that could be cross-implemented to improve both protocols. I would recommend reading the excellent article "It's Microsoft vs. the professors with competing data center architectures" and Prof. Amin Vahdat's blog post that compares PortLand and VL2.

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, A. Vahdat, "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric", ACM SIGCOMM, (August 2009).
This paper presented PortLand, a Layer 2 Network Fabric for modern Data Centers. The authors laid down the following requirements in order to achieve their goal:
  1. Permit VM migration to any physical machine without needing to change the VM's IP address.
  2. Plug and Play Switches.
  3. Efficient Communication.
  4. No forwarding loops.
  5. Rapid and efficient failure detection.
It is interesting to note that, strictly speaking, current protocols at either layer 2 or layer 3 cannot provide all of these functionalities. While Req 1 can be tackled at layer 3 (IP), layer 3 fails to provide plug-and-play functionality for switches. Req 4 can be tackled at both layer 2 and layer 3 (by the spanning tree protocol or TTL, respectively). However, Req 5 is not met by either layer, since the routing protocols (IS-IS/OSPF) are broadcast based!

As a result, the authors proposed PortLand, built on the assumption of a fat-tree network topology, with the following components:

  1. A lightweight protocol that enables switches to discover their position in the topology. It involves a separate centralized system known as the Fabric Manager, a user process on a dedicated machine responsible for ARP resolution, fault tolerance and multicast. It maintains soft state about the network topology, including IP-to-PMAC mappings.
  2. A 48-bit pseudo MAC (PMAC) of the form pod.position.port.vmid, which encodes a machine's position in the topology (a toy encoding is sketched after this list). Any switch can query the fabric manager for a PMAC and hence learn the exact location of the destination.
  3. A Location Discovery Protocol (LDP) that enables switches to determine their exact position in the topology.
  4. Support for ARP, multicast and broadcast.
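Below is a toy sketch of how a 48-bit pod.position.port.vmid PMAC could be packed and unpacked. The 16/8/8/16-bit field widths follow the paper as I recall them, and the helper functions themselves are purely illustrative:

```python
def encode_pmac(pod, position, port, vmid):
    """Pack pod.position.port.vmid into a 48-bit pseudo MAC (PMAC).
    Assumed field widths: 16 bits pod, 8 bits position, 8 bits port, 16 bits vmid."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    """Recover (pod, position, port, vmid) from a 48-bit PMAC."""
    return (pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF, (pmac >> 16) & 0xFF, pmac & 0xFFFF

pmac = encode_pmac(pod=3, position=1, port=7, vmid=42)
print(":".join(f"{(pmac >> s) & 0xFF:02x}" for s in range(40, -8, -8)))  # MAC-style rendering
print(decode_pmac(pmac))  # (3, 1, 7, 42)
```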

Monday, October 31, 2011

The Next Generation of Apache Hadoop MapReduce

Arun C Murthy. The Next Generation of Apache Hadoop MapReduce.
This article presents the design of Hadoop NextGen, which tries to scale Hadoop MR clusters beyond 4,000 nodes and allow a variety of frameworks to share the same cluster. While the gist of the article and many of its design principles share key aspects with Mesos (support for multiple frameworks) and even Dryad (a per-job scheduler as opposed to a single JobTracker for all jobs), the fact that it is an open-source project built as a real system makes me think it will have a considerable impact in the future.

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. NSDI 2011.
This paper presents Mesos, a platform for sharing resources among various cloud computing frameworks in a fine-grained manner. It introduces a distributed two-level scheduling mechanism in which the central scheduler (the Mesos master) decides how many resources to offer to each framework, while the individual frameworks themselves decide which resources to accept. While not globally optimal, the authors claim that the system works very well in practice. To be Mesos-compatible, a framework just needs to implement a scheduler that registers with the Mesos master and an executor process that is launched on slave nodes to run the framework's tasks. Mesos uses resource filters for efficiency, kills long-running tasks (if needed) to prevent starvation, and uses Linux containers for CPU/memory isolation. The paper features an impressive evaluation section, and the authors show that Mesos can seamlessly scale up to 50,000 virtual nodes on Amazon EC2.
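As a rough illustration of the two-level mechanism (offers flow down from the master, accept/decline decisions flow back up from the framework scheduler), here is a non-authoritative Python sketch; the class and method names (MesosMasterSketch, resource_offer, etc.) are invented for illustration, and the real Mesos API is an asynchronous, callback-based protocol:

```python
class MesosMasterSketch:
    """Toy master: hands the currently free resources, one framework at a
    time, to registered framework schedulers as offers."""
    def __init__(self, free_slots):
        self.free_slots = free_slots          # e.g. [{"node": "n1", "cpus": 4}, ...]
        self.frameworks = []

    def register(self, framework):
        self.frameworks.append(framework)

    def allocate(self):
        for framework in self.frameworks:
            offered = list(self.free_slots)
            accepted = framework.resource_offer(offered)   # second level of scheduling
            for slot in accepted:
                self.free_slots.remove(slot)

class SimpleFramework:
    """Toy framework scheduler: accepts offers only on nodes where it has
    local data (a crude stand-in for locality preferences)."""
    def __init__(self, name, preferred_nodes):
        self.name = name
        self.preferred_nodes = preferred_nodes

    def resource_offer(self, offers):
        return [o for o in offers if o["node"] in self.preferred_nodes]

master = MesosMasterSketch([{"node": "n1", "cpus": 4}, {"node": "n2", "cpus": 4}])
master.register(SimpleFramework("hadoop", preferred_nodes={"n1"}))
master.allocate()
print(master.free_slots)   # n2's resources remain free for other frameworks
```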


Comments/Critiques:


The paper is very well written and tackles a relevant problem. The authors designed and implemented a real system and took it all the way through (Mesos is now part of the Apache Incubator and is deployed at Twitter and on the 40-node RADLab r-cluster). However, I had a few comments on some aspects of the paper:


  1. Granularity of Scheduling: Mesos presents a per-framework scheduler and assumes each framework does its own scheduling. However, it was unclear to me whether that is the right granularity. How about a per-job scheduler within each framework? Though this has other issues (re-writing frameworks, etc.), it would go a long way toward ensuring per-framework scalability and even preventing potential conflicts due to two-level scheduling.
  2. Delay Scheduling in Hadoop: I think it was interesting that delay scheduling was brought up in this paper; however, I was disappointed that the paper just presented some locality improvements in Hadoop that were in no way related to Mesos. I was more interested in the implications of having two levels of scheduling policies. For example, I was curious whether delay scheduling might work better if the framework waited a bit before accepting resource offers.
  3. Task Resource/Duration Prediction: The paper assumes that each framework can specify its resource requirements exactly. While such an assumption is already made by MR frameworks (by having one slot per task), it is unclear to me whether it is always possible at a very generic level.

Monday, October 24, 2011

Database Scalability, Elasticity, and Autonomy in the Cloud

Agrawal, D., Abbadi, A. E., Das, S., and Elmore, A. J. (2011). Database scalability, elasticity, and autonomy in the cloud - (extended abstract). In Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011.
This paper, like the previous one, also discusses the challenges of designing a 'Database-as-a-Service' model in the cloud. However, unlike Relational Cloud, it discusses the pros and cons of each design choice and analyzes their inherent trade-offs for a non-relational DBMS. The paper highlights and attempts to tackle two key challenges:
  1. Traditional DBMSs cannot be scaled very easily. This is because in traditional DBMSs the database is always treated as a "whole" and the DBMS tries to guarantee the consistency of the entire data. This is in contrast with the popular key-value stores, in which each key-value pair is treated as an independent unit and consistency is guaranteed for each individual key. However, the authors argue that neither of these two approaches is, in the strict sense, sufficient: applications generally require consistency across a group of key-value pairs, on a per-application and even time-varying basis. Such a system can be implemented either by adding multi-key atomicity to key-value stores (data fusion) or by partitioning a traditional DBMS into 'entity groups' (data fission).
  2. Traditional DBMSs are statically provisioned. Since traditional DBMSs were always intended for enterprise infrastructure, they are statically provisioned; a query plan, in some sense, binds early to resources. The lack of automatic, workload-dependent management not only prevents queries from scaling out or in depending on the load, it also prevents individual DBMSs from sharing physical resources optimally. To this end, similar to Relational Cloud, the authors present a load balancer with both static and dynamic components, along with Zephyr, a technique for live migration of database instances.
Comments/Critiques:

It was very interesting to see that both this paper and the Relational Cloud paper identify similar challenges to implementing DBaaS solutions. I liked reading this paper mainly because it explained and analyzed various trade-offs in a neutral fashion and even helped me understand the Relational Cloud paper better (e.g., the data-fission/partitioning arguments). In general, and in contrast to the Relational Cloud paper, this paper presents an interesting case for implementing non-relational DBMSs in the cloud, because they scale well and most applications don't require strong consistency semantics anyway. However, a key question to answer is: who are the customers of this pay-per-use DBaaS model? Are they small enterprises/websites or fairly large enterprises? If the former, do we really need to worry about scaling individual DBMSs, given that it is highly likely that their databases would easily fit in the same datacenter?

Relational Cloud: A Database-as-a-Service for the Cloud

C. Curino, E. Jones, R. Popa, N. Malviya, E. Wu, S. Madden, H. Balakrishnan, and N. Zeldovich. Relational Cloud: A Database Service for the Cloud. In CIDR, pages 235–240, 2011.
This paper introduces a transactional 'Database-as-a-Service' (DBaaS) model called Relational Cloud. The paper highlights and attempts to tackle three important challenges:
  1. Efficient Multi-Tenancy: The goal here is to efficiently 'pack' as many databases as possible onto a given set of physical machines while meeting their query performance SLAs. Relational Cloud tries to identify the best way to map a given set of databases and workloads to a set of physical machines. The authors point out that just using a DB-in-VM strategy doesn't scale too well because each database ends up having its own buffer pool, cache, log, etc., which compete (and hence conflict) for the same physical resources. Relational Cloud consists of a monitoring and consolidation engine called Kairos, which has a resource monitor (that monitors resource usage such as memory and disk access), a combined load predictor (which models the CPU, RAM and disk usage of consolidated workloads) and a consolidation engine (which uses non-linear optimization techniques to figure out the mapping between workloads and physical machines).
  2. Elastic Scalability: The goal here is to enable the DBMS to efficiently scale out when the workload exceeds the capacity of a single machine. Relational Cloud uses graph partitioning algorithms to automatically analyze complex query workloads and map data items to physical nodes so as to minimize multi-node transactions. At a high level, it tries to partition data so that tuples that are frequently accessed together are stored on the same physical node. The algorithm creates a graph with tuples as nodes, where edge weights signify the frequency with which two tuples are accessed together (a toy version of this construction is sketched after this list). To scale this graph representation, it uses various sampling/exclusion heuristics.
  3. Database Privacy: Since the database is hosted in the cloud, Relational Cloud uses CryptDB, which introduces the interesting concept of adjustable security. CryptDB employs different encryption levels (RND, DET, OPE or HOM) for different types of operations on the data. This enables queries to be evaluated over encrypted data, with results sent back to the client for final decryption.
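To make the co-access graph in point 2 concrete, here is a toy sketch that builds the graph from a transaction log and scores candidate placements by the co-access weight they cut (a proxy for how many transactions become multi-node). The real system uses a proper graph partitioner plus the sampling/exclusion heuristics mentioned above, so treat this only as an illustration:

```python
from collections import defaultdict
from itertools import combinations

def build_coaccess_graph(transactions):
    """Nodes are tuples (rows); edge weight = number of transactions in
    which the two tuples are accessed together."""
    weights = defaultdict(int)
    for txn in transactions:
        for a, b in combinations(sorted(set(txn)), 2):
            weights[(a, b)] += 1
    return weights

def cross_partition_weight(weights, placement):
    """Total co-access weight crossing partitions under a given placement;
    lower means fewer distributed transactions."""
    return sum(w for (a, b), w in weights.items() if placement[a] != placement[b])

txns = [["u1", "u2"], ["u1", "u2"], ["u3", "u4"], ["u3", "u4"], ["u2", "u3"]]
graph = build_coaccess_graph(txns)

good = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}   # keeps frequently co-accessed pairs together
bad  = {"u1": 0, "u2": 1, "u3": 0, "u4": 1}
print(cross_partition_weight(graph, good))     # 1 (only the u2-u3 edge crosses)
print(cross_partition_weight(graph, bad))      # 5 (every edge crosses)
```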

Comments/Critiques

This being more of a high-level paper, there was a lot of secret sauce in the individual components that I would have liked to see more of. For example, it is unclear to me what the tuple graph would look like for OLAP workloads (as opposed to the OLTP/Web workloads described in the paper) and whether OLAP workloads can be as easily partitioned across physical nodes as their OLTP counterparts. As far as Kairos is concerned, the key component there is the combined load predictor, which models consolidated CPU, RAM and disk usage for multiple workloads; it is unclear to me whether it can predict sudden spikes or other non-historical workload characteristics. The key piece I found missing from the paper was a section on the effects of consolidation on individual query latency. As highlighted in the second paper that we read for this class, traditional DBMSs bind queries based on a static resource assumption. If this is true, the effect of Kairos mis-predictions on individual query latency is quite unclear from this paper.

Overall, I think Relational Cloud is a great initiative towards the DBaaS model and is a much better fit in its approach than its counterparts (Amazon RDS and Microsoft SQL Azure). However, the question I am unclear about is whether we really need a relational DBaaS model in the cloud as opposed to weak-consistency/entity-group based models.


Monday, October 17, 2011

PNUTS: Yahoo!'s Hosted Data Serving Platform

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. In PVLDB 2008.
PNUTS is a parallel, geographically distributed database that supports Yahoo!'s web applications. It chooses high availability, low response times and scalability over strict consistency. The authors observe that web applications typically manipulate one record at a time, while different records may have activity in different geographic localities. To this end, they provide a per-record timeline consistency model, in which all replicas of a given record apply all updates made to it in the same order. This is achieved by designating a per-record master that orders all updates to that record. Another key component of the design is the pub-sub message system used for record-level asynchronous geographic replication, which relies on a (message-broker based) guaranteed message delivery service rather than a persistent log. Lastly, PNUTS trusts the application designers to do the right thing in terms of their consistency needs by providing support for hashed and ordered table organizations and, more importantly, API calls like read-any, read-latest and read-critical (version specific).
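As a quick illustration of how an application might lean on these calls, here is a hedged pseudo-client. The method names mirror the API calls listed in the paper, but the client class, its versioning scheme and the in-memory 'replicas' are invented for illustration:

```python
class PnutsClientSketch:
    """Illustrative client: each record carries a version number, and a
    per-record master orders all writes (per-record timeline consistency)."""
    def __init__(self, replicas):
        self.replicas = replicas                 # region -> {key: (version, value)}

    def read_any(self, key):
        """Fast path: possibly-stale read from any nearby replica."""
        region = next(iter(self.replicas))
        return self.replicas[region][key]

    def read_critical(self, key, required_version):
        """Return a copy at least as new as required_version."""
        for store in self.replicas.values():
            version, value = store[key]
            if version >= required_version:
                return version, value
        raise RuntimeError("no sufficiently fresh replica reachable")

    def read_latest(self, key):
        """Slow path: fetch the newest copy (in reality, from the record's master region)."""
        return max((store[key] for store in self.replicas.values()), key=lambda vv: vv[0])

client = PnutsClientSketch({"west": {"profile:42": (3, "avatar-v3")},
                            "east": {"profile:42": (2, "avatar-v2")}})
print(client.read_any("profile:42"))                           # may be stale
print(client.read_critical("profile:42", required_version=3))  # at least version 3
print(client.read_latest("profile:42"))                        # definitely the newest
```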

Comments/Critiques:

PNUTS is an interesting design point in this space due to its simplicity in trusting the application designers to do the right thing. This feels, in some sense, like the opposite of Google's system design philosophy (which always puts 'making the job of the application designer easy' as the top priority). However, it remains to be seen whether application designers can write their applications by properly leveraging the trade-offs (e.g., where should they use read-latest vs. read-critical?). Secondly, it is unclear to me whether the lack of support for referential integrity and complex queries (with joins, etc.) is a major drawback for continuously evolving web applications. In all, though, I think PNUTS is definitely an interesting design, and things like per-record timeline consistency will find their way into many future systems.

Monday, September 26, 2011

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

J. Baker, C. Bond, J.C. Corbett, JJ Furman, A. Khorlin, J. Larson, J-M Léon, Y. Li, A. Lloyd, V. Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR 2011 
This paper talks about Megastore, a storage framework developed and used at Google that provides NoSQL-style scalability, high availability and fairly strong ACID semantics. In my view, the prime focus of the paper is to strike a balance among three inherently conflicting goals: scalability, wide-area replication and ACID semantics. While Brewer's CAP theorem tells us we can get only two out of Consistency, Availability and Partition tolerance, Megastore highlights that we can build real-world systems that strike a good balance among these properties. The following points very briefly highlight the key contributions of this paper in each of the aforementioned areas:

  1. Scalability: Partitions data into entity groups (which can be thought of as a small database) and stores them in Bigtable.
  2. ACID Transactions: Maintains a write-ahead log per entity group to track all writes. Further, it uses two-phase commit or asynchronous message queues to coordinate across entity groups.
  3. Wide-Area Replication: It uses Paxos for replicating data (which is kind of cool), tweaked for low latency: on average, it incurs an overhead of one WAN RTT for writes and zero WAN RTTs (i.e., local reads) for reads.
It also presents a schema language that gives applications full control over the query plan (indices, joins, etc.) as well as the placement of data.
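Here is a minimal, non-authoritative sketch of the per-entity-group commit path and the asynchronous cross-group queue. In the real system the log append is a Paxos round replicated across datacenters; all names below are hypothetical:

```python
class EntityGroupSketch:
    """Toy entity group: a small database with its own write-ahead log.
    ACID transactions are serialized through the log; cross-group work
    goes through an asynchronous queue instead of a synchronous commit."""
    def __init__(self, name):
        self.name = name
        self.log = []       # stands in for the Paxos-replicated write-ahead log
        self.data = {}
        self.inbox = []     # asynchronous inter-group message queue

    def commit(self, mutations):
        position = len(self.log)      # log position this transaction claims
        self.log.append(mutations)    # in Megastore this append is a Paxos round
        for key, value in mutations.items():
            self.data[key] = value
        return position

    def send_async(self, other, message):
        other.inbox.append((self.name, message))   # applied later by the receiver

user_group = EntityGroupSketch("user:alice")
album_group = EntityGroupSketch("album:vacation")
user_group.commit({"profile.name": "Alice", "profile.quota": 100})
user_group.send_async(album_group, {"notify": "new profile photo"})
print(user_group.log, album_group.inbox)
```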

Comments/Critiques:

As I highlighted in class, the success of Megastore depends on its wide applicability to a variety of workloads that can be cleanly divided into entity groups without incurring too much real-time cross-group communication overhead. While this does work for email, blogs, maps, etc., it is unclear to me whether partitioning data into entity groups would be as straightforward for (say) complex social graphs. Recent developments like Facebook Ticker or publishing real-time Twitter feeds don't fit too well into the Megastore entity-group model.

Secondly, it is unclear to me whether Megastore made the right tradeoff by choosing consistency over performance. While the paper does try to paint a very rosy picture in terms of query latencies, a write latency of 100-400 ms and a read latency of 10 ms are not negligible, and hiding them requires considerable effort on the part of developers.

Last but not least, while building Megastore on its existing Bigtable-based infrastructure does make sense for Google in terms of management, scalability and performance, it is a bit unclear to me whether that is indeed the right way to go. What if Google had used MySQL servers for each set of entity groups? A more general question: are NoSQL data stores the right building blocks for data-parallel computing, or would a collection of RDBMSs do the job just as well?

Monday, September 12, 2011

Solid State Devices (SSDs): Performance Modeling and Analysis of Flash-based Storage Devices

H. Howie Huang, Shan Li, Alex Szalay, Andreas Terzis. Performance Modeling and Analysis of Flash-based Storage Devices. MSST 2011.

This paper models and analyzes the performance of flash-based storage devices (SSDs) using a linear regression tree. The authors compare and analyze three SSDs (an Intel X-25M, an OCZ Apex and a Samsung drive) using a couple of transactional and web-search workloads.

Comments/Critiques:
  1. Future of SSDs: I think flash storage will definitely have a place in future WSCs. However, over the next 10 years, I am unsure whether SSDs will be used as a caching layer between DRAM and disks or will replace disks themselves (the way disks replaced magnetic tapes). An interesting trend to watch will be how the price/GB of DRAM, SSDs and disks varies over time.
  2. MTBF of SSDs: An interesting analysis would be to find out the mean time between failures of SSDs, which in turn would affect their wide adoption in datacenters.
  3. Fast Reads, Slower Writes: Luiz Barroso mentioned in his talk that poor write performance is one of the key challenges in using SSDs as a storage device. It would be interesting to see whether they find a place as the infrastructure of choice for read-heavy workloads.
  4. Rethinking the Infrastructural Software Stack: If SSDs do replace disks, this would call for a redesign of the underlying file system itself. Currently, GFS/HDFS, for instance, heavily depend on sequential writes as opposed to random writes, while SSDs are better suited for the latter.


Multicore CPUs: Amdahl's Law in the Multicore Era

Mark D. Hill, Michael R. Marty. Multicore CPUs: Amdahl's Law in the Multicore Era.
This paper analyzes the tradeoffs between adding more cores to a single chip and increasing single-thread performance. It proposes three alternative multicore chip designs (symmetric, asymmetric and dynamic) and discusses their tradeoffs, implications and impact on future research in chip architecture and software development. In general, asymmetric and dynamic multicores are better than symmetric ones because they can handle both the parallel and the serial components of a program well: serial computation is performed on the bigger core(s), while the smaller cores are responsible for executing the parallelizable code.
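For reference, the speedup models from the paper (as I recall them), for a chip budget of n base-core equivalents (BCEs), a core built from r BCEs with sequential performance perf(r) ≈ sqrt(r), and a parallelizable fraction f:

```latex
\begin{align*}
\text{Symmetric: }  & \mathrm{Speedup}_{\mathrm{sym}}(f,n,r)
  = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f \cdot r}{\mathrm{perf}(r)\, n}} \\[6pt]
\text{Asymmetric: } & \mathrm{Speedup}_{\mathrm{asym}}(f,n,r)
  = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}} \\[6pt]
\text{Dynamic: }    & \mathrm{Speedup}_{\mathrm{dyn}}(f,n,r)
  = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{n}}
\end{align*}
```

The asymmetric and dynamic designs win because the serial term (1-f) is always served by the large core, while the parallel term is served by many small cores (asymmetric) or by all n BCEs (dynamic).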

Several points came to my mind while reading this article:

  1. Hierarchical Clusters (or clusters of clusters): 1000s of cores on a chip (as predicted by folks from Intel and Berkeley) might result in what could be termed a cluster of clusters of cores, with a two-level infrastructural software stack. This may lead to a trend towards multicore operating systems for doing better resource arbitration, scheduling and isolation within a multicore chip.
  2. Trends towards GPU/CPUs as Asymmetric Multicores: A combination of a GPU and CPU in some sense can be thought of as an asymmetric multicore. It would be interesting to see how well the software stack can leverage the underlying hardware.
  3. Workload-Dependent Clusters: It almost feels like there should be a trend towards workload-dependent clusters. A one-size-fits-all philosophy might not hold true for different workloads, which inherently differ in their serial and parallelizable components.
  4. Latency Variation: If asymmetric and dynamic multicores are popularized as the building blocks of future clusters, the homogeneity argument will no longer hold. This might lead to increased variance (and decreased predictability) in program running times. In this light, the new multicore software stack will have to be re-written to embrace heterogeneity.
A question we must try to answer is whether we will see clusters of these 100+ core machines in 10 years. Barroso/Hölzle's arguments in favor of using commodity servers were primarily based on economies of scale, so the answer would, I believe, depend on how quickly these multicore processors are embraced by the mainstream consumer market.

Wednesday, September 7, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (Part 2)

Luiz André Barroso and Urs Hölzle Google Inc. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Chapters 3,4,7
Server hardware, networking fabric and the storage hierarchy are the three key building blocks of Warehouse Scale Computers (WSCs). In this series of chapters, the authors explained the rationale behind using clusters of low-end servers (or higher-end personal computers) as the computational fabric for WSCs. Based on an analysis of the TPC-C benchmark and the assumption of a uniformly distributed global store, they showed that, on one hand, the performance edge of a cluster based on high-end servers deteriorates quickly as the cluster size increases (due to communication costs), while on the other hand, using low-end embedded/PC-class components results in considerably higher costs for software development and additional networking fabric. Further, very small servers may lead to low utilization (due to the inability of small 'bins' to fit multiple tasks) and make even embarrassingly parallel programs intrinsically less efficient when they need to check global information. Hence, clusters of low-end servers are a good middle ground.


The authors further presented an architectural view of the datacenter in terms of its power and cooling requirements. A key aspect that was highlighted was the redundancy inherent in this design. Since a bulk of the construction costs is proportional to the amount of power delivered and the amount of heat dissipated, Google is experimenting with a variety of solutions such as water-based free cooling (wherein ambient air is drawn past a flow of water, lowering its temperature), glycol-based radiators (using glycol instead of water to prevent freezing in cold places) and in-rack cooling (adding an air-to-water exchanger at the back of the rack so that hot air exiting the rack is immediately cooled). An interesting trend to look out for will be the emergence of container-based datacenters, which integrate heat exchange and power distribution in the container itself, achieving extremely high energy efficiency.


Finally, the authors highlighted how failures and repairs are dealt with. One key aspect of WSCs is their size: even with a stellar mean time between failures (MTBF) of, say, 30 years (roughly 10,000 days), a 10,000-node cluster will experience about one failure per day. No amount of hardware reliability can completely obviate the need for fault-tolerant infrastructural software on top; the only solution is to design the cluster-level software with the assumption that failure is the default case. The chapter then describes the System Health Infrastructure at Google and presents interesting data about disk and DRAM reliability and the frequency of machine reboots.
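Spelling out the back-of-the-envelope arithmetic behind that claim (assuming independent failures):

```latex
\[
\text{failures per day} \approx \frac{N}{\mathrm{MTBF}}
  = \frac{10000\ \text{machines}}{30\ \text{years} \times 365\ \text{days/year}}
  \approx \frac{10000}{10950} \approx 1
\]
```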


Comments/Critiques
The cost model behind the rationale of using low-end servers as the building blocks of a datacenter makes two simplistic assumptions which need not hold for all WSC workloads:
  1. Looking into the details of TPC-C benchmark (http://www.tpc.org/tpcc/detail.asp), it seems like it is a specialized OLTP benchmark which (in HP ProLiant's case) ran on Oracle Database 11g. The price/performance ratio for a strict ACID compliant database may not be comparable to loosely POSIX-compliant MapReduce/HDFS frameworks and other Interactive Query Systems which run on these WSCs. 
  2. The analysis is based on an assumption of a uniformly distributed global store, which again is highly workload dependent. There are a variety of workloads (MapReduce being one of them) in which accesses are not uniform across the store: some data blocks/data nodes are more popular than others. Other workloads (like Pregel) can potentially work on graphs which can be easily partitioned.
While a one-size-fits-all philosophy is definitely economical (for Google, Microsoft, Amazon, etc.), I believe that low-end servers are not always the answer for all possible query workloads and access patterns in the smaller clusters operated by a variety of other companies.

Another aspect that was missing was an analysis of the failure rate of network components, which are equally (if not more) important. A faulty router that computes wrong header checksums with high probability can potentially cause far more serious disruptions than a faulty node. Such failures are in general even harder to identify, and it would have been interesting to know how frequently they occur.

Tuesday, August 30, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (Part 1)

Luiz André Barroso and Urs Hölzle Google Inc. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.
The emergence of popular Internet services and the availability of high-speed connectivity have accelerated a trend towards server-side, or cloud, computing. This paradigm is primarily driven by ease of management, ubiquity of access, faster application development and the economies of scale of a shared environment, which ultimately allow many application services to run at a low cost per user. The trend towards server-side computing has created a new class of computing systems which the authors name warehouse-scale computers (WSCs). These warehouse-scale computers operate at a massive scale of tens of thousands of machines and power the services offered by companies like Google, Microsoft, Amazon and Yahoo. However, they differ from traditional datacenters in that they are generally owned by a single organization, use a relatively homogeneous hardware and software platform, and share a common systems management layer.

In the first chapter, the authors highlight the tradeoffs and architectural decisions made by Google while setting up their fleet of WSCs in terms of storage (GFS instead of NAS), networking fabric (spending more on the fabric interconnect by building "fat tree" Clos networks as opposed to investing in high-end switches), the storage hierarchy (exploiting L1/L2 caches, setting up a local/rack/datacenter DRAM and disk hierarchy, and exposing it to application designers) and power usage (taking steps towards power-proportional clusters).

The second chapter gave us a sneak preview into a variety of Google's query workloads: Index Generation, Search and MapReduce based similarity detection in scholarly articles. Due to the nature of these workloads, the entire software stack was designed to achieve high performance and availability (by replication, sharding, load balancing), sometimes at the cost of strong consistency. Overall, the stack was divided into 3 layers:
  1. Platform-level Software: The firmware/kernel running on individual machines.
  2. Cluster-level Software: The distributed system software that manage resources/services on a cluster level. (eg. MapReduce, BigTable, Dryad etc.)
  3. Application-level Software: The software that implements a particular service like GMail, Search etc.
Finally, the authors highlighted the importance of debugging tools in datacenters and briefly compared black-box monitoring tools (like WAP5 and Sherlock) with instrumentation-based tracing schemes (like X-Trace). Due to the greater accuracy of the latter, Google developed Dapper, a lightweight annotation-based tracing tool.

Above the Clouds: A Berkeley View of Cloud Computing

Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia. Above the Clouds: A Berkeley View of Cloud Computing.
The authors identified three properties that give Cloud Computing its appeal: short-term usage (which allows resources to be scaled up and down based on demand), no up-front cost, and (an illusion of) infinite capacity on demand. Due to the inherent elasticity and flexibility that cloud computing provides, any application that needs a model of computation, storage and/or communication fits well in this paradigm.

A key decision for any application provider is whether to use the cloud on a pay-per-use basis or to incur the one-time cost of setting up a whole datacenter. This paper highlighted the reasoning and rationale that should go into those decisions, in terms of long-term economic benefits and being risk-averse in the face of load spikes.

Finally, the paper highlighted some key obstacles to the growth of cloud computing, and paired each of them with opportunities that may result:
  1. Availability of the Service: Prevention against DDoS attacks, large scale failures.
  2. Data Lock-In: Dependence on a particular provider's APIs makes it hard to extract data and move to another provider.
  3. Data Confidentiality: Making sure that the data is secure.
  4. Data Transfer Bottlenecks: Transferring user data to the datacenter economically.
  5. Performance Unpredictability: Achieving disk I/O performance isolation even though the resources are shared.
  6. Scalable Storage: Making sure that the store achieves scalability, data durability and high availability even as the size of data increases.
  7. Distributed Debugging: Identifying bugs in Distributed Systems.
  8. Scaling Quickly
  9. Reputation Fate Sharing: Identifying adversaries, spammers.
  10. Software Licensing: Solving challenges that result due to proprietary software licensing issues.