Monday, November 7, 2011

VL2: A Scalable and Flexible Data Center Network

A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," ACM SIGCOMM 2009, (August 2009).

This paper presents VL2 (Virtual Layer 2), a data center network design with the following key features:

  1. Flat addressing, which allows service instances to be placed anywhere in the network.
  2. Valiant load balancing (VLB) to spread traffic over randomized paths.
  3. End-system-based address resolution.
The authors established high utilization and agility as the foremost goals in their design. Current datacenters suffer from over-subscription of resources, and nothing prevents a traffic flood in one service from affecting the others. Moreover, IP addresses and VLANs are used to achieve scale, which makes things like VM migration extremely complex and introduces broadcasts.

The authors did an extensive analysis of the traffic patterns in datacenters. However, they found that there is no particular pattern in datacenter traffic that could be exploited in the design of the routing protocol. So, like PortLand, they decided to exploit the topology instead. They assume a Clos network, which is a rooted tree in which the links between the intermediate switches and the aggregation switches form a complete bipartite graph. The underlying idea of VL2 addressing and routing is having two kinds of addresses: application-specific IP addresses (AAs) and location-specific IP addresses (LAs). The VL2 agent at each server traps packets from the host and encapsulates each packet with the LA of the ToR switch of the destination. The VL2 agent learns the AA-to-LA mapping from a directory lookup service. Further, Valiant load balancing (VLB) spreads flows over randomized paths.
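
To make the data path concrete, here is a minimal sketch in Python of what the VL2 agent's per-packet work could look like (all names here are my own illustration; the paper does not give code): resolve the destination AA to the LA of its ToR switch via the directory service, pick a random intermediate switch for VLB, and encapsulate.

    import random

    # Hypothetical directory mapping: destination AA -> LA of its ToR switch.
    # In VL2 this is served (and cached) by the directory service.
    DIRECTORY = {"10.0.0.5": "20.1.1.1"}

    # LAs of the intermediate switches at the top of the Clos network.
    INTERMEDIATE_LAS = ["20.0.0.1", "20.0.0.2", "20.0.0.3"]

    def vl2_send(packet, dst_aa):
        """Sketch of the VL2 agent's work on the sending host."""
        tor_la = DIRECTORY.get(dst_aa)  # directory lookup for AA -> LA
        if tor_la is None:
            raise LookupError("unknown AA; query the directory service")
        # Valiant load balancing: bounce the traffic off a randomly chosen
        # intermediate switch (chosen per flow in practice, to avoid reordering).
        bounce_la = random.choice(INTERMEDIATE_LAS)
        # Double encapsulation: the outer header targets the intermediate
        # switch, the inner header targets the destination's ToR switch.
        return {"outer_dst": bounce_la, "inner_dst": tor_la, "payload": packet}

In the real system the random choice is effectively made per flow rather than per packet (to avoid TCP reordering), and the AA-to-LA mappings are cached at the agent.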

Critique
Overall, I think the main strength of the paper lies in the fact that all the observations are supported by large-scale experiments and data. This paper is also based on a Clos topology, which again raises the issue of how apt it is to design a generic protocol over a given topology. Further, another issue in VLB is that of path stretch. However, I was not really convinced when the authors said that this is not much of a problem due to the environment (propagation delay is small) and the underlying redundant topology. It would be really interesting to discuss PortLand and VL2 in class, along with the various design decisions that could be cross-implemented to improve both protocols. I would recommend reading the excellent article "It's Microsoft vs. the professors with competing data center architectures" and Prof. Amin Vahdat's blog post that compares PortLand and VL2.

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, A. Vahdat, "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric", ACM SIGCOMM, (August 2009).
This paper presented PortLand, a Layer 2 Network Fabric for modern Data Centers. The authors laid down the following requirements in order to achieve their goal:
  1. Permit VM migration to any physical machine without the need to change the IP address.
  2. Plug and Play Switches.
  3. Efficient Communication.
  4. No forwarding loops.
  5. Rapid and efficient failure detection.
It is interesting to note that, strictly speaking, these functionalities cannot all be provided by current protocols at either layer 2 or layer 3. While Req 1 can be tackled at layer 3 (IP), layer 3 fails to provide plug-and-play functionality for switches. Further, Req 4 can be tackled at both layer 2 and layer 3 (by the spanning tree protocol or TTL, respectively). However, Req 5 is not met by either layer, since the routing protocols (IS-IS/OSPF) are broadcast based!

As a result, the authors proposed PortLand, which is built on the assumption of a fat-tree network topology and consists of:

  1. A lightweight protocol to enable switches to discover their position in the topology. It involves a separate centralized system known as the Fabric Manager, a user process on a dedicated machine responsible for ARP resolution, fault tolerance and multicast. It maintains soft state about the network topology, including IP-to-PMAC mappings.
  2. A 48-bit pseudo MAC (PMAC) of the form pod.position.port.vmid, which encodes a machine's position in the topology (see the sketch after this list). Any switch can query the Fabric Manager for the PMAC and hence know the exact location of the destination.
  3. The Location Discovery Protocol (LDP), which enables switches to discover their exact position in the topology.
  4. Support for ARP, multicast and broadcast.
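
To illustrate point 2, here is a minimal sketch of PMAC encoding and decoding, assuming a 16/8/8/16-bit field split that adds up to the 48 bits described in the paper (the helper names are mine):

    # Sketch of PMAC encoding/decoding: pod.position.port.vmid,
    # with a 16/8/8/16-bit split adding up to 48 bits.

    def encode_pmac(pod, position, port, vmid):
        assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
        return (pod << 32) | (position << 24) | (port << 16) | vmid

    def decode_pmac(pmac):
        return {"pod": (pmac >> 32) & 0xFFFF,
                "position": (pmac >> 24) & 0xFF,
                "port": (pmac >> 16) & 0xFF,
                "vmid": pmac & 0xFFFF}

    pmac = encode_pmac(pod=3, position=1, port=7, vmid=42)
    assert decode_pmac(pmac) == {"pod": 3, "position": 1, "port": 7, "vmid": 42}

Since a host's position is embedded in its PMAC, switches can forward on a prefix of the address alone, which keeps forwarding state small.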

Monday, October 31, 2011

The Next Generation of Apache Hadoop MapReduce

Arun C Murthy. The Next Generation of Apache Hadoop MapReduce.
This article presents the design of Hadoop NextGen, which tries to scale Hadoop MR clusters beyond 4,000 nodes and to allow a variety of frameworks to share the same cluster. While the gist of the article and many design principles share key aspects with Mesos (support for multiple frameworks) and even Dryad (a per-job scheduler as opposed to a single JobTracker for all jobs), given that it is an open-source project being built as a real system, I think this project will have a considerable impact in the future.

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. NSDI 2011.
This paper presents Mesos, a platform for sharing resources among various cloud computing frameworks in a fine-grained manner. It introduces a distributed two-level scheduling mechanism in which the central scheduler (the Mesos master) decides how many resources to offer to each framework, while the individual frameworks themselves decide which resources to accept. While not globally optimal, the authors claim that the system works very well in practice. To be Mesos-compatible, a framework just needs to implement a scheduler that registers with the Mesos master and an executor process that is launched on slave nodes to run the framework's tasks. Mesos uses resource filters for efficiency, kills long-running tasks (when needed) to prevent starvation, and uses Linux containers for CPU/memory isolation. The paper features an impressive evaluation section, in which the authors show that Mesos can seamlessly scale up to 50,000 virtual nodes on Amazon EC2.
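
To make the two-level mechanism concrete, here is a simplified sketch of the offer/accept loop; the class and method names are my own invention, not the actual Mesos API:

    class FrameworkScheduler:
        """A framework's own (level-2) scheduler."""
        def __init__(self, name):
            self.name, self.allocated = name, 0

        def resource_offer(self, offers):
            # The framework, not the master, decides which offers to accept,
            # e.g. only slaves that hold this framework's input data.
            return [o for o in offers if o["cpus"] >= 1]  # placeholder policy

    class MesosMaster:
        """The central (level-1) scheduler."""
        def __init__(self, frameworks):
            self.frameworks = frameworks

        def offer_resources(self, free):
            # Offer free resources to the framework furthest below its share.
            fw = min(self.frameworks, key=lambda f: f.allocated)
            accepted = fw.resource_offer(free)
            fw.allocated += sum(o["cpus"] for o in accepted)
            return fw.name, accepted

    master = MesosMaster([FrameworkScheduler("hadoop"), FrameworkScheduler("mpi")])
    print(master.offer_resources([{"slave": "s1", "cpus": 4}]))

The design choice this illustrates is that the master never needs to understand framework-specific constraints (data locality, task shapes); it only decides who is offered what.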


Comments/Critiques:


The paper is very well written and tackles a relevant problem. The authors designed and implemented a real system and took it all the way through (Mesos is now part of the Apache Incubator, deployed at Twitter and on the 40-node RADLab cluster). However, I had a few comments on some aspects of the paper:


  1. Granularity of scheduling: Mesos presents a per-framework scheduler and assumes each framework does its own scheduling. However, it was unclear to me whether that is the right granularity. How about a per-job, per-framework scheduler? Though this has other issues (rewriting frameworks, etc.), it would go a long way toward ensuring per-framework scalability and even preventing potential conflicts due to two-level scheduling.
  2. Delay scheduling in Hadoop: I think it was interesting that delay scheduling was brought up in this paper; however, I was disappointed that the paper just presented some locality improvements in Hadoop that were in no way related to Mesos. I was more interested in the implications of the two levels of scheduling policies. For example, I was curious to know whether delay scheduling might work better if the framework paused for some time before accepting resource offers (see the sketch after this list).
  3. Task resource/duration prediction: The paper assumes that each framework can exactly specify its resource requirements. While such an assumption is already made by MR frameworks (by having one slot per task), it is unclear to me whether it always holds at a more generic level.
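
To clarify what I mean in point 2, here is a purely hypothetical sketch of a framework scheduler that applies delay scheduling to resource offers, declining non-local offers for a bounded time (the threshold and names are mine, not from the paper):

    import time

    class DelayAwareScheduler:
        """Hypothetical framework scheduler that applies delay scheduling
        to Mesos-style offers: hold out for a data-local offer, but only
        for a bounded amount of time."""
        MAX_WAIT = 5.0  # seconds; a tuning knob

        def __init__(self, preferred_hosts):
            self.preferred_hosts = set(preferred_hosts)
            self.waiting_since = time.time()

        def resource_offer(self, offer):
            local = offer["host"] in self.preferred_hosts
            if local or time.time() - self.waiting_since > self.MAX_WAIT:
                self.waiting_since = time.time()  # reset for the next task
                return True   # accept: local, or we have waited long enough
            return False      # decline and hope a local offer arrives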

Monday, October 24, 2011

Database Scalability, Elasticity, and Autonomy in the Cloud

Agrawal, D., Abbadi, A. E., Das, S., and Elmore, A. J. (2011). Database scalability, elasticity, and autonomy in the cloud - (extended abstract). In Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011.
This paper, like the previous one, also discusses the challenges of designing a 'Database-as-a-Service' model in the cloud. However, unlike Relational Cloud, it discusses the pros and cons of each design choice and analyzes the inherent trade-offs for a non-relational DBMS. The paper highlights and attempts to tackle two key challenges:
  1. Traditional DBMSs cannot be scaled very easily. This is because in traditional DBMSs the database is always treated as a whole, and the DBMS tries to guarantee the consistency of the entire data. This is in contrast with the popular key-value stores, in which each key-value pair is treated as an independent unit and consistency is guaranteed only for each individual key. However, the authors argue that neither of these two approaches is sufficient in the strict sense: applications generally require consistency across a group of key-value pairs, on a per-application and even time-varying basis. Such a system can be implemented either by adding multi-key atomicity to key-value stores (data fusion) or by partitioning a traditional DBMS into 'entity groups' (data fission); see the sketch after this list.
  2. Traditional DBMSs are statically provisioned. Since traditional DBMSs were always intended for enterprise infrastructure, they are statically provisioned: a query plan, in some sense, binds early to the resources. The lack of automatic workload-dependent management not only prevents queries from scaling out or in depending on the load, it also prevents individual DBMSs from sharing physical resources optimally. To this end, similar to Relational Cloud, the authors presented a load balancer with both static and dynamic components, along with Zephyr, a technique for live migration of instances.
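
As a minimal sketch of the entity-group (data-fission) idea from point 1 (my own illustration, not code from the paper): all key-value pairs in a group are co-located, so a group-local transaction can commit atomically without any distributed coordination.

    import threading

    class EntityGroup:
        """All key-value pairs of one group (say, one user's data) live on
        one node, so a single lock gives multi-key atomicity within it."""
        def __init__(self):
            self.data = {}
            self.lock = threading.Lock()

        def transact(self, updates):
            # Atomic across every key in the group; no distributed commit.
            with self.lock:
                self.data.update(updates)

    groups = {}  # group key -> EntityGroup; groups can live on different nodes

    def put(group_key, updates):
        groups.setdefault(group_key, EntityGroup()).transact(updates)

    put("user:42", {"profile": "...", "settings": "..."})  # one atomic unit

Consistency across groups is then deliberately weaker, which is exactly what lets the groups be placed and scaled independently.
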
Comments/Critiques:

It was very interesting to see that both this paper and the Relational Cloud paper identify similar challenges in implementing DBaaS solutions. I liked reading this paper mainly because it explained and analyzed the various trade-offs in a neutral fashion and even helped me understand the Relational Cloud paper better (e.g., the data fission/partitioning arguments). In general, as opposed to the Relational Cloud paper, this paper presents an interesting case for implementing non-relational DBMSs in the cloud, because they scale well and most applications don't require strong consistency semantics anyway. However, a key question is: who are the customers of this pay-per-use DBaaS model? Are they small enterprises/websites, or fairly large enterprises? If the former, do we really need to worry about the scaling problems of individual DBMSs, given that it is highly likely that their databases would easily fit in the same datacenter?

Relational Cloud: A Database-as-a-Service for the Cloud

C. Curino, E. Jones, R. Popa, N. Malviya, E. Wu, S. Madden, H. Balakrishnan, and N. Zeldovich. Relational Cloud: A Database Service for the Cloud. In CIDR, pages 235–240, 2011.
This paper introduces a transactional 'Database-as-a-Service' (DBaaS) model called Relational Cloud. The paper highlights and attempts to tackle three important challenges:
  1. Efficient Multi-Tenancy: The goal here is to efficiently 'pack' as many databases as possible onto a given set of physical machines while meeting their query performance SLAs. Relational Cloud tries to identify the best way to map a given set of databases and workloads to a set of physical machines. The authors identify that just using a DB-in-VM strategy doesn't scale too well, because each database ends up having its own buffer pool, cache, log, etc., which compete (and hence conflict) for the same physical resources. Relational Cloud includes a monitoring and consolidation engine called Kairos, which has a resource monitor (tracking resource usage such as memory and disk access), a combined load predictor (which models the CPU, RAM and disk usage of consolidated workloads) and a consolidation engine (which uses non-linear optimization techniques to figure out the mapping between workloads and physical machines).
  2. Elastic Scalability: The goal here is to enable the DBMS to efficiently scale out when the workload exceeds the capacity of a single machine. Relational Cloud uses graph partitioning algorithms to automatically analyze complex query workloads and map data items to physical nodes so as to minimize multi-node transactions. At a high level, it tries to partition the data such that tuples that are frequently accessed together are stored on the same physical node. The algorithm builds a graph with the tuples as nodes, where the weight of the edge between two tuples signifies the frequency with which they are accessed together (see the sketch after this list). To scale this graph representation, it uses various sampling/exclusion heuristics.
  3. Database Privacy: Being a cloud database, Relational Cloud uses CryptDB, which introduces an interesting concept of adjustable security. CryptDB employs different encryption levels (RND, DET, OPE or HOM) for different types of operations on data. This enables queries to be evaluated over encrypted data, with the results sent back to the client for final decryption.
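
To illustrate the tuple graph from point 2, here is a simplified sketch (mine, not the paper's actual partitioner, which is considerably more elaborate): nodes are tuples, edge weights count how often two tuples appear in the same transaction, and a standard min-cut partitioner then maps the resulting partitions to physical nodes.

    from collections import defaultdict
    from itertools import combinations

    def build_coaccess_graph(workload):
        """workload: iterable of transactions, each a set of tuple ids.
        Returns edge weights: (t1, t2) -> number of co-accesses."""
        edges = defaultdict(int)
        for txn in workload:
            for t1, t2 in combinations(sorted(txn), 2):
                edges[(t1, t2)] += 1
        return edges

    workload = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}]
    graph = build_coaccess_graph(workload)
    # ("a", "b") has weight 2, so a min-cut partitioner (e.g. METIS) would
    # try to keep tuples a and b on the same physical node.
    assert graph[("a", "b")] == 2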

Comments/Critiques

Being more of a high-level paper, there was a lot of secret sauce in the individual components that I would have been more interested in. For example, it is unclear to me what the tuple graph would look like for OLAP workloads (as opposed to the OLTP/Web workloads described in the paper) and whether OLAP workloads can even be partitioned across physical nodes as easily as their OLTP counterparts. As far as Kairos is concerned, the key component is the combined load predictor, which models consolidated CPU, RAM and disk usage for multiple workloads. It is unclear to me whether it can predict sudden spikes or other non-historical workload characteristics. The key element I found missing from the paper was a section on the effects of consolidation on individual query latency. As highlighted in the second paper we read for the class, traditional DBs bind queries based on a static resource assumption. If this is true, the effect of Kairos mis-predictions on individual query latency is quite unclear from this paper.

Overall, I think Relational Cloud is a great initiative towards the DBaaS model and is a much better fit in its approach than its counterparts (Amazon RDS and Microsoft SQL Azure). However, the question I am unclear about is whether we really need a relational DBaaS model in the cloud as opposed to weak-consistency/entity-group based models.


Monday, October 17, 2011

PNUTS: Yahoo!'s Hosted Data Serving Platform

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. In PVLDB 2008.
PNUTS is a parallel, geographically distributed database that supports Yahoo!'s web applications. It chooses high availability, low response times and scalability over strict consistency. The authors observe that web applications typically manipulate one record at a time, while different records may have activity in different geographic localities. To this end, they provide a per-record timeline consistency model, in which all replicas of a given record apply all updates made to it in the same order. This is achieved by having a per-record master that puts all updates to that record in order. Another key component of the design is the pub-sub message system used for record-level asynchronous geographic replication, which relies on a (message-broker based) guaranteed message delivery service rather than a persistent log. Lastly, PNUTS trusts the application designers to do the right thing in terms of their consistency needs by providing support for hashed and ordered table organizations and, more importantly, API calls like read-any, read-latest and read-critical (version specific).
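
A minimal sketch of per-record timeline consistency (illustrative only; the names are mine): the record's master assigns a monotonically increasing version to each update, and every replica applies updates strictly in version order, so replicas may lag behind but never diverge.

    class RecordReplica:
        """One replica of a single record under timeline consistency."""
        def __init__(self):
            self.version, self.value = 0, None
            self.pending = {}  # out-of-order updates, keyed by version

        def apply(self, version, value):
            self.pending[version] = value
            # Apply strictly in the order the master assigned.
            while self.version + 1 in self.pending:
                self.version += 1
                self.value = self.pending.pop(self.version)

    class RecordMaster:
        """The per-record master serializes all writes to its record."""
        def __init__(self, replicas):
            self.version, self.replicas = 0, replicas

        def write(self, value):
            self.version += 1
            for r in self.replicas:  # delivered via the pub-sub broker in PNUTS
                r.apply(self.version, value)

    # read-any may return a (possibly stale) value from any replica;
    # read-latest must consult the record's master.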

Comments/Critiques:

PNUTS is an interesting design point in this space due to its simplicity: it trusts the application designers to do the right thing. This feels like the opposite of Google's system design philosophy, which always puts 'making the job of the application designer easy' as the top priority. However, it remains to be seen whether application designers can write their applications by properly leveraging the trade-offs (e.g., where should they use read-latest vs. read-critical?). Secondly, it is unclear to me whether the lack of support for referential integrity and complex queries (with joins, etc.) is a major drawback for continuously evolving web applications. In all, though, I think PNUTS is definitely an interesting design, and ideas like per-record timeline consistency will find their way into many future systems.