Monday, October 31, 2011

The Next Generation of Apache Hadoop MapReduce

Arun C Murthy. The Next Generation of Apache Hadoop MapReduce.
This article presents the design of Hadoop NextGen, which tries to scale Hadoop MapReduce clusters beyond 4,000 nodes and to allow a variety of frameworks to share the same cluster. While the gist of the article and many of its design principles share key aspects with Mesos (support for multiple frameworks) and even Dryad (a per-job scheduler as opposed to a single JobTracker for all jobs), being an open-source project that is built as a real system, I think this project will have a considerable impact in the future. 

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. NSDI 2011.
This paper presents Mesos, a platform for sharing resources among various cloud computing frameworks in a fine-grained manner. It introduces a distributed two-level scheduling mechanism in which the central scheduler (the Mesos master) decides how many resources to offer each framework, while the individual frameworks themselves decide which of the offered resources to accept. While not globally optimal, the authors claim that the system works very well in practice. To be Mesos-compatible, a framework only needs to implement a scheduler that registers with the Mesos master and an executor process that is launched on slave nodes to run the framework's tasks. Mesos uses resource filters for efficiency, kills long-running tasks (if there is a need) to prevent starvation, and uses Linux containers for CPU/memory isolation. The paper features an impressive evaluation section, and the authors show that Mesos can seamlessly scale up to 50,000 emulated nodes on Amazon EC2.
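
To make the two-level mechanism concrete, here is a minimal sketch in Python (my own illustration with made-up names, not the actual Mesos API) of the offer/accept protocol: the master decides which framework gets offered which resources, while each framework's scheduler decides what to accept.

    # Minimal sketch of Mesos-style two-level scheduling (illustrative only;
    # the real Mesos API differs). The master offers resources to frameworks;
    # each framework's scheduler decides which offers to accept.

    class Offer:
        def __init__(self, slave_id, cpus, mem_mb):
            self.slave_id, self.cpus, self.mem_mb = slave_id, cpus, mem_mb

    class FrameworkScheduler:
        """Per-framework scheduler: registers with the master and accepts
        only offers that satisfy its per-task requirements."""
        def __init__(self, name, task_cpus, task_mem_mb):
            self.name = name
            self.task_cpus, self.task_mem_mb = task_cpus, task_mem_mb

        def resource_offer(self, offer):
            # The framework, not the master, decides what runs where.
            if offer.cpus >= self.task_cpus and offer.mem_mb >= self.task_mem_mb:
                return "launch_task"
            return "decline"

    class MesosMaster:
        """Central scheduler: decides which framework is offered which
        resources (e.g., via fair sharing), but never places tasks itself."""
        def __init__(self):
            self.frameworks = []

        def register(self, scheduler):
            self.frameworks.append(scheduler)

        def offer_resources(self, offer):
            # Simplest possible allocation policy: offer to frameworks in
            # registration order until one accepts.
            for fw in self.frameworks:
                if fw.resource_offer(offer) == "launch_task":
                    print(f"{fw.name} launches a task on {offer.slave_id}")
                    return
            print(f"offer on {offer.slave_id} declined by all frameworks")

    master = MesosMaster()
    master.register(FrameworkScheduler("hadoop", task_cpus=2, task_mem_mb=1024))
    master.register(FrameworkScheduler("mpi", task_cpus=4, task_mem_mb=4096))
    master.offer_resources(Offer("slave-1", cpus=2, mem_mb=2048))  # hadoop accepts
    master.offer_resources(Offer("slave-2", cpus=8, mem_mb=8192))  # hadoop again, offered first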


Comments/Critiques:


The paper is very well written and tackles a relevant problem. The authors designed and implemented a real system and took it all the way through (Mesos is now part of the Apache Incubator, deployed at Twitter and on the 40-node RADLab r-cluster). However, I have a few comments on some aspects of the paper:


  1. Granularity of Scheduling: Mesos provides a per-framework scheduler and lets each framework do its own scheduling. However, it is unclear to me whether that is the right granularity. What about a per-job scheduler within each framework? Though this raises other issues (rewriting frameworks, etc.), it would go a long way toward ensuring per-framework scalability and could even prevent potential conflicts arising from two-level scheduling.
  2. Delay Scheduling in Hadoop: I think it was interesting that delay scheduling was brought up in this paper; however, I was disappointed that the paper just presented some locality improvements in Hadoop that were in no way related to Mesos. I was more interested in the implications of having two levels of scheduler policies. For example, I was curious whether delay scheduling might work better if the framework paused briefly before accepting resource offers (a sketch of this idea follows the list).
  3. Task Resource/Duration Prediction: The paper assumes that each framework can specify its resource requirements exactly. While MR frameworks already make such an assumption (by having one slot per task), it is unclear to me whether it always holds at a more generic level.
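
Regarding point 2 above, here is a minimal sketch (my own illustration, not from either paper) of how delay scheduling could sit on top of resource offers: the framework declines non-local offers for a bounded number of rounds, hoping for a data-local one, before settling for whatever is offered.

    # Illustrative sketch of delay scheduling on top of resource offers.
    # All names and the skip-count threshold are assumptions for this example.

    class DelaySchedulingFramework:
        def __init__(self, preferred_slaves, max_wait_offers=3):
            self.preferred = set(preferred_slaves)  # slaves holding our input data
            self.max_wait = max_wait_offers         # offers to skip before giving up
            self.skipped = 0

        def resource_offer(self, slave_id):
            if slave_id in self.preferred:
                self.skipped = 0
                return "accept (data-local)"
            if self.skipped < self.max_wait:
                self.skipped += 1
                return "decline (waiting for locality)"
            self.skipped = 0
            return "accept (gave up on locality)"

    fw = DelaySchedulingFramework(preferred_slaves={"slave-1"}, max_wait_offers=2)
    for slave in ["slave-2", "slave-3", "slave-4", "slave-1"]:
        print(slave, "->", fw.resource_offer(slave))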

Monday, October 24, 2011

Database Scalability, Elasticity, and Autonomy in the Cloud

Agrawal, D., Abbadi, A. E., Das, S., and Elmore, A. J. (2011). Database scalability, elasticity, and autonomy in the cloud - (extended abstract). In Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011.
This paper, like the previous one, also discusses the challenges of designing a 'Database-as-a-Service' model in the cloud. However, unlike Relational Cloud, it weighs the pros and cons of each design choice and analyzes their inherent trade-offs for a non-relational DBMS. The paper highlights and attempts to tackle two key challenges:
  1. Traditional DBMSs cannot be scaled very easily. This is because a traditional DBMS always treats the database as a "whole" and tries to guarantee the consistency of the entire data. This is in contrast with the popular key-value stores, in which each key-value pair is treated as an independent unit and consistency is guaranteed per individual key. However, the authors claim that neither of these two approaches, in the strict sense, is sufficient: applications generally require consistency across a group of key-value pairs, on a per-application and even a time-varying basis. Such a system can be implemented either by adding multi-key atomicity to key-value stores (data fusion) or by partitioning a traditional DBMS into 'entity groups' (data fission); a sketch of the latter follows this list.
  2. Traditional DBMSs are statically provisioned. Since traditional DBMSs were always intended for enterprise infrastructure, they are statically provisioned; a query plan, in some sense, binds early to resources. The lack of automatic workload-dependent management not only prevents queries from scaling out or in depending on the load, it also prevents individual DBMSs from sharing physical resources optimally. To this end, similar to Relational Cloud, the authors present a load balancer with both static and dynamic components, along with Zephyr, a technique for live migration of database instances.
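
To make the data-fission idea from point 1 concrete, here is a minimal sketch (my own illustration; the key-naming scheme is a made-up assumption) of a store that guarantees multi-key atomicity only within an entity group:

    # Sketch of the 'entity group' idea: consistency is guaranteed within a
    # group of related keys, but not across groups.

    import threading
    from collections import defaultdict

    class EntityGroupStore:
        def __init__(self):
            self.data = {}
            self.group_locks = defaultdict(threading.Lock)

        def group_of(self, key):
            # Assumption: keys look like "user123/cart", "user123/profile";
            # the prefix before '/' identifies the entity group.
            return key.split("/", 1)[0]

        def atomic_update(self, updates):
            """Apply a multi-key update atomically, but only if every key
            belongs to the same entity group (data fission)."""
            groups = {self.group_of(k) for k in updates}
            if len(groups) != 1:
                raise ValueError("cross-group transactions are not supported")
            with self.group_locks[groups.pop()]:
                self.data.update(updates)

    store = EntityGroupStore()
    store.atomic_update({"user123/cart": ["book"], "user123/profile": {"vip": True}})
    # store.atomic_update({"user123/cart": [], "user456/cart": []})  # would raise
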
Comments/Critiques:

It was very interesting to see that both this paper and the Relational Cloud paper identify similar challenges in implementing DBaaS solutions. I liked reading this paper mainly because it explains and analyzes the various trade-offs in a neutral fashion, and it even helped me understand the Relational Cloud paper better (e.g., the data-fission/partitioning arguments). In general, and in contrast to the Relational Cloud paper, it presents an interesting case for implementing non-relational DBMSs in the cloud, because they scale well and most applications don't require strong consistency semantics anyway. However, a key question to answer is: who are the customers of this pay-per-use DBaaS model? Are they small enterprises/websites, or fairly large enterprises? If the former, do we really need to worry about the scaling problems of individual DBMSs, given that it is highly likely that their databases would easily fit in the same datacenter?

Relational Cloud: A Database-as-a-Service for the Cloud

C. Curino, E. Jones, R. Popa, N. Malviya, E. Wu, S. Madden, H. Balakrishnan, and N. Zeldovich. Relational Cloud: A Database Service for the Cloud. In CIDR, pages 235–240, 2011.
This paper introduces a transactional 'Database-as-a-Service' (DBaaS) model called Relational Cloud. The paper highlights and attempts to tackle three important challenges:
  1. Efficient Multi-Tenancy: The goal here is to efficiently 'pack' as many databases as possible onto a given set of physical machines while meeting their query performance SLAs. Relational Cloud tries to identify the best way to map a given set of databases and workloads to a set of physical machines. The authors observe that a simple DB-in-VM strategy doesn't scale too well because each database ends up having its own buffer pool, cache, log, etc., which compete (and hence conflict) for the same physical resources. Relational Cloud consists of a monitoring and consolidation engine called Kairos, which has a resource monitor (that tracks resource usage such as memory, disk access, etc.), a combined load predictor (which models the CPU, RAM, and disk usage of consolidated workloads), and a consolidation engine (which uses non-linear optimization techniques to figure out the mapping between workloads and physical machines).
  2. Elastic Scalability: The goal here is to enable the DBMS to efficiently scale out when the workload exceeds the capacity of a single machine. Relational Cloud uses graph partitioning algorithms to automatically analyze complex query workloads and map data items to physical nodes so as to minimize multi-node transactions. At a high level, it tries to partition the data such that tuples that are frequently accessed together are stored together on the same physical node. The algorithm builds a graph with tuples as nodes, where the weight of an edge signifies the frequency with which the two tuples are accessed together (a sketch follows this list). To scale this graph representation, it uses various sampling/exclusion heuristics.
  3. Database Privacy: Being a cloud database, Relational Cloud uses CryptDB, which introduces the interesting concept of adjustable security. CryptDB employs different encryption levels (RND, DET, OPE, or HOM) for different types of operations on data. This enables queries to be evaluated directly on encrypted data, with results sent back to the client for final decryption.
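
To make the co-access graph from point 2 concrete, here is a minimal sketch (my simplification; a real system would hand the resulting graph to a partitioner such as METIS) of building the weighted tuple graph from a transaction workload:

    # Build the co-access graph used for workload-driven partitioning:
    # tuples are nodes, and an edge's weight counts how often the two
    # tuples appear in the same transaction.

    from collections import Counter
    from itertools import combinations

    def build_coaccess_graph(transactions):
        """transactions: list of sets of tuple ids accessed together."""
        edges = Counter()
        for txn in transactions:
            for a, b in combinations(sorted(txn), 2):
                edges[(a, b)] += 1
        return edges

    workload = [
        {"t1", "t2"}, {"t1", "t2"}, {"t1", "t2", "t3"},  # t1, t2 are hot together
        {"t4", "t5"},
    ]
    for (a, b), w in build_coaccess_graph(workload).most_common():
        print(f"{a} -- {b}: weight {w}")
    # Placing t1 and t2 on the same node avoids three multi-node transactions.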

Comments/Critiques:

Being more of a high-level paper, there is a lot of secret sauce in the individual components that I would have liked to see more of. For example, it is unclear to me what the tuple graph would look like for OLAP workloads (as opposed to the OLTP/Web workloads described in the paper), and whether OLAP workloads can even be partitioned across physical nodes as easily as their OLTP counterparts. As far as Kairos is concerned, the key component there is the combined load predictor, which models consolidated CPU, RAM, and disk usage for multiple workloads; it is unclear to me whether it can predict sudden spikes or other non-historical workload characteristics. The key thing I found missing from the paper was a section on the effects of consolidation on individual query latency. As highlighted in the second paper we read for the class, traditional DBMSs bind queries based on a static resource assumption. If this is true, the effect of Kairos mispredictions on individual query latency is quite unclear from this paper.

Overall, I think Relational Cloud is a great initiative toward the DBaaS model and is a much better fit in its approach than its counterparts (Amazon RDS and Microsoft SQL Azure). However, the question I remain unclear about is whether we really need a relational DBaaS model in the cloud as opposed to weak-consistency/entity-group-based models.


Monday, October 17, 2011

PNUTS: Yahoo!'s Hosted Data Serving Platform

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. In PVLDB 2008.
PNUTS is a parallel, geographically distributed database that supports Yahoo!'s web applications. It chooses high availability, low response times, and scalability over strict consistency. The authors observe that web applications typically manipulate one record at a time, while different records may see activity in different geographic localities. To this end, they provide a per-record timeline consistency model, in which all replicas of a given record apply all updates made to it in the same order. This is achieved by designating a per-record master that puts all updates in order. Another key component of the design is the pub-sub message system used for record-level asynchronous geographic replication, which relies on a (message-broker-based) guaranteed message delivery service rather than a persistent log. Lastly, PNUTS trusts the application designers to do the right thing in terms of their consistency needs by providing support for hashed and ordered table organizations and, more importantly, API calls like read-any, read-latest, and read-critical (version-specific).
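
To make per-record timeline consistency concrete, here is a minimal sketch (my own simulation, not PNUTS code) of a per-record master that orders updates and a replica that serves read-any vs. read-critical:

    # Per-record timeline consistency: all writes to a record go through
    # that record's master, which assigns a sequence number; replicas apply
    # updates in order, so every replica sees the same timeline, with lag.

    class RecordMaster:
        def __init__(self):
            self.version = 0
            self.log = []  # (version, value) pairs, in commit order

        def write(self, value):
            self.version += 1
            self.log.append((self.version, value))
            return self.version

    class Replica:
        def __init__(self, master):
            self.master, self.applied, self.value = master, 0, None

        def sync(self, up_to=None):
            # Apply the master's updates in timeline order (async replication).
            for v, val in self.master.log:
                if v > self.applied and (up_to is None or v <= up_to):
                    self.applied, self.value = v, val

        def read_any(self):
            return self.applied, self.value          # possibly stale, fastest

        def read_critical(self, required_version):
            if self.applied < required_version:
                self.sync(up_to=required_version)    # catch up first
            return self.applied, self.value

    master = RecordMaster()
    replica = Replica(master)
    v1 = master.write("alice@old.example")
    replica.sync()
    v2 = master.write("alice@new.example")
    print(replica.read_any())            # may still return version 1
    print(replica.read_critical(v2))     # guaranteed version >= 2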

Comments/Critiques:

PNUTS is an interesting design point in this space due to its simplicity in trusting the application designers to do the right thing. This feels, in some sense, like an anti-Google system design philosophy (Google always puts 'making the job of the application designer easy' as its top priority). However, it remains to be seen whether application designers can write their applications by properly leveraging the trade-offs (e.g., where should they use read-latest vs. read-critical?). Secondly, it is unclear to me whether the lack of support for referential integrity and complex queries (with joins, etc.) is a major drawback for continuously evolving web applications. In all, though, I think PNUTS is definitely an interesting design, and things like per-record timeline consistency will find their way into many future systems.