Trying to decide which Apache Spark cluster manager is the right fit for your specific use case when deploying a Hadoop Spark cluster on EC2 can be challenging. This post breaks down the general features of each option and details the scheduling, High Availability (HA), security, and monitoring for each.

Apache Spark, an engine for large-scale data processing, can be run in distributed mode on a cluster. Spark applications run as independent sets of processes on a cluster, all coordinated by a central coordinator. This central coordinator can connect with three different cluster managers: Spark Standalone, Apache Mesos, and Hadoop YARN (Yet Another Resource Negotiator).

When running an application in distributed mode on a cluster, Spark uses a master/slave architecture. The central coordinator, also called the driver program, is the main process in your application: it runs the code that creates a SparkContext object. The driver is responsible for converting a user application into smaller execution units called tasks, which are then executed by executors, the worker processes that run the individual tasks.

In a cluster, there is a master and any number of workers. The driver program, which can run in an independent process or in a worker of the cluster, requests executors from the cluster manager and then schedules the tasks composing the application on the executors it obtains. The cluster manager is responsible for scheduling and allocating resources across the host machines forming the cluster.

Spark is agnostic to the underlying cluster manager: all of the supported cluster managers can be launched on-site or in the cloud, all have options for controlling the deployment’s resource usage and other capabilities, and all come with monitoring tools.

So how do you decide which is the best cluster manager for your use case? To answer this question, we’ll begin with a quick overview and then look in more detail at the scaling capabilities, node management, High Availability (HA), security, and monitoring of each of the cluster managers.

Cluster Management Overview

Spark Standalone

The Spark Standalone cluster manager is a simple cluster manager available as part of the Spark distribution. It has HA for the master, is resilient to worker failures, has capabilities for managing resources per application, and can run alongside an existing Hadoop deployment and access HDFS (Hadoop Distributed File System) data. The distribution includes scripts to make it easy to deploy either locally or in the cloud on Amazon EC2. It can run on Linux, Windows, or Mac OSX.

Apache Mesos

Apache Mesos, a distributed systems kernel, has HA for masters and slaves, can manage resources per application, and has support for Docker containers. It can run Spark jobs, Hadoop MapReduce, or any other service application. It has APIs for Java, Python, and C++. It can run on Linux or Mac OSX.

Hadoop YARN

Hadoop YARN, a distributed computing framework for job scheduling and cluster resource management, has HA for masters and slaves, support for Docker containers in non-secure mode, Linux and Windows container executors in secure mode, and a pluggable scheduler. It can run on Linux and Windows.

Cluster Management Scheduling Capabilities

On all cluster managers, jobs or actions within a Spark application are scheduled by the Spark scheduler in a FIFO fashion. Alternatively, the scheduling can be set to a fair scheduling policy where Spark assigns resources to jobs in a round-robin fashion. In addition, the memory used by an application can be controlled with settings in the SparkContext. The resources used by a Spark application can be dynamically adjusted based on the workload. Thus, the application can free unused resources and request them again when there is a demand. This is available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
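
As a sketch of how these settings look in practice, fair scheduling and dynamic allocation can be enabled through Spark configuration properties, for example in spark-defaults.conf (the property names are from the Spark configuration documentation; the executor counts are illustrative):

```properties
# Schedule jobs within an application round-robin instead of FIFO
spark.scheduler.mode                  FAIR
# Let the application grow and shrink its executor pool with the workload
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  1
spark.dynamicAllocation.maxExecutors  20
# Required by dynamic allocation so shuffle files outlive released executors
spark.shuffle.service.enabled         true
```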

Spark standalone uses a simple FIFO scheduler for applications. By default, each application uses all the available nodes in the cluster. The number of nodes can be limited per application, per user, or globally. Other resources, such as memory, CPUs, etc., can be controlled via the application’s SparkConf object.
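
For example, a standalone-mode application might be capped with properties like the following (a sketch; the values are illustrative, and a cluster-wide default can also be set with spark.deploy.defaultCores on the master):

```properties
# Cap this application at 8 cores across the whole standalone cluster
spark.cores.max        8
# Memory allocated to each executor process
spark.executor.memory  4g
```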

Apache Mesos has master and slave processes. The master makes offers of resources to the application (called a framework in Apache Mesos), which either accepts or declines the offer. Thus, claiming available resources and running jobs is determined by the application itself. Apache Mesos allows fine-grained control of the resources in a system such as CPUs, memory, disks, and ports. It also offers coarse-grained control of resources, where Spark allocates a fixed number of CPUs to each executor in advance which are not released until the application exits. Note that in the same cluster, some applications can be set to use fine-grained control while others use coarse-grained control.
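
The choice between the two modes is made per application via Spark configuration, roughly like this (a sketch; the core cap is illustrative):

```properties
# true = coarse-grained: executors hold their CPUs until the application exits
# false = fine-grained: CPUs are claimed and released per task
spark.mesos.coarse  true
# In coarse-grained mode, cap total cores so one app cannot claim the cluster
spark.cores.max     8
```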

Apache Hadoop YARN has a ResourceManager with two parts: a Scheduler and an ApplicationsManager. The Scheduler is a pluggable component, and two implementations are provided: the CapacityScheduler, useful in a cluster shared by more than one organization, and the FairScheduler, which ensures all applications, on average, get an equal share of resources. Both schedulers assign applications to queues, and each queue gets resources that are shared equally between the queues. Within a queue, resources are shared between the applications. The ApplicationsManager is responsible for accepting job submissions and starting the application-specific ApplicationMaster; in this case, the ApplicationMaster is the Spark application. Resources for the Spark application are specified in its SparkConf object.
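
To illustrate, splitting a shared cluster between two organizations with the CapacityScheduler might look like this fragment of capacity-scheduler.xml (the queue names and percentages are hypothetical):

```xml
<!-- Two top-level queues dividing the cluster's capacity -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>engineering,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.engineering.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value>
</property>
```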

High Availability (HA)

The Spark standalone cluster manager supports automatic recovery of the master by using standby masters in a ZooKeeper quorum. It also supports manual recovery using the file system. The cluster is resilient to Worker failures regardless of whether recovery of the Master is enabled.
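
Per the Spark standalone documentation, ZooKeeper-based recovery is enabled through daemon options on each master, roughly as follows (the ZooKeeper quorum address is illustrative):

```shell
# In spark-env.sh on each standalone master
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```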

The Apache Mesos cluster manager also supports automatic recovery of the master using Apache ZooKeeper. Tasks which are currently executing continue to do so in the case of failover.

Apache Hadoop YARN supports manual recovery using a command line utility and automatic recovery via a ZooKeeper-based ActiveStandbyElector embedded in the ResourceManager, so there is no need to run a separate ZooKeeper Failover Controller. ZooKeeper is only used to record the state of the ResourceManagers.
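
As a sketch, ResourceManager HA is turned on in yarn-site.xml with properties like these (hostnames and IDs are illustrative):

```xml
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```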


Security

Spark supports authentication via a shared secret with all the cluster managers. The standalone manager requires the user to configure each of the nodes with the shared secret. Data can be encrypted using SSL for the communication protocols, and SASL encryption is supported for block transfers of data. Other options are also available for encrypting data. Access to Spark applications in the Web UI can be controlled via access control lists.
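
In configuration terms this looks roughly like the following (a sketch; the user names are hypothetical, and on the standalone manager the secret must be set identically on every node):

```properties
# Require a shared secret for Spark's internal connections
spark.authenticate                       true
spark.authenticate.secret                mySharedSecret
# Encrypt block transfers with SASL
spark.authenticate.enableSaslEncryption  true
# Restrict who may view applications in the Web UI
spark.acls.enable                        true
spark.ui.view.acls                       alice,bob
```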

Mesos provides authentication for any entity interacting with the cluster. This includes the slaves registering with the master, frameworks (that is, applications) submitted to the cluster, and operators using the HTTP endpoints. Authentication can be enabled or disabled for each of these entities. Mesos’ default authentication module, Cyrus SASL, can be replaced with a custom module. Access control lists are used to authorize access to services in Mesos. By default, communication between the modules in Mesos is unencrypted; SSL/TLS can be enabled to encrypt it. HTTPS is supported for the Mesos WebUI.

Hadoop YARN has security for authentication, service-level authorization, authentication for Web consoles, and data confidentiality. Hadoop uses Kerberos to verify that each user and service is who it claims to be. Service-level authorization ensures that clients using Hadoop services are authorized to use them, and access to the Hadoop services can be finely controlled via access control lists. Additionally, data and communication between clients and services can be encrypted using SSL, and data transferred between the Web console and clients with HTTPS.
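
Kerberos authentication and service-level authorization are switched on cluster-wide in core-site.xml, roughly as follows (a sketch of the standard Hadoop security properties):

```xml
<!-- Use Kerberos instead of the default "simple" authentication -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<!-- Enforce service-level authorization checks -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```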


Monitoring

Each Apache Spark application has a Web UI to monitor the application. The Web UI shows information about tasks running in the application, executors, and storage usage. Additionally, Spark’s standalone cluster manager has a Web UI to view cluster and job statistics as well as detailed log output for each job. If an application has logged events for its lifetime, the Spark Web UI will automatically reconstruct the application’s UI after the application exits. If Spark is running on Mesos or YARN, the UI can be reconstructed after an application exits through Spark’s history server.

Apache Mesos provides numerous metrics for the master and slave nodes accessible via a URL. These metrics include, for example, the percentage and number of allocated CPUs, total memory used, percentage of available memory used, total disk space, allocated disk space, the elected master, the uptime of a master, slave registrations, and connected slaves. Per-container network monitoring and isolation is also supported.

Hadoop YARN has a Web UI for the ResourceManager and the NodeManager. The ResourceManager UI provides metrics for the cluster while the NodeManager provides information for each node and the applications and containers running on the node.

So, which cluster manager is best for your project?

Apache Spark is agnostic to the underlying cluster manager so choosing which manager to use depends on your goals. In the sections above we discussed several aspects of Spark’s Standalone cluster manager, Apache Mesos, and Hadoop YARN including:

  • Scheduling
  • High Availability
  • Security
  • Monitoring

All three cluster managers provide various scheduling capabilities, but Apache Mesos provides the finest-grained sharing options.

High availability is offered by all three cluster managers but Hadoop YARN doesn’t need to run a separate ZooKeeper Failover Controller.

Security is provided on all of the managers. Apache Mesos uses a pluggable architecture for its security module with the default module using Cyrus SASL. The Standalone cluster manager uses a shared secret and Hadoop YARN uses Kerberos. All three use SSL for data encryption.

Finally, the Spark Standalone cluster manager is the easiest to get started with and provides a fairly complete set of capabilities. The scripts are simple and straightforward to use, so if you are developing a new application this is the quickest way to get started.

And if you need help, AgilData is here for you! AgilData provides professional Big Data services to help organizations make sense of their Big Data.

Cassandra Database – Inserting and updating data into a List and Map



In the Cassandra database, if a table specifies a list to hold data, then either INSERT or UPDATE is used to enter data.

Here is how to modify data in a column of type list:

Working with Cassandra List

On the other hand, if the table specifies a map to hold data, this link can help:

Working with Cassandra Maps
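
To make the difference concrete, here is a hedged CQL sketch using a hypothetical users table (the table and column names are ours, not from the linked docs; the + operator appends to a list or merges into a map):

```sql
-- Hypothetical table with a list column and a map column
CREATE TABLE users (
  id     int PRIMARY KEY,
  emails list<text>,
  phones map<text, text>
);

-- INSERT writes a whole collection at once
INSERT INTO users (id, emails, phones)
VALUES (1, ['a@example.com'], {'home': '555-0100'});

-- UPDATE can append to the list or merge into the map
UPDATE users SET emails = emails + ['b@example.com'] WHERE id = 1;
UPDATE users SET phones = phones + {'work': '555-0101'} WHERE id = 1;
```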


Multi-Column Key and Value – Reduce a Tuple in Spark

In many tutorials a key-value pair is typically a pair of single scalar values, for example (‘Apple’, 7). But key-value is a general concept: both the key and the value often consist of multiple fields, and both can be non-unique.

Consider a typical SQL statement:

SELECT store, product, SUM(amount), MIN(amount), MAX(amount), SUM(units)
FROM sales
GROUP BY store, product

The columns store and product can be considered the key, and the columns amount and units the values.

Let’s implement this SQL statement in Spark. First, we define a sample data set:


val sales = sc.parallelize(List(
   ("West",  "Apple",  2.0, 10),
   ("West",  "Apple",  3.0, 15),
   ("West",  "Orange", 5.0, 15),
   ("South", "Orange", 3.0, 9),
   ("South", "Orange", 6.0, 18),
   ("East",  "Milk",   5.0, 5)))
The Spark/Scala code equivalent to the SQL statement is as follows:
sales.map{ case (store, prod, amt, units) => ((store, prod), (amt, amt, amt, units)) }.
  reduceByKey((x, y) =>
   (x._1 + y._1, math.min(x._2, y._2), math.max(x._3, y._3), x._4 + y._4)).collect
The result of the execution (formatted):
res: Array[((String, String), (Double, Double, Double, Int))] = Array(
  ((West, Orange), (5.0, 5.0, 5.0, 15)), 
  ((West, Apple),  (5.0, 2.0, 3.0, 25)), 
  ((East, Milk),   (5.0, 5.0, 5.0, 5)), 
  ((South, Orange),(9.0, 3.0, 6.0, 27)))

How It Works

We have an input RDD sales containing 6 rows and 4 columns (String, String, Double, Int). The first step is to define which columns belong to the key and which to the value. You can use the map function on the RDD as follows:
sales.map{ case (store, prod, amt, units) => ((store, prod), (amt, amt, amt, units)) }

We defined a key-value tuple where the key is itself a tuple containing (store, prod) and the value is a tuple containing the final results we are going to calculate: (amt, amt, amt, units).

Note that we initialized SUM, MIN, and MAX with amt, so if there is only one row in a group then the SUM, MIN, and MAX values will be the same and equal to amt.

The next step is to reduce values by key:

reduceByKey((x, y) =>

In this function, x is the result of reducing the previous values and y is the current value. Remember that both x and y are tuples containing (amt, amt, amt, units).

Reduce is an associative operation, and it works similarly to adding 2 + 4 + 3 + 5 + 6 + …: you take the first 2 values, add them, then take the 3rd value, add it, and so on.

Now it is easier to understand the meaning of:

reduceByKey((x, y) =>
  (x._1 + y._1, math.min(x._2, y._2), math.max(x._3, y._3), x._4 + y._4)).collect
Sum of the previous and current amounts (_1 means the first item of a tuple):
x._1 + y._1
Selecting MIN amount:
math.min(x._2, y._2)
Selecting MAX amount:
math.max(x._3, y._3)
Sum of the previous and current units:
x._4 + y._4
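
To check this aggregation logic without a Spark cluster, the same map and reduce-by-key steps can be sketched in plain Python (a toy simulation, not the Spark API):

```python
# Plain-Python simulation of the map + reduceByKey pipeline above.
sales = [
    ("West",  "Apple",  2.0, 10),
    ("West",  "Apple",  3.0, 15),
    ("West",  "Orange", 5.0, 15),
    ("South", "Orange", 3.0, 9),
    ("South", "Orange", 6.0, 18),
    ("East",  "Milk",   5.0, 5),
]

# map step: ((store, prod), (sum, min, max, units)), all seeded with amt
pairs = [((store, prod), (amt, amt, amt, units))
         for store, prod, amt, units in sales]

# reduce step: combine two partial results exactly as in the Scala lambda
def combine(x, y):
    return (x[0] + y[0], min(x[1], y[1]), max(x[2], y[2]), x[3] + y[3])

result = {}
for key, value in pairs:
    result[key] = combine(result[key], value) if key in result else value

print(result[("West", "Apple")])  # (5.0, 2.0, 3.0, 25)
```

Because combine is associative, the partial results can be merged in any grouping, which is exactly what lets Spark run reduceByKey in parallel across partitions.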

Understanding Spark Streaming


Spark Streaming, added to Apache Spark in 2013, is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources like Kafka, Flume, and Amazon Kinesis. Its key abstraction is a Discretized Stream or, in short, a DStream, which represents a stream of data divided into small batches. DStreams are built on RDDs, Spark’s core data abstraction, which allows Spark Streaming to seamlessly integrate with other Spark components like MLlib and Spark SQL.
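
The micro-batch idea behind DStreams can be illustrated with a toy example (plain Python, not the Spark Streaming API): records arriving on a stream are chopped into small batches, and the same batch computation, here a word count, is applied to each batch in turn.

```python
# Toy illustration of the DStream micro-batch model (not the Spark API).
from collections import Counter

stream = ["spark", "kafka", "spark", "flume", "kinesis", "spark"]
batch_size = 2  # stands in for a batch interval such as "every 2 seconds"

# A DStream is conceptually a sequence of small batches...
batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]

# ...and the streaming job applies one batch computation to each in turn
per_batch_counts = [Counter(batch) for batch in batches]
print(per_batch_counts[0])  # Counter({'spark': 1, 'kafka': 1})
```

Because each batch is an ordinary collection, the same code paths (and libraries like MLlib and Spark SQL in real Spark) apply to streaming data as to data at rest.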

Spark Streaming


Spark Introduction

An interesting introduction to Spark (http://spark.apache.org/):

Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark’s RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.