Overview of Apache Spark and How it Works

Developed by Romanian-Canadian computer scientist Matei Zaharia, Apache Spark is an open-source cluster-computing framework that's widely used for Big Data. It offers an interface through which programmers can program entire clusters with implicit data parallelism and fault tolerance. To learn more about Spark and its many real-world applications, keep reading.

History of Apache Spark

Apache Spark began as a class project in UC Berkeley's AMPLab in 2009. In an interview, Ion Stoica, UC Berkeley professor and Databricks CEO, explained that the project's original goal was to create a cluster management framework capable of supporting different kinds of cluster computing systems. That work produced Mesos, and AMPLab students then wanted to take the idea one step further, resulting in the creation of Spark.

Of course, other frameworks were already in use, so the students set out to build something distinct. One of them, Matei Zaharia, targeted iterative workloads such as machine learning, which the batch-oriented processing model found in Hadoop handled poorly. This made Spark substantially more effective for many Big Data applications, and it remains one of the framework's characteristic strengths.

Although Spark was created in 2009, it wasn't open sourced until the following year, under the BSD license – a permissive free-software license used for the distribution of software. In 2013, Spark was donated to the Apache Software Foundation, at which point its license was switched from BSD to Apache 2.0. Today, Spark remains one of the most widely used frameworks in Big Data processing and projects. Last year, for instance, it had more than 1,000 contributors, attesting to its widespread popularity.

About Apache Spark

Due to its exceptional speed and efficiency, Spark has become the preferred framework for many Big Data projects. It offers a unified framework for Big Data processing that handles a variety of workloads, including batch processing, real-time streaming, SQL queries, and graph computation.

By default, Spark stores data in memory, spilling to disk only when necessary. It can even hold part of a dataset in memory and keep the rest on disk. This allows for faster speeds and greater performance.
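The memory-first, spill-to-disk idea can be sketched in plain Python. This is a conceptual toy, not Spark's actual storage code: the class name, the slot limit, and the pickle-based spill are all invented for illustration.

```python
import os
import pickle
import tempfile

class SpillableCache:
    """Toy sketch (not Spark's API): keep up to `memory_slots` partitions
    in RAM and spill the rest to disk, mirroring the idea behind Spark's
    memory-and-disk storage behavior."""

    def __init__(self, memory_slots=2):
        self.memory_slots = memory_slots
        self.in_memory = {}                  # partition_id -> data
        self.spill_dir = tempfile.mkdtemp()  # disk-backed partitions

    def put(self, partition_id, data):
        if len(self.in_memory) < self.memory_slots:
            self.in_memory[partition_id] = data     # fast tier: RAM
        else:
            path = os.path.join(self.spill_dir, str(partition_id))
            with open(path, "wb") as f:             # slow tier: disk
                pickle.dump(data, f)

    def get(self, partition_id):
        if partition_id in self.in_memory:
            return self.in_memory[partition_id]
        path = os.path.join(self.spill_dir, str(partition_id))
        with open(path, "rb") as f:
            return pickle.load(f)

# Store four partitions; only the first two fit in "memory".
cache = SpillableCache(memory_slots=2)
for pid in range(4):
    cache.put(pid, list(range(pid * 10, pid * 10 + 10)))

print(cache.get(0)[:3])  # served from memory: [0, 1, 2]
print(cache.get(3)[:3])  # served from disk: [30, 31, 32]
```

Reads that hit the in-memory tier avoid deserialization and disk I/O entirely, which is where Spark's speed advantage over disk-based systems comes from.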

Just how fast is the Spark framework? Some reports claim it can run workloads up to 100 times faster than Hadoop MapReduce when the data fits in memory, and up to 10 times faster when running from disk. This alone is reason enough for many IT organizations and Big Data analysts to choose Spark over other common frameworks.

Spark is built around a specific type of data structure called the resilient distributed dataset (RDD): a read-only multiset of data, partitioned across the machines of a cluster and maintained in a fault-tolerant way. Spark also allows users to write applications in Scala, Python, or Java, with access to more than 80 high-level operators. Operations supported by the framework include map, reduce, filter, and join, as well as SQL queries.
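To make the RDD idea concrete, here is a toy, pure-Python sketch (not the real Spark API): an immutable dataset that records a lineage of transformations and only evaluates them when an action such as collect or reduce is called. The class and method names are simplified stand-ins.

```python
from functools import reduce as _reduce

class MiniRDD:
    """Toy sketch of an RDD: immutable data plus a recorded lineage of
    transformations, evaluated lazily when an action is invoked."""

    def __init__(self, data, lineage=()):
        self._data = data        # base dataset (never mutated)
        self._lineage = lineage  # recorded transformations

    def map(self, fn):
        # Transformations return a *new* MiniRDD; nothing runs yet.
        return MiniRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the lineage over the base data. In real Spark,
        # this replay is also how lost partitions are recomputed.
        items = list(self._data)
        for op, fn in self._lineage:
            if op == "map":
                items = [fn(x) for x in items]
            elif op == "filter":
                items = [x for x in items if fn(x)]
        return items

    def reduce(self, fn):
        return _reduce(fn, self.collect())

rdd = MiniRDD(range(1, 6))
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(squares.collect())                   # [1, 9, 25]
print(squares.reduce(lambda a, b: a + b))  # 35
```

Because the lineage, not the computed data, is the source of truth, a lost partition can be rebuilt by replaying the transformations – the essence of RDD fault tolerance.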

Spark vs Hadoop

Hadoop has been around for over a decade, offering a functional means of processing Big Data via MapReduce. The problem with Hadoop MapReduce, however, is that it's inefficient for multi-pass algorithms: each job consists of exactly one Map phase and one Reduce phase, so multi-step workflows must be expressed as a chain of separate jobs, with intermediate results written to disk between them. Long story short, Spark is a better solution for projects such as these, as it supports multi-step data pipelines via directed acyclic graph (DAG) patterns.
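The contrast can be sketched in plain Python. This is a conceptual illustration, not Spark's scheduler: the "chained" version materializes every intermediate result (as chained MapReduce jobs do via HDFS), while the "fused" version composes the stages and runs them in a single pass, the way a DAG scheduler can fuse consecutive narrow transformations.

```python
def chained_passes(data, stages):
    """MapReduce-style: each stage fully materializes its output
    before the next stage begins (in Hadoop, a write to HDFS)."""
    for stage in stages:
        data = [stage(x) for x in data]  # full intermediate list each pass
    return data

def fused_pipeline(data, stages):
    """DAG-style: stages are composed and applied element by element
    in one pass, with no intermediate materialization."""
    def composed(x):
        for stage in stages:
            x = stage(x)
        return x
    return [composed(x) for x in data]

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(chained_passes([1, 2, 3], stages))  # [1, 3, 5]
print(fused_pipeline([1, 2, 3], stages))  # [1, 3, 5]
```

Both produce the same answer; the difference is that the fused pipeline touches each element once and never pays for intermediate storage, which is exactly where multi-pass MapReduce chains lose time.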

"Even Hadoop batch jobs were like real-time systems with a delay of 20-30 mins. So with Spark's aggressive in-memory usage, we were able to run the same batch processing systems in under a min. Then we started to think, if we can run one job so fast, it will be nice to have multiple jobs running in a sequence to solve a particular pipeline under a very small time interval," explained Stoica.

Spark Libraries

In addition to the default Spark Core API, the ecosystem includes several other libraries:

  • Spark Streaming – based on micro-batch computing, this library is used for processing real-time streaming data.

  • Spark SQL – used to expose Spark datasets over the JDBC API and to allow the execution of SQL queries on Spark.

  • Spark MLlib – a scalable machine-learning library featuring common algorithms and utilities for classification, clustering, and optimization.

  • Spark GraphX – a Spark API used specifically for graphs and graph computation. It extends the Spark RDD with the introduction of the Resilient Distributed Property Graph.
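The micro-batch model behind Spark Streaming can be illustrated with a few lines of plain Python. This is a conceptual sketch, not the DStream API: an unbounded stream is chopped into small batches, and the same batch computation (here, a running word count) runs on each one.

```python
from collections import Counter

def micro_batches(stream, batch_size):
    """Toy sketch of micro-batching: chop a stream of records into
    fixed-size batches, yielding each batch for processing."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# A running word count, updated once per micro-batch.
totals = Counter()
stream = ["spark", "hadoop", "spark", "mesos", "spark", "yarn", "hadoop"]
for batch in micro_batches(stream, batch_size=3):
    totals.update(batch)  # same batch logic applied to every batch

print(totals["spark"])   # 3
print(totals["hadoop"])  # 2
```

Treating a stream as a sequence of tiny batch jobs is what lets Spark Streaming reuse the same engine, and the same code, as ordinary Spark batch processing.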

Cluster Manager and Distributed Storage System

There are two primary requirements of Apache Spark: a cluster manager and a distributed storage system. Several solutions are available for both. For the cluster manager, programmers can choose from Spark's standalone cluster manager, Hadoop YARN, or Apache Mesos. For the distributed storage system, options include the Hadoop Distributed File System (HDFS), MapR File System, Cassandra, Amazon S3, Kudu, and OpenStack Swift. Alternatively, a custom distributed storage system can be built and used. Regardless, the Spark framework requires both a cluster manager and a distributed storage system of some type to function.
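In practice, the cluster manager is selected with the `--master` option of `spark-submit`. The fragment below is a hedged sketch: the host names, ports, and `app.py` are placeholders, not values from this article.

```shell
# Same (hypothetical) application submitted under each cluster manager;
# only the --master URL changes.

# Spark's own standalone cluster manager
spark-submit --master spark://master-host:7077 app.py

# Hadoop YARN (cluster location comes from the Hadoop configuration)
spark-submit --master yarn app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py
```

Because only the master URL changes, the same application code can move between cluster managers without modification.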

Installing Spark

If you are interested in trying out Spark, you can either install it as a stand-alone framework or use a Spark virtual machine image available from a third-party vendor. Once the framework has been installed, you can connect to it via the Spark shell, which is available in Scala and Python (note: the Spark shell is not available in Java, although this feature may be added later).

To connect via the Scala shell, run the command spark-shell (spark-shell.cmd on Windows). To connect in Python, run pyspark (pyspark.cmd on Windows). When running Spark, regardless of mode, you can view metrics at http://localhost:4040, which brings up the Spark Web Console. From here, you can access a wide range of statistics and information about your jobs.

Check out the official Apache Spark website at http://spark.apache.org/ to learn more about this popular Big Data framework.

Thanks for reading and feel free to let us know your thoughts in the comments below regarding Apache Spark.