Introduction to Spark

Ganga Reddy
Oct 19, 2018


What is Spark?

In Wikipedia's words: Apache Spark is an open-source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

More info can be found on Wikipedia (https://en.wikipedia.org/wiki/Apache_Spark).

An interesting research paper explaining RDDs (Resilient Distributed Datasets), the distributed memory abstraction on which the Spark API is built, can be found here:

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

High-Level Overview of Spark Architecture

Spark Core, as the name suggests, forms the foundation and provides functionality for distributed task dispatching, scheduling, and basic I/O, exposed through an API with support for multiple languages, including but not limited to Java, Python, Scala, and R.

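To make the RDD abstraction and the core API concrete, here is a minimal sketch in Scala (assuming Spark 2.x and a local master; names such as "rdd-sketch" are placeholders): a word count expressed as parallel transformations over an RDD.

import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a real cluster the master is supplied by spark-submit.
    val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    // parallelize() turns a local collection into a distributed RDD;
    // the transformations below run in parallel across its partitions.
    val lines = sc.parallelize(Seq("spark is fast", "spark is fault tolerant"))
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println) // e.g. (spark,2), (is,2), ...
    spark.stop()
  }
}
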
Ecosystem

The Spark ecosystem provides a wide range of tools suitable for a variety of applications. The main advantage of Spark over Hadoop is its in-memory operation, which improves performance for iterative algorithms as well as batch processing.

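As a minimal sketch of that in-memory advantage (reusing the spark session from the sketch above), an iterative job can cache its working set once and reuse it on every pass instead of re-reading the input from disk:

val nums = spark.sparkContext.parallelize(1 to 1000000).cache() // materialized in executor memory on the first action
var evens = 0L
for (_ <- 1 to 10) {
  // each pass reuses the cached partitions rather than recomputing the RDD from its source
  evens = nums.filter(_ % 2 == 0).count()
}
println(evens)
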
Spark SQL: Introduces a new abstraction called DataFrames for manipulating structured and semi-structured data. Apart from the rich API provided in multiple languages, it also offers SQL language support.

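A minimal sketch of the DataFrame and SQL APIs (again reusing a local spark session; the tiny in-memory dataset is purely illustrative, where in practice it would be read from JSON, Parquet, Hive tables, and so on):

import spark.implicits._
// Build a DataFrame from a local collection of (name, age) pairs.
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
// The same query, once through the DataFrame API and once through SQL.
people.filter($"age" > 40).show()
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()
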
Spark Streaming: Supports streaming analytics by ingesting data in configurable mini-batches.

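A minimal standalone sketch of that mini-batch model with the DStream API: word counts over text arriving on a local socket, computed every 5 seconds (the host and port are placeholders; nc -lk 9999 is a common way to feed it while testing):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second mini-batches
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()
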
Spark ML: Provides a rich set of machine learning algorithms for regression, classification, clustering, and recommendation, among other ML problems.

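As a minimal sketch of the DataFrame-based spark.ml API (reusing a local spark session; the feature values are made up), k-means clustering on four toy points:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val training = spark.createDataFrame(Seq(
  (0, Vectors.dense(0.0, 0.0)),
  (1, Vectors.dense(0.1, 0.1)),
  (2, Vectors.dense(9.0, 9.0)),
  (3, Vectors.dense(9.1, 9.1))
)).toDF("id", "features")
val model = new KMeans().setK(2).setSeed(1L).fit(training)
model.clusterCenters.foreach(println) // two centers, roughly (0.05, 0.05) and (9.05, 9.05)
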
Spark was originally built using Akka (Actor Kernel), a toolkit for building highly concurrent, distributed, and resilient message-driven applications for Java and Scala; more recent Spark releases replace it with their own Netty-based RPC layer. Akka is inspired by the actor model of Erlang but built on the modern JVM.
