Introduction to Spark
Overview
Time: min
Objectives
Spark introduction
RDD
- The beginning of Spark. Originator: Matei Zaharia. Started in 2009 as a class project in UC Berkeley's AMPLab, out of the need to do machine learning faster on HDFS.
- Doctoral dissertation (2013): http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
- Hear Matei talk: https://www.youtube.com/watch?v=BFtQrfQ2rn0
- What is Spark?
A general execution engine designed to improve on and replace MapReduce. Spark's operators are a superset of MapReduce's.
- What’s wrong with the original MapReduce?
Limitations of MapReduce:
- Originated around the year 2000; old technology.
- Designed for batch-processing large amounts of web pages at Google.
- And it does that job very well!
- Not fit for:
  - Complex, multi-pass algorithms
  - Interactive ad-hoc queries
  - Real-time stream processing
- Core Spark data abstraction
- Resilient Distributed Dataset (RDD)
RDD — Resilient Distributed Dataset
The features of RDDs (decomposing the name):
- Resilient, i.e. fault-tolerant: able to recompute missing or damaged partitions after node failures.
- Distributed, with data residing on multiple nodes in a cluster.
- Dataset: a collection of partitioned data holding primitive values or composite values, e.g. tuples or other objects that represent the records of the data you work with.
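The "resilient" and "distributed" properties can be sketched with a toy model. This is purely illustrative (the `ToyRDD` class and `make_partitions` helper are hypothetical names, not Spark's actual implementation): data lives in partitions, and a derived partition that is lost can be recomputed from its lineage.

```python
# Toy illustration only -- not Spark's real implementation.
# An "RDD" here is a list of partitions plus the function
# (lineage) needed to rebuild any one partition from its parent.

def make_partitions(data, n):
    """Split data into n interleaved partitions."""
    return [data[i::n] for i in range(n)]

class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = partitions
        self.lineage = lineage  # how to recompute partition i

    def map(self, fn):
        parent = self
        # Compute child partitions, but also remember how to
        # rebuild each one from the parent (the lineage).
        return ToyRDD(
            [[fn(x) for x in p] for p in parent.partitions],
            lineage=lambda i: [fn(x) for x in parent.partitions[i]],
        )

    def recover(self, i):
        """Recompute partition i from lineage after a 'node failure'."""
        self.partitions[i] = self.lineage(i)

base = ToyRDD(make_partitions(range(10), 2))
doubled = base.map(lambda x: x * 2)
doubled.partitions[0] = None   # simulate losing a partition
doubled.recover(0)             # rebuild it from the lineage
print(doubled.partitions[0])   # [0, 4, 8, 12, 16]
```

The key idea this mimics: instead of replicating data for fault tolerance, Spark remembers how each partition was derived and recomputes it on demand.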
Lazy execution
Transformations on an RDD are not computed immediately; Spark records the chain of transformations and only runs it when an action (e.g. count or collect) requires a result.
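Lazy execution can be sketched with a toy pipeline (the `LazyRDD` class is a hypothetical stand-in, not Spark's API): `map` and `filter` only record a plan, and nothing executes until `collect` is called.

```python
# Toy sketch of lazy execution -- not Spark's actual API.
# Transformations (map, filter) only record a plan; the data
# is processed only when an action (collect) is invoked.

class LazyRDD:
    def __init__(self, source, ops=()):
        self.source = source
        self.ops = ops            # recorded transformations (the plan)

    def map(self, fn):
        # No work done here -- just extend the plan.
        return LazyRDD(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return LazyRDD(self.source, self.ops + (("filter", pred),))

    def collect(self):
        """Action: actually run the recorded plan over the source."""
        out = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = LazyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet -- only a two-step plan exists:
print(len(rdd.ops))    # 2
print(rdd.collect())   # [0, 4, 16]
```

Deferring work this way lets the engine see the whole chain of transformations before running anything, which is what enables Spark to plan and optimize multi-pass computations.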
Key Points