Introduction to Spark
Overview
Time: min
Objectives
Spark introduction
RDD
- The beginning of Spark. Originator: Matei Zaharia. Started in 2009 as a class project in UC Berkeley's AMPLab, out of the need to do machine learning faster on HDFS.
- Doctoral dissertation (2013): http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
- Hear Matei talk: https://www.youtube.com/watch?v=BFtQrfQ2rn0
- What is Spark?
A general execution engine designed to improve on and replace MapReduce. Spark's operators are a superset of MapReduce's.
- What’s wrong with the original MapReduce?
Limitations of MapReduce:
- Originated around the year 2000; old technology.
- Designed for batch-processing large amounts of web pages at Google.
- And it does that job very well!
- Not fit for:
  - Complex, multi-pass algorithms
  - Interactive ad-hoc queries
  - Real-time stream processing
- Core Spark data abstraction
- Resilient Distributed Dataset (RDD)
RDD — Resilient Distributed Dataset
The features of RDDs (decomposing the name):
- Resilient, i.e. fault-tolerant: able to recompute missing or damaged partitions after node failures.
- Distributed, with data residing on multiple nodes in a cluster.
- Dataset: a collection of partitioned data holding primitive values or composite values, e.g. tuples or other objects that represent the records of the data you work with.
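The "resilient" and "distributed" properties can be sketched with a toy model. This is purely illustrative (the `ToyRDD` class and `make_partitions` helper are hypothetical names, not Spark's actual implementation): data lives in partitions, and a derived partition that is lost can be recomputed from its lineage.

```python
# Toy illustration only -- not Spark's real implementation.
# An "RDD" here is a list of partitions plus the function
# (lineage) needed to rebuild any one partition from its parent.

def make_partitions(data, n):
    """Split data into n interleaved partitions."""
    return [data[i::n] for i in range(n)]

class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = partitions
        self.lineage = lineage  # how to recompute partition i

    def map(self, fn):
        parent = self
        # Compute child partitions, but also remember how to
        # rebuild each one from the parent (the lineage).
        return ToyRDD(
            [[fn(x) for x in p] for p in parent.partitions],
            lineage=lambda i: [fn(x) for x in parent.partitions[i]],
        )

    def recover(self, i):
        """Recompute partition i from lineage after a 'node failure'."""
        self.partitions[i] = self.lineage(i)

base = ToyRDD(make_partitions(range(10), 2))
doubled = base.map(lambda x: x * 2)
doubled.partitions[0] = None   # simulate losing a partition
doubled.recover(0)             # rebuild it from the lineage
print(doubled.partitions[0])   # [0, 4, 8, 12, 16]
```

The key idea this mimics: instead of replicating data for fault tolerance, Spark remembers how each partition was derived and recomputes it on demand.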
Lazy execution
Transformations on an RDD are not computed immediately; Spark records the chain of transformations and only runs it when an action (e.g. count or collect) requires a result.
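Lazy execution can be sketched with a toy pipeline (the `LazyRDD` class is a hypothetical stand-in, not Spark's API): `map` and `filter` only record a plan, and nothing executes until `collect` is called.

```python
# Toy sketch of lazy execution -- not Spark's actual API.
# Transformations (map, filter) only record a plan; the data
# is processed only when an action (collect) is invoked.

class LazyRDD:
    def __init__(self, source, ops=()):
        self.source = source
        self.ops = ops            # recorded transformations (the plan)

    def map(self, fn):
        # No work done here -- just extend the plan.
        return LazyRDD(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return LazyRDD(self.source, self.ops + (("filter", pred),))

    def collect(self):
        """Action: actually run the recorded plan over the source."""
        out = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = LazyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet -- only a two-step plan exists:
print(len(rdd.ops))    # 2
print(rdd.collect())   # [0, 4, 16]
```

Deferring work this way lets the engine see the whole chain of transformations before running anything, which is what enables Spark to plan and optimize multi-pass computations.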
Key Points