Setup
Overview
Time: min
Objectives
Install PySpark with the command !pip install pyspark in Google Colab or a Jupyter notebook (a quick check is sketched below)
Install and set up Zoom if needed
Download data files for this workshop
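If you want to verify the installation ahead of time, a minimal sketch like the following (the application name is just a placeholder) starts a local Spark session and prints its version:

```python
# In a Colab or Jupyter notebook cell, the leading "!" runs a shell command:
# !pip install pyspark

# Minimal check that PySpark is installed and can start a local session.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")              # run locally, using all available cores
         .appName("pyspark-workshop")     # placeholder application name
         .getOrCreate())

print(spark.version)  # prints the installed Spark version if everything works
```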
Data files:
Please download the following file(s) to participate in the workshop:
Link to Data Set: https://uofi.box.com/s/1kkchfi519vm8sqpsfgc795l9lkw2zln
Link to IPython File-1: https://uofi.box.com/s/vno8qz8pm6sv45aebupoz83tcp45uyq5
About the Data Used in this Workshop:
The dataset used is the well-known Titanic passenger dataset.
Attributes of this data set:
- PassengerId
- Survived
- Pclass
- Name
- Sex
- Age
- SibSp
- Parch
- Ticket
- Fare
- Cabin
- Embarked
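As a hedged sketch of what the first step in the workshop notebook might look like, the snippet below loads the downloaded data into a Spark DataFrame. The file name train.csv is only an assumption; substitute the actual name of the file you downloaded.

```python
# Load the workshop data into a Spark DataFrame and take a first look.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-data").getOrCreate()

# "train.csv" is an assumed file name; use the name of the file you downloaded.
df = spark.read.csv("train.csv", header=True, inferSchema=True)

df.printSchema()   # should list the attributes above (PassengerId, Survived, ...)
df.show(5)         # peek at the first five rows
```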
Key Points
Introduction
Overview
Time: 0 min
Objectives
Introduction to Big Data
Introduction to the Spark framework
FIXME
Key Points
Introduction to Big Data
Overview
Time: min
Objectives
Defining big data and how to handle it
Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it's not the amount of data that's important; it's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
- To get a better understanding of what big data is, it is often described using five Vs (many people believe there are four Vs, but I think five are more appropriate).
- Volume refers to the vast amount of data generated every second.
- Velocity refers to the speed at which new data is generated and the speed at which data moves around.
- Variety refers to the different types of data we can now use.
- Veracity refers to the messiness or trustworthiness of data.
- There is another V to take into account when looking at big data: Value.
Key Points
Introduction to Hadoop
Overview
Time: min
Objectives
Learn about Hadoop as a platform for big data analytics
- An open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
- It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
- Handles thousands of nodes and petabytes of data.
Hadoop subprojects:
- MapReduce: a software framework for distributed processing of large data sets on computer clusters (other subprojects include Hive and HBase)
- HDFS: Hadoop Distributed File System with high throughput access to application data
Goals / Requirements:
- Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
- Structured and non-structured data
- Simple programming models
- High scalability and availability
- Use commodity (cheap!) hardware with little redundancy
- Fault-tolerance
- Move computation rather than data
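To make the MapReduce programming model concrete, here is the classic word-count example. It is sketched with PySpark's RDD API (the tool used in this workshop) rather than Hadoop's native Java MapReduce API, but the map and reduce steps express the same idea:

```python
# Word count: the canonical MapReduce example, sketched with PySpark's RDD API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# Tiny in-memory sample; in Hadoop this would be a large file on HDFS.
lines = sc.parallelize(["big data is big", "hadoop moves computation to data"])

counts = (lines
          .flatMap(lambda line: line.split())    # map: split each line into words
          .map(lambda word: (word, 1))           # map: emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))      # reduce: sum the counts per word

print(counts.collect())   # e.g. [('big', 2), ('data', 2), ('is', 1), ...]
```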
Key Points
Introduction to Spark
Overview
Time: min
Objectives
Spark introduction
RDD
The beginning of Spark
- Originator: Matei Zaharia. Spark started in 2009 as a class project in UC Berkeley's AMPLab, out of the need to do machine learning faster on data stored in HDFS.
- Doctoral dissertation (2013): http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
- Hear Matei talk about it: https://www.youtube.com/watch?v=BFtQrfQ2rn0

What is Spark?
- A general execution engine designed to improve on and replace MapReduce. Spark's operators are a superset of MapReduce's.
What's wrong with the original MapReduce?
Limitations of MapReduce:
- Originated around the year 2000, so it is old technology.
- Designed for batch-processing large numbers of web pages at Google, and it does that job very well!
- Not a good fit for:
  - Complex, multi-pass algorithms
  - Interactive ad-hoc queries
  - Real-time stream processing
- Core Spark data abstraction: the Resilient Distributed Dataset (RDD)
RDD — Resilient Distributed Dataset
The features of RDDs (decomposing the name):
- Resilient, i.e. fault-tolerant, so able to recompute missing or damaged partitions due to node failures.
- Distributed with data residing on multiple nodes in a cluster.
- Dataset is a collection of partitioned data with primitive values or nested values, e.g. tuples or other objects (that represent records of the data you work with).
Lazy execution: transformations on an RDD are not computed until an action requires a result.
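A short sketch of lazy execution in practice, using made-up numbers rather than the workshop data: the transformations below only record lineage, and nothing runs until the final action is called.

```python
# Lazy execution: transformations (map, filter) only record lineage;
# computation happens when an action (collect, count, ...) is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-laziness").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))                 # a small distributed dataset

squares = rdd.map(lambda x: x * x)              # transformation: nothing executes yet
evens = squares.filter(lambda x: x % 2 == 0)    # still nothing executes

print(evens.collect())                          # action: work runs now -> [0, 4, 16, 36, 64]
```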
Key Points
Spark SQL
Overview
Time: min
Objectives
Introduction to Spark SQL
- A new module in Apache Spark that integrates relational processing with Spark’s functional programming API.
- Offers much tighter integration between relational and procedural processing
- Includes a highly extensible optimizer, Catalyst, that makes it easy to add data sources, optimization rules, and data types.
Goals for Spark SQL
- Support relational processing both within Spark programs and on external data sources, using a programmer-friendly API.
- Provide high performance using established DBMS techniques.
- Easily support new data sources, including semi-structured data and external databases amenable to query federation.
- Enable extension with advanced analytics algorithms such as graph processing and machine learning.
What are DataFrames?
DataFrames are a recent addition to Spark (early 2015).
The DataFrames API:
- is intended to enable wider audiences beyond "Big Data" engineers to leverage the power of distributed processing
- is inspired by data frames in R and Python (Pandas)
- is designed from the ground up to support modern big data and data science applications
- is an extension to the existing RDD API
See databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
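To tie DataFrames and Spark SQL together, here is a hedged sketch against the workshop data (again assuming the downloaded file is named train.csv): it answers the same kind of question in both the procedural DataFrame style and the relational SQL style, and asks Catalyst for the optimized query plan.

```python
# Spark SQL sketch: the DataFrame API and SQL over the same data.
# The file name "train.csv" is an assumption; the column names follow the attribute list above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.read.csv("train.csv", header=True, inferSchema=True)

# Procedural (DataFrame) style: average fare per passenger class.
df.groupBy("Pclass").avg("Fare").show()

# Relational style: register a temporary view and query it with SQL.
df.createOrReplaceTempView("passengers")
result = spark.sql("SELECT Sex, AVG(Age) AS avg_age FROM passengers GROUP BY Sex")
result.show()

# Ask Catalyst for the physical plan it produced for the query.
result.explain()
```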
Key Points