This lesson is still being designed and assembled (Pre-Alpha version)

Introduction to Big Data with Spark Part 1

Setup

Overview

Time: min
Objectives
  • Install PySpark by running the command !pip install pyspark in a Google Colab or Jupyter notebook

  • Install and set up Zoom if needed

  • Download data files for this workshop
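After the install step, a quick check like the one below can confirm that the package is visible to the notebook's Python interpreter. This is a minimal sketch that uses only the standard library; the helper name is_installed is hypothetical and not part of PySpark.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("pyspark"):
        print("pyspark is available")
    else:
        print("pyspark not found; run: pip install pyspark")
```

The check only looks the package up and does not import it, so it does not start Spark.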

Data files:

Please download the following file(s) to participate in the workshop:

Link to Data Set: https://uofi.box.com/s/1kkchfi519vm8sqpsfgc795l9lkw2zln

Link to IPython File-1: https://uofi.box.com/s/vno8qz8pm6sv45aebupoz83tcp45uyq5

About the Data Used in this Workshop:

The dataset used is a sample insurance dataset.

Attributes of this data set:

Key Points


Introduction

Overview

Time: 0 min
Objectives
  • Introduction to big data

  • Introduction to the Spark framework

Key Points


Introduction to Big Data

Overview

Time: min
Objectives
  • Define big data and how it is handled

Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it is not the amount of data that is important; what matters is what organizations do with it. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Key Points


Introduction to Hadoop

Overview

Time: min
Objectives
  • Learn about Hadoop as a platform for big data analytics

Hadoop handles thousands of nodes and petabytes of data.

Hadoop subprojects:

Goals / Requirements:

Key Points


Introduction to Spark

Overview

Time: min
Objectives
  • Spark introduction

  • RDD

Spark is a general execution engine designed to improve on, and in many cases replace, MapReduce. Spark's operators are a superset of MapReduce's map and reduce operations.
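The map, shuffle and reduce phases of the MapReduce model can be sketched in plain Python with a word count. This is an illustration of the programming model only, not Hadoop or Spark code; all function names here are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to get the final count per word."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in order on a list of text lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data with spark", "spark improves on mapreduce", "big clusters"]
    print(word_count(sample))
```

In PySpark the same pipeline is commonly written on an RDD with flatMap, map and reduceByKey, alongside many other operators that have no direct MapReduce counterpart.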

Limitations of MapReduce.

RDD — Resilient Distributed Dataset

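The features of RDDs (decomposing the name):

  • Resilient: a lost partition can be recomputed from the lineage of operations that produced it

  • Distributed: the data is partitioned across the nodes of a cluster

  • Dataset: a collection of records that is operated on in parallel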

Lazy execution
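In Spark, transformations such as map and filter only describe a computation; work starts when an action such as collect or count asks for a result. The idea can be imitated in plain Python with generators. The sketch below is an analogy, not Spark code, and the names lazy_map, lazy_filter and build_pipeline are hypothetical.

```python
def lazy_map(func, items, log):
    """Yield func(item) one at a time; nothing runs until a consumer asks for values."""
    for item in items:
        log.append(f"map {item}")
        yield func(item)


def lazy_filter(pred, items, log):
    """Yield only the items for which pred(item) is true, again on demand."""
    for item in items:
        log.append(f"filter {item}")
        if pred(item):
            yield item


def build_pipeline(numbers, log):
    """Chain two 'transformations'; this only wires generators together."""
    squared = lazy_map(lambda x: x * x, numbers, log)
    return lazy_filter(lambda x: x % 2 == 0, squared, log)


if __name__ == "__main__":
    log = []
    pipeline = build_pipeline(range(5), log)
    print("work done after building the pipeline:", len(log))
    result = list(pipeline)  # plays the role of an action: forces evaluation
    print("result:", result)
    print("work done after the action:", len(log))
```

Because nothing runs until the final step, an engine that sees the whole chain can plan it as one pass over the data instead of materializing every intermediate result.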

Key Points


Spark SQL

Overview

Time: min
Objectives
  • Introduction to Spark SQL

Goals for Spark SQL

DataFrames were added to Spark in early 2015.

The DataFrames API:

See databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
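A small DataFrame aggregation might look like the sketch below. It assumes PySpark and a Java runtime are available. The column names region and charges and the sample rows are hypothetical, since the attributes of the workshop's insurance dataset are not listed in this lesson.

```python
# Hypothetical sample rows: (region, charges). These are illustrative values,
# not taken from the workshop dataset.
SAMPLE_ROWS = [
    ("northeast", 1200.0),
    ("northeast", 1800.0),
    ("southwest", 900.0),
    ("southwest", 1100.0),
]


def average_charges_by_region(rows):
    """Build a DataFrame from (region, charges) tuples and average charges per region."""
    # Imported inside the function so this file can be loaded where PySpark is absent.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("dataframe-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, schema=["region", "charges"])
        # groupBy/agg only describe the computation; collect() triggers it.
        result = df.groupBy("region").agg(F.avg("charges").alias("avg_charges"))
        return {row["region"]: row["avg_charges"] for row in result.collect()}
    finally:
        # Ends the session; omit this in a notebook that reuses one session.
        spark.stop()
```

Calling average_charges_by_region(SAMPLE_ROWS) returns a dictionary that maps each region to its mean charge. The same grouping could also be expressed as a SQL query after registering the DataFrame with createOrReplaceTempView.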

Key Points