Setup
Overview
Time: min
Objectives
Install PySpark with the command !pip install pyspark in Google Colab or a Jupyter notebook (a quick check is sketched below)
Install and set up Zoom if needed
Download data files for this workshop
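If you want to verify the installation ahead of time, a minimal sketch like the following (the application name is just a placeholder) starts a local Spark session and prints its version:

```python
# In a Colab or Jupyter notebook cell, the leading "!" runs a shell command:
# !pip install pyspark

# Minimal check that PySpark is installed and can start a local session.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")              # run locally, using all available cores
         .appName("pyspark-workshop")     # placeholder application name
         .getOrCreate())

print(spark.version)  # prints the installed Spark version if everything works
```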
Data files:
Please download the following file(s) to participate in the workshop:
Link to Data Set: https://uofi.box.com/s/1kkchfi519vm8sqpsfgc795l9lkw2zln
Link to IPython File-1: https://uofi.box.com/s/vno8qz8pm6sv45aebupoz83tcp45uyq5
About the Data Used in this Workshop:
The dataset used is the well-known Titanic passenger dataset.
Attributes of this data set:
- PassengerId
- Survived
- Pclass
- Name
- Sex
- Age
- SibSp
- Parch
- Ticket
- Fare
- Cabin
- Embarked
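As a hedged sketch of what the first step in the workshop notebook might look like, the snippet below loads the downloaded data into a Spark DataFrame. The file name train.csv is only an assumption; substitute the actual name of the file you downloaded.

```python
# Load the workshop data into a Spark DataFrame and take a first look.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-data").getOrCreate()

# "train.csv" is an assumed file name; use the name of the file you downloaded.
df = spark.read.csv("train.csv", header=True, inferSchema=True)

df.printSchema()   # should list the attributes above (PassengerId, Survived, ...)
df.show(5)         # peek at the first five rows
```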
Key Points
Introduction
Overview
Time: 0 min
Objectives
Introduction to Big Data
Introduction to the Spark framework
FIXME
Key Points
Introduction to Big Data
Overview
Time: min
Objectives
Defining big data and how to handle it
Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it's not the amount of data that's important; it's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
- To get a better understanding of what big data is, it is often described using five Vs (many people believe there are four Vs, but I think five are more appropriate).
- Volume refers to the vast amount of data generated every second.
- Velocity refers to the speed at which new data is generated and the speed at which data moves around.
- Variety refers to the different types of data we can now use.
- Veracity refers to the messiness or trustworthiness of data.
- There is another V to take into account when looking at big data: Value.
Key Points
Introduction to Hadoop
Overview
Time: min
Objectives
Learn about Hadoop as a platform for big data analytics
- An open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
- It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
- Handles thousands of nodes and petabytes of data.
Hadoop subprojects:
- MapReduce: a software framework for distributed processing of large data sets on computer clusters (other subprojects include Hive and HBase)
- HDFS: Hadoop Distributed File System with high throughput access to application data
Goals / Requirements:
- Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
- Structured and non-structured data
- Simple programming models
- High scalability and availability
- Use commodity (cheap!) hardware with little redundancy
- Fault-tolerance
- Move computation rather than data
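To make the MapReduce programming model concrete, here is the classic word-count example. It is sketched with PySpark's RDD API (the tool used in this workshop) rather than Hadoop's native Java MapReduce API, but the map and reduce steps express the same idea:

```python
# Word count: the canonical MapReduce example, sketched with PySpark's RDD API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# Tiny in-memory sample; in Hadoop this would be a large file on HDFS.
lines = sc.parallelize(["big data is big", "hadoop moves computation to data"])

counts = (lines
          .flatMap(lambda line: line.split())    # map: split each line into words
          .map(lambda word: (word, 1))           # map: emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))      # reduce: sum the counts per word

print(counts.collect())   # e.g. [('big', 2), ('data', 2), ('is', 1), ...]
```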
Key Points
Introduction to Spark
Overview
Time: min
Objectives
Spark introduction
RDD
The beginning of Spark
- Originator: Matei Zaharia. Spark started in 2009 as a class project in UC Berkeley's AMPLab, out of the need to do machine learning faster on data stored in HDFS.
- Doctoral dissertation (2013): http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
- Hear Matei talk about it: https://www.youtube.com/watch?v=BFtQrfQ2rn0

What is Spark?
- A general execution engine designed to improve on and replace MapReduce. Spark's operators are a superset of MapReduce's.
What's wrong with the original MapReduce?
Limitations of MapReduce:
- Originated around the year 2000, so it is old technology.
- Designed for batch-processing large numbers of web pages at Google, and it does that job very well!
- Not a good fit for:
  - Complex, multi-pass algorithms
  - Interactive ad-hoc queries
  - Real-time stream processing
- Core Spark data abstraction: the Resilient Distributed Dataset (RDD)
RDD — Resilient Distributed Dataset
The features of RDDs (decomposing the name):
- Resilient, i.e. fault-tolerant, so able to recompute missing or damaged partitions due to node failures.
- Distributed with data residing on multiple nodes in a cluster.
- Dataset is a collection of partitioned data with primitive values or nested values, e.g. tuples or other objects (that represent records of the data you work with).
Lazy execution: transformations on an RDD are not computed until an action requires a result.
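A short sketch of lazy execution in practice, using made-up numbers rather than the workshop data: the transformations below only record lineage, and nothing runs until the final action is called.

```python
# Lazy execution: transformations (map, filter) only record lineage;
# computation happens when an action (collect, count, ...) is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-laziness").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))                 # a small distributed dataset

squares = rdd.map(lambda x: x * x)              # transformation: nothing executes yet
evens = squares.filter(lambda x: x % 2 == 0)    # still nothing executes

print(evens.collect())                          # action: work runs now -> [0, 4, 16, 36, 64]
```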
Key Points
Spark SQL
Overview
Time: min
Objectives
Introduction to Spark SQL
- A new module in Apache Spark that integrates relational processing with Spark’s functional programming API.
- Offers much tighter integration between relational and procedural processing
- Includes a highly extensible optimizer, Catalyst, that makes it easy to add data sources, optimization rules, and data types.
Goals for Spark SQL
- Support relational processing both within Spark programs and on external data sources, using a programmer-friendly API.
- Provide high performance using established DBMS techniques.
- Easily support new data sources, including semi-structured data and external databases amenable to query federation.
- Enable extension with advanced analytics algorithms such as graph processing and machine learning.
What are DataFrames?
DataFrames are a recent addition to Spark (early 2015).
The DataFrames API:
- is intended to enable wider audiences beyond "Big Data" engineers to leverage the power of distributed processing
- is inspired by data frames in R and Python (Pandas)
- is designed from the ground up to support modern big data and data science applications
- is an extension to the existing RDD API
See databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
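To tie DataFrames and Spark SQL together, here is a hedged sketch against the workshop data (again assuming the downloaded file is named train.csv): it answers the same kind of question in both the procedural DataFrame style and the relational SQL style, and asks Catalyst for the optimized query plan.

```python
# Spark SQL sketch: the DataFrame API and SQL over the same data.
# The file name "train.csv" is an assumption; the column names follow the attribute list above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.read.csv("train.csv", header=True, inferSchema=True)

# Procedural (DataFrame) style: average fare per passenger class.
df.groupBy("Pclass").avg("Fare").show()

# Relational style: register a temporary view and query it with SQL.
df.createOrReplaceTempView("passengers")
result = spark.sql("SELECT Sex, AVG(Age) AS avg_age FROM passengers GROUP BY Sex")
result.show()

# Ask Catalyst for the physical plan it produced for the query.
result.explain()
```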
Key Points