Spark SQL
Overview
Time: min
Objectives
Introduction to Spark SQL
- A new module in Apache Spark that integrates relational processing with Spark’s functional programming API.
- Offers much tighter integration between relational and procedural processing (see the sketch after this list)
- Includes a highly extensible optimizer, Catalyst, that makes it easy to add data sources, optimization rules, and data types.
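To make the relational/procedural mix concrete, here is a minimal sketch meant for spark-shell, where the SparkSession `spark` is predefined; the table name, columns, and data are made up for illustration:

```scala
// Paste into spark-shell; `spark` is the predefined SparkSession.
import spark.implicits._

// Procedural side: build a DataFrame from an ordinary Scala collection.
val people = Seq(("alice", 34), ("bob", 19)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Relational side: query the same data with SQL...
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 21")

// ...then drop straight back into functional code on the result.
adults.map(row => row.getString(0).toUpperCase).show()
```

The same data is visible to both styles, so a pipeline can alternate freely between SQL and functional transformations.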
Goals for Spark SQL
- Support relational processing both within Spark programs and on external data sources using a programmer-friendly API.
- Provide high performance using established DBMS techniques.
- Easily support new data sources, including semi-structured data and external databases amenable to query federation (a sketch follows this list).
- Enable extension with advanced analytics algorithms such as graph processing and machine learning.
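As one illustration of the data-source goal, the hedged spark-shell sketch below reads a semi-structured JSON source and a relational table through the built-in JDBC data source, then joins them in a single query plan; the paths, connection URL, and column names are invented for the example:

```scala
import spark.implicits._

// Semi-structured data: Spark SQL infers a schema from the JSON records.
// (Path is hypothetical.)
val logs = spark.read.json("s3a://my-bucket/logs/*.json")

// External database via the built-in JDBC data source.
// (URL and table are hypothetical.)
val users = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/prod")
  .option("dbtable", "users")
  .load()

// One query over both sources; Catalyst can push filters
// down into the JDBC source (simple query federation).
logs.join(users, "user_id")
  .where($"status" === 500)
  .show()
```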
What are DataFrames?
DataFrames are a recent addition to Spark (early 2015).
The DataFrames API:
- is intended to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing
- is inspired by data frames in R and Python (Pandas)
- is designed from the ground up to support modern big data and data science applications
- is an extension to the existing RDD API (a short sketch follows the link below)
See databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
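To show the flavor the R/Pandas comparison is pointing at, here is a hedged spark-shell sketch; the file path and column names are invented:

```scala
import org.apache.spark.sql.functions.avg
import spark.implicits._

// Load semi-structured data into a DataFrame. (Path is hypothetical.)
val df = spark.read.json("examples/people.json")

// Declarative, column-oriented operations, much like data frames in R or Pandas:
df.filter($"age" > 21)
  .groupBy($"dept")
  .agg(avg($"salary").as("avg_salary"))
  .show()
```

Unlike raw RDD code, these operations describe *what* to compute; Catalyst decides *how*, which is what lets the same API serve audiences beyond "Big Data" engineers.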
Key Points