Introduction to Hadoop
Overview
Time: min
Objectives
Learn about Hadoop as a platform for solving Big Data analytics problems
- An open-source framework for reliable, scalable, distributed computing and data storage.
- A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
- Handles thousands of nodes and petabytes of data
Hadoop subprojects:
- MapReduce: a software framework for distributed processing of large data sets on computer clusters; other subprojects include Hive, HBase, … (a word-count sketch follows this list)
- HDFS: the Hadoop Distributed File System, providing high-throughput access to application data
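To make the MapReduce bullet concrete, here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API. The input and output paths come from the command line, and the class names (WordCount, TokenMapper, SumReducer) are illustrative, not part of Hadoop itself:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the per-word counts produced by the mappers.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregates locally on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the combiner simply reuses the reducer to pre-aggregate counts on each node, which cuts down the data shuffled across the network between the map and reduce phases.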
Goals / Requirements:
- Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
- Structured and unstructured data
- Simple programming models
- High scalability and availability
- Use commodity (cheap!) hardware with little redundancy
- Fault-tolerance
- Move computation to the data rather than data to the computation (see the sketch below)
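The "move computation" goal is visible in the HDFS client API: the NameNode reports which DataNodes hold each block's replicas, and the scheduler uses those host lists to place map tasks next to their data. Below is a small sketch that prints the block locations for a file; the path argument is a placeholder, and the configuration is assumed to be picked up from the cluster's core-site.xml/hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads cluster config from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);            // e.g. an HDFS file path (placeholder)
    FileStatus status = fs.getFileStatus(file);

    // Each block is replicated on several DataNodes; a data-local scheduler
    // runs map tasks on one of these hosts instead of copying the block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
    fs.close();
  }
}
```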
Key Points