Introduction to Hadoop
Overview
Time: min
Objectives
Learn about Hadoop as a platform for solving Big Data analytics problems
- An open-source framework for reliable, scalable, distributed computing and data storage.
- A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
- Handles thousands of nodes and petabytes of data
Hadoop subprojects:
- MapReduce: a software framework for distributed processing of large data sets on computer clusters; other subprojects include Hive, HBase, … (a word-count sketch follows this list)
- HDFS: the Hadoop Distributed File System, providing high-throughput access to application data
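To make the MapReduce bullet concrete, here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API. The input and output paths come from the command line, and the class names (WordCount, TokenMapper, SumReducer) are illustrative, not part of Hadoop itself:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the per-word counts produced by the mappers.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregates locally on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the combiner simply reuses the reducer to pre-aggregate counts on each node, which cuts down the data shuffled across the network between the map and reduce phases.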
Goals / Requirements:
- Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
- Structured and unstructured data
- Simple programming models
- High scalability and availability
- Use commodity (cheap!) hardware with little redundancy
- Fault-tolerance
- Move computation to the data rather than data to the computation (see the sketch below)
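The "move computation" goal is visible in the HDFS client API: the NameNode reports which DataNodes hold each block's replicas, and the scheduler uses those host lists to place map tasks next to their data. Below is a small sketch that prints the block locations for a file; the path argument is a placeholder, and the configuration is assumed to be picked up from the cluster's core-site.xml/hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads cluster config from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);            // e.g. an HDFS file path (placeholder)
    FileStatus status = fs.getFileStatus(file);

    // Each block is replicated on several DataNodes; a data-local scheduler
    // runs map tasks on one of these hosts instead of copying the block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
    fs.close();
  }
}
```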
Key Points