Getting Started with Hadoop
Learn all about the ecosystem and get started with Hadoop today. Choose where to begin and learn at your own pace:
Hadoop Concepts
Learn from scratch and master Hadoop
- Hadoop Tutorial
- History of Hadoop
- What is New in Hadoop 3?
- Features of Hadoop
- Hadoop Ecosystem
- Hadoop Architecture
- Advantages and Disadvantages
- Hadoop Analytics Tools
- How Hadoop Works Internally
- Hadoop Commands
- Hadoop getmerge Command
- Hadoop copyFromLocal Command
- Hadoop High Availability
- Hadoop Schedulers
- Distributed Cache in Hadoop
- Hadoop Automatic Failover
- Limitations of Hadoop
- HBase Compaction
- Hadoop 2.6 Multi Node Cluster
- Spark Hadoop Cloudera Certifications
- Hadoop Career Growth
- Future of Hadoop
- Hadoop Job Roles
- Hadoop Developer Salary
- Hadoop Books
- Best Hadoop Books
- Best Hadoop Administration Books
Hadoop Advanced Concepts
- Hadoop Ecosystem Infographic
- Hadoop Ecosystem
- Introduction to Hadoop Security
- Hadoop MapReduce Tutorial
- Kafka Hadoop Integration
- Big Data Use Cases – Case Studies
- What is Hadoop Cluster?
- Hadoop Cluster
- Hadoop Streaming
- Hadoop Mapper in MapReduce
- Big Data Terminologies
- Hadoop Applications
- Hadoop and Data Warehouse
- Why Hadoop is Important?
- Big Data and Hadoop Job Opportunities
- Setup Hadoop CDH3 on Ubuntu
Hadoop Installation
- Install Hadoop on Single Machine
- Install Hadoop on Ubuntu
- Install Hadoop on CentOS
- Install Pivotal Hadoop v-2 Cluster in production
- Install Hadoop 1.x on multi-node cluster
- Install Hadoop 2 on Ubuntu
- Install Hadoop 2 on Ubuntu 16.04
- Install Hadoop 2 with YARN
- Install YARN with Hadoop 2
- Install Hadoop 2.7 on Ubuntu
- Install Hadoop 3 on Ubuntu
HDFS
Move ahead to HDFS
- Introduction to HDFS
- Apache Hadoop HDFS Tutorial
- HDFS Architecture
- Features of HDFS
- HDFS Read-Write Operations
- HDFS Data Read Operation
- HDFS Data Write Operation
- HDFS Commands- Part 1
- HDFS Commands- Part 2
- HDFS Commands- Part 3
- HDFS Commands- Part 4
- HDFS Data Blocks
- HDFS Rack Awareness
- HDFS High Availability
- HDFS NameNode High Availability
- HDFS Federation- Architecture & Benefits
- HDFS Disk Balancer
- Erasure Coding in HDFS
- Fault Tolerance in HDFS
MapReduce
Get to know MapReduce
- Introduction to MapReduce
- MapReduce Data Flow
- How Hadoop MapReduce Works
- MapReduce Mapper
- MapReduce Reducer
- MapReduce Key-Value Pairs
- MapReduce InputFormat
- MapReduce InputSplit
- MapReduce RecordReader
- MapReduce Partitioner
- MapReduce Combiner
- Shuffling-Sorting in MapReduce
- MapReduce OutputFormat
- MapReduce InputSplit vs Blocks
- MapReduce Map Only Job
- Data Locality in MapReduce
- MapReduce Speculative Execution
- Counters in MapReduce
- MapReduce Job Optimization
- Performance Tuning in MapReduce
- Apache Spark vs Hadoop MapReduce
Hive
Step into the world of Hive
- Introduction to Apache Hive
- A Comprehensive Guide to Apache Hive
- Hive Environment Setup- Ubuntu
- Hive Features and Limitations
- Apache Hive Architecture
- Apache Hive Data Types
- Apache Hive Built-in Operators
- Built-In Functions in Hive
- User-Defined Functions (UDF) in Hive
- Hive DDL Commands and Types
- Views and Indexes in Hive
- Configuring Hive Metastores
- Developing Data Models in Hive
- Hive Custom and Built-in SerDe
- Hive Data Partitioning
- Bucketing in Hive
- Hive Partitioning vs Bucketing
- Apache Hive Joins and Types
- Map Join in Hive
- Bucket Map Join in Hive
- Skew Join in Hive
- Hive SMB (Sort Merge Bucket) Join
- Hive Internal vs External Tables
- Configuring Hive Metastore to MySQL
- HiveQL (Hive Query Language) Select Statement
- HiveQL Group By Clause
- HiveQL Order By Clause
- 7 Best Hive Optimization Techniques
- HBase vs Hive
- Pig vs Hive
- Impala vs Hive
- Best Hive Books
Impala
Dive into Apache Impala
- Introduction to Impala
- Impala Environment Setup
- Features of Impala
- Impala Architecture
- Impala Use Cases
- Impala Built-in Functions
- Impala User Defined Functions (UDF)
- Impala Data Types
- Comments in Impala
- Introduction to Impala SQL (Impala Query Language)
- Selecting a Database with Hue Browser- Impala SQL
- CREATE DATABASE in Impala SQL
- DROP DATABASE in Impala SQL
- DESCRIBE Statement in Impala SQL
- SELECT Statement in Impala SQL
- CREATE TABLE Statement in Impala SQL
- DROP TABLE Statement in Impala SQL
- INSERT Statement in Impala SQL
- TRUNCATE TABLE Statement in Impala SQL
- SHOW Statement in Impala SQL
- CREATE VIEW Statement in Impala SQL
- DROP VIEW Statement in Impala SQL
- ALTER VIEW Statement in Impala SQL
- ALTER TABLE Statement in Impala SQL
- ORDER BY Clause in Impala SQL
- GROUP BY Clause in Impala SQL
- LIMIT Clause in Impala SQL
- HAVING Clause in Impala SQL
- WITH Clause in Impala SQL
- UNION Clause in Impala SQL
- OFFSET Clause in Impala SQL
- DISTINCT Operator in Impala SQL
- Impala Shell Commands
- Troubleshooting Performance Tuning in Impala
- Impala Security Guidelines
- Pros and Cons of Impala
- Best Impala Books
HBase
Play around with HBase
- Introduction to HBase
- Features of HBase
- HBase Architecture
- HBase Pros & Cons
- HBase Use Cases
- HBase Shell Commands and Usage
- HBase Read & Write Operations
- HBase Commands to Define and Manipulate Data
- HBase Table Management Commands
- HBase Data Manipulation Commands- Create, Truncate, Scan
- HBase Admin API
- HBase Client API
- HBase MemStore Configuration and Benefits
- HBase Optimization: Performance Tuning
- HBase Compaction and Data Locality in Hadoop
- A Comprehensive Guide to Apache HBase
- HBase + MapReduce Integration
- HBase vs RDBMS
- HBase vs Impala
- HBase vs Hive
- HBase Security: Kerberos Authentication and Authorization
- Troubleshooting in HBase
- HBase Career Opportunities
- Best HBase Books
Pig
Say Hello to Apache Pig
- Introduction to Pig
- Pig Environment Setup
- Apache Pig Features
- Apache Pig Architecture
- A Comprehensive Guide to Apache Pig
- Pros and Cons of Pig
- Pig Architecture & Execution Modes
- Pig Grunt Shell Commands
- Pig Built-in Functions
- User-Defined Functions in Pig
- Introduction to Pig Latin
- Pig Latin Operators and Statements
- Executing Apache Pig Scripts
- Reading and Storing Pig Data and Operators
- Apache Pig Execution Modes and Mechanisms
- Pig Career Opportunities
- Pig vs Hive
- Best Pig Books
Flume
Discover Apache Flume
- Introduction to Flume
- Flume Environment Setup- Ubuntu
- Flume Architecture
- Flume Features & Limitations
- Use Cases of Flume
- Flume Source
- Flume Sink
- Flume Sink Processors
- Flume Channel Selectors
- Flume Channel
- Flume Event Serializers
- Flume Interceptors
- Flume Data Flow
- Flume Data Transfer to HDFS
- Flume Troubleshooting
- Best Flume Books
Sqoop
Break the ice with Sqoop
- Introduction to Sqoop
- Sqoop Environment Setup
- Sqoop Features
- Sqoop Architecture
- Importing Data from RDBMS to HDFS- Sqoop
- Exporting Data from HDFS to RDBMS- Sqoop
- Sqoop Eval- Commands and Query Evaluation
- Sqoop import-all-tables
- Sqoop Validation- Interfaces and Limitations
- Sqoop Codegen Arguments and Commands
- Combining Datasets with Sqoop Merge
- Sqoop Metastore Tool
- Sqoop Troubleshooting Tips & Known Issues
- Sqoop List Tables and their Arguments
- Sqoop List Databases and Syntax
- Creating and Executing Jobs in Sqoop
- Sqoop Connectors & Drivers (JDBC)
- Sqoop Import Mainframe Tool
- Databases Supported in Sqoop
- Sqoop + HCatalog Integration
- Sqoop vs Flume
- Best Sqoop Books
ZooKeeper
Dig Deeper into ZooKeeper
- Introduction to ZooKeeper
- ZooKeeper Features
- ZooKeeper Architecture
- ZooKeeper Workflow
- Terminologies of ZooKeeper
- ZooKeeper Applications
- Pros and Cons of ZooKeeper
- ZooKeeper Data Model
- ZooKeeper Znode
- Leader Election in ZooKeeper
- ZooKeeper CLI (Command Line Interface)
- ZooKeeper Access Control with ACLs
- ZooKeeper API- Java & C Bindings
- ZooKeeper Sessions
- ZooKeeper Queues- Priority and Producer-Consumer
- ZooKeeper Locks- Shared and Recoverable Shared
- ZooKeeper Watches, Features, and Guarantees
- ZooKeeper Barriers and Double Barriers
- Best ZooKeeper Books
Exploring the Ecosystem
Let’s take a look at some facts about Hadoop and the entire ecosystem.
Hadoop reached version 1.0 in December 2011, although its roots go back further: Doug Cutting and Mike Cafarella drew on the ideas in Google's paper “The Google File System”, published in October 2003. Hadoop is a collection of open-source software tools that let a network of many computers solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The Hadoop ecosystem is made up of layered components that work closely with one another. These include Avro, Ambari, Flume, HBase, HCatalog, HDFS, Hadoop, Hive, Impala, MapReduce, Pig, Sqoop, YARN, and ZooKeeper.