Hadoop Essentials
3 min read · Jun 1, 2021
Hadoop is a framework for distributed processing of large datasets across clusters of commodity computers.
Core Hadoop:
- HDFS: reliable shared storage
- MapReduce: distributed computation
Distributions that include Hadoop:
Cloudera, Hortonworks, MapR
HDFS (Hadoop Distributed File System):
- NameNode: manages the filesystem metadata and optimizes bandwidth by deciding how blocks are distributed across the DataNodes.
- DataNodes: store the data itself.
- Files are split into blocks that are distributed and replicated across the DataNodes (the replication factor is configurable; 3 by default).
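To make the block mechanics concrete, here is a minimal Python sketch of splitting a file into blocks and placing replicas. The function names and the round-robin placement policy are illustrative assumptions (the real NameNode uses rack-aware placement), but the 128 MB default block size and replication factor of 3 match HDFS defaults:

```python
# Illustrative sketch of HDFS-style block splitting and replica placement.
# Round-robin placement is a simplification; real HDFS is rack-aware.

BLOCK_SIZE = 128  # MB, the HDFS default block size
REPLICATION = 3   # the HDFS default replication factor

def split_into_blocks(file_size_mb, block_size=BLOCK_SIZE):
    """Split a file into full-size blocks; the last block may be smaller."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(300)  # a 300 MB file -> [128, 128, 44]
plan = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

If a DataNode holding one replica fails, the NameNode can re-replicate that block from the surviving copies, which is why the replication factor matters.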
YARN:
The cluster resource manager; it allocates cluster resources to the various processing frameworks (Spark, MapReduce, Impala, …) that run over the data stored in HDFS.
MapReduce:
Parallel processing on the DataNodes, with one map task per file block.
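The classic illustration is word count. Below is a pure-Python sketch of the map → shuffle → reduce phases; a real job would be written against the Hadoop MapReduce API (typically in Java), so the function names here are illustrative only:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

In a real cluster, each map task runs on the DataNode that holds its block (data locality), and the shuffle moves intermediate pairs across the network to the reducers.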
HUE:
Web interface for file management
Sqoop:
Transfers data between HDFS and relational (RDBMS) databases: import into HDFS and export back to the database.
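Conceptually, a Sqoop import reads table rows over JDBC and writes them out as delimited part files in HDFS. This toy Python sketch shows the shape of that transfer, with an in-memory SQLite table standing in for the RDBMS and a list of lines standing in for the HDFS output (the function name is hypothetical; real Sqoop is driven from the `sqoop import` command line):

```python
import sqlite3

# Toy stand-in for an RDBMS table (Sqoop would connect over JDBC).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])

def sqoop_like_import(conn, table, delimiter=","):
    """Read every row of `table` and render it as delimited text lines,
    analogous to the part files `sqoop import` writes into HDFS."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [delimiter.join(str(col) for col in row) for row in rows]

lines = sqoop_like_import(conn, "users")
# lines == ["1,alice", "2,bob"]
```

Real Sqoop parallelizes this by splitting the table on a key column, so several map tasks each import a slice of the rows.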
Flume:
Collection and aggregation of streaming data (logs in particular) into HDFS.