Hadoop Essentials

Serigne DIAW
Jun 1, 2021

Hadoop is a framework for distributed processing of large datasets across clusters of commodity computers.

Core Hadoop:

  • HDFS: reliable shared storage
  • MapReduce: distributed computation

Distributions that include Hadoop:

Cloudera, Hortonworks, MapR

HDFS (Hadoop Distributed File System):

  • NameNode: manages the file-system metadata and optimizes bandwidth by deciding how blocks are distributed across the DataNodes.
  • DataNodes: store the data blocks themselves.
  • Files are split into blocks that are distributed and replicated across the DataNodes (the replication factor is configurable, 3 by default; see the sketch after this list).
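To make the division of labor concrete, here is a minimal write sketch using the HDFS Java API, assuming a running cluster whose configuration (core-site.xml, hdfs-site.xml) is on the classpath; the path, content, and per-file replication factor are illustrative choices, not values from the article.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle on the distributed file system

        Path file = new Path("/data/demo/hello.txt");  // hypothetical HDFS path
        short replication = 3;                         // replication factor for this file

        // The NameNode records the metadata and chooses the target DataNodes;
        // the client then streams the blocks directly to those DataNodes.
        try (OutputStream out = fs.create(file, replication)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}
```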

YARN (Yet Another Resource Negotiator):

The cluster resource manager: it allocates cluster resources and manages the interface between HDFS and the various processing frameworks (Spark, MapReduce, Impala, …).
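As a minimal illustration, the sketch below uses the YARN client API to ask the ResourceManager for the applications it is currently tracking; it assumes yarn-site.xml is on the classpath and a ResourceManager is reachable.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnListSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for every application it is tracking.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s\t%s\t%s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```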

MapReduce:

Parallel processing on the DataNodes, with one map task per file block, so the computation moves to where the data is stored.
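The canonical example is word count, sketched below: each map task emits (word, 1) pairs for the block it processes, a combiner pre-aggregates on the map side, and the reducers sum the counts per word. Input and output paths are taken from the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the block this task processes.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts collected for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```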

HUE (Hadoop User Experience):

Web interface for browsing and managing files on HDFS and for working with other services in the ecosystem.

Sqoop:

Bridge between HDFS and RDBMS databases (imports to HDFS and exports back to the database).
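A minimal import sketch, assuming Sqoop 1.x is on the classpath and a MySQL source is reachable; the connection string, credentials, table, and HDFS target directory are hypothetical placeholders. The same arguments could equally be passed to the sqoop command-line tool.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db-host:3306/sales",  // hypothetical source database
            "--username", "etl_user",                        // hypothetical credentials
            "--password-file", "/user/etl/.db_password",     // keeps the password off the command line
            "--table", "orders",                             // hypothetical table to import
            "--target-dir", "/data/raw/orders",              // HDFS destination directory
            "--num-mappers", "4"                             // parallel import tasks
        };
        // Sqoop translates this into a MapReduce job that reads the table in parallel.
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```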

Flume:

Ingestion of streaming data (typically log events) from many sources into HDFS.