Big data brings with it two fundamental challenges: how to store and work with voluminous data sizes, and more important, how to understand data and turn it into a competitive advantage.
Hadoop is a distributed filesystem, and it offers a way to parallellize and execute progams on a cluster of machines.
Figure 1.3 – Topography
The HDFS namenode keeps in memory the metadata about the filesystem such as which datanodes manage the blocks for each file.
HDFS clients talk to the namenode for metadata-related activities and DataNodes for reading and writing files.
DataNodes communicate with each other for pipelining file reads and writes.
Files are mede up of blocks, and each file can be replicated multiple times, meaning there are many identical copies of each block for the file (default = 3).