📜  Hadoop – HDFS (Hadoop Distributed File System)

📅  Last modified: 2023-12-03 14:41:40.996000             🧑  Author: Mango


The Hadoop Distributed File System (HDFS) is a distributed file system that provides high-throughput access to data and is designed to scale from a single server to thousands of machines. It is one of the core components of the Hadoop ecosystem and is primarily used to store large datasets that are then processed by frameworks such as MapReduce or Spark.

Architecture

HDFS follows a master/slave architecture with two main types of components: the NameNode and DataNodes.

  • NameNode: This is the master node that manages the file system namespace and regulates access to files by clients. It stores the metadata about every file and directory in the file system, including the location of their blocks.
  • DataNode: These are the slave nodes that store the actual data blocks that make up the files. They serve read and write requests from clients and report the blocks they hold back to the NameNode.

File data is divided into fixed-size blocks (128 MB by default) that are distributed across multiple DataNodes in the cluster. The NameNode maintains the metadata about the blocks and their locations. This design lets HDFS provide fault tolerance by replicating each block across multiple DataNodes (three copies by default).
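As a toy illustration (not the actual HDFS code), the split-and-replicate idea can be sketched in a few lines of Python. The block size and replication factor mirror the HDFS defaults, and the placement policy here is a simplified round-robin rather than the real rack-aware policy:

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Sizes are in "MB" for readability; the round-robin placement below is a
# simplification of HDFS's rack-aware replica placement policy.

def split_into_blocks(file_size, block_size=128):
    """Return the sizes of the fixed-size blocks a file is split into."""
    blocks = [block_size] * (file_size // block_size)
    if file_size % block_size:
        blocks.append(file_size % block_size)  # last block may be smaller
    return blocks

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(300)                      # a 300 MB file
print(blocks)                                        # [128, 128, 44]
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Note that only the last block is allowed to be smaller than the block size, which is why a 300 MB file occupies three blocks rather than being padded.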

Features

HDFS provides the following features:

  • Scalability: HDFS is designed to scale horizontally and can support thousands of nodes in a single cluster.
  • Fault tolerance: HDFS replicates data blocks across multiple DataNodes, providing fault tolerance in case of hardware or software failures.
  • High throughput: HDFS is optimized for sequential read/write operations and can handle large datasets with high throughput.
  • Streaming access: HDFS provides streaming access to data, allowing data to be processed in parallel by MapReduce or similar frameworks.
  • Data locality: HDFS tries to keep the data close to the processing nodes, reducing network congestion and improving performance.
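The data-locality idea from the list above can be sketched as a scheduling decision: given the DataNodes that hold a block's replicas, prefer running the task on one of them. This is a minimal illustration of the concept, not the actual YARN scheduler; the node names are hypothetical:

```python
# Toy sketch of data-locality-aware task scheduling: prefer a free node
# that already stores a replica of the block (node-local execution);
# otherwise fall back to any free node, which requires a network transfer.

def pick_node(replica_nodes, free_nodes):
    """Return (chosen_node, locality) for a task reading one block."""
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"   # no data movement needed
    return free_nodes[0], "remote"      # block must cross the network

node, locality = pick_node(replica_nodes={"dn2", "dn5"},
                           free_nodes=["dn1", "dn5", "dn7"])
print(node, locality)   # dn5 node-local
```

When no replica holder is free, a real scheduler would also consider rack-local placement before falling back to a remote node; that middle tier is omitted here for brevity.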

Usage

HDFS is typically used in conjunction with other components of the Hadoop ecosystem, such as MapReduce, YARN, or Spark. It can be accessed through a variety of interfaces, including the Hadoop File System Shell (`hdfs dfs`), the Hadoop Java API, and the WebHDFS REST API.

Here is an example of how to use the Hadoop File System Shell to interact with HDFS:

# Create a directory in HDFS
hdfs dfs -mkdir /mydir

# Upload a file to HDFS
hdfs dfs -put myfile /mydir

# List contents of a directory in HDFS
hdfs dfs -ls /mydir
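The same three operations can also be driven over the WebHDFS REST API. The sketch below only builds the request URLs; the hostname is a placeholder, 9870 is the default NameNode HTTP port in Hadoop 3, and no request is actually sent:

```python
# Sketch: constructing WebHDFS v1 URLs for the shell operations above.
# MKDIRS and CREATE are issued as HTTP PUT, LISTSTATUS as HTTP GET.
# "namenode.example.com" is a hypothetical host.

def webhdfs_url(path, op, host="namenode.example.com", port=9870):
    """Return the WebHDFS v1 URL for an operation on an HDFS path."""
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}"

print(webhdfs_url("/mydir", "MKDIRS"))         # PUT -> create a directory
print(webhdfs_url("/mydir/myfile", "CREATE"))  # PUT -> upload a file
print(webhdfs_url("/mydir", "LISTSTATUS"))     # GET -> list a directory
```

This is convenient when a client machine has no Hadoop installation, since any HTTP library can talk to WebHDFS directly.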

Conclusion

HDFS is a distributed file system that provides scalable, fault-tolerant storage for large datasets. It is a critical component of the Hadoop ecosystem and is widely used in big data processing applications. By leveraging its block replication, high throughput, and data locality, developers can build robust, scalable applications that handle large volumes of data.