📜  Difference Between Hadoop and Apache Spark

📅  Last modified: 2021-09-12 10:34:25             🧑  Author: Mango

Hadoop: Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and for processing big data with the MapReduce programming model.
Hadoop is built in Java and is accessible from many programming languages; MapReduce code can be written, for example in Python, through a Thrift client. It is available open-source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or HortonWorks.
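To make the MapReduce programming model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) as a word count, written in plain Python. This is conceptual only: a real Hadoop job runs these phases in parallel across a cluster (e.g. via Hadoop Streaming, with separate mapper and reducer scripts), and the function names here are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in a line of input.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: combine all counts emitted for one word.
    return (key, sum(values))

lines = ["spark is fast", "hadoop is reliable", "spark is in memory"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["is"])     # "is" appears once per line -> 3
print(counts["spark"])  # -> 2
```

Note that between each phase the framework writes intermediate data to disk, which is the main source of the latency discussed later in this article.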

Apache Spark: Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark is built around Spark Core, the engine that drives scheduling, optimizations, and the RDD abstraction, and that connects Spark to the appropriate storage system (HDFS, S3, an RDBMS, or Elasticsearch). Several libraries run on top of Spark Core, including Spark SQL, which lets you run SQL-like queries on distributed datasets; MLlib for machine learning; GraphX for graph problems; and Spark Streaming, which accepts continuously streaming log data as input.
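The RDD abstraction mentioned above is a chain of lazy transformations that only execute when an action is called. The sketch below mimics that API shape with plain Python generators; `MiniRDD` is a made-up illustrative class, not Spark's API, and a real RDD would be partitioned across the cluster's memory rather than held in one process.

```python
class MiniRDD:
    """A toy stand-in for Spark's RDD, illustrating lazy transformation chains."""

    def __init__(self, data):
        self._data = data  # an iterable; evaluation is deferred

    def map(self, fn):
        # Transformation: builds a new lazy pipeline stage, runs nothing yet.
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # Transformation: also lazy.
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces evaluation of the whole pipeline.
        return list(self._data)

rdd = MiniRDD(range(10))
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
print(result)  # -> [0, 4, 16, 36, 64]
```

Because transformations are deferred and intermediate results can stay in memory, Spark avoids the per-stage disk writes of MapReduce, which is the basis of the speed comparison in the table below.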

[Image: Hadoop vs Apache Spark]

The following table lists the differences between Hadoop and Apache Spark:

| Features | Hadoop | Apache Spark |
| --- | --- | --- |
| Data processing | Provides batch processing | Provides both batch processing and stream processing |
| Memory usage | Disk-bound; intermediate data is read from and written to disk | Uses large amounts of RAM to keep data in memory |
| Security | More mature security features | Its security is still in its infancy |
| Fault tolerance | Uses data replication for fault tolerance | Uses RDDs and various data-storage models for fault tolerance |
| Graph processing | Algorithms such as PageRank are implemented on top of MapReduce | Ships with a graph-computation library called GraphX |
| Ease of use | Difficult to use | Easier to use |
| Real-time data processing | Falls short for real-time data processing | Can process real-time data |
| Speed | The MapReduce model reads from and writes to disk, which slows processing | Reduces the number of disk read/write cycles by storing intermediate data in memory, so processing is faster |
| Latency | A high-latency computing framework | A low-latency framework that can process data interactively |