📜  Big Data Frameworks: Hadoop vs Spark vs Flink


Hadoop is an open-source Apache framework written in Java. One of the best-known big data tools, it provides distributed storage through its file system, HDFS (Hadoop Distributed File System), and distributed processing through the Map-Reduce programming model. Hadoop stores data and runs applications on clusters of commodity hardware, processing big data with a distributed computing model. It offers many valuable properties: low cost, fault tolerance, scalability, speed, data locality, high availability, and more. The Hadoop ecosystem is also very large, providing many other tools that work on top of Hadoop and extend its capabilities.
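As a concrete illustration of the Map-Reduce model described above, here is the classic word-count job written against Hadoop's standard Java MapReduce API. This is a minimal sketch: the class names and the input/output paths passed on the command line are illustrative, not taken from the article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the job reads the complete input from HDFS, runs the map and reduce phases to completion, and writes the result back to HDFS, which is exactly the batch-oriented model discussed below.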

Spark is an open-source processing engine designed to make analytics easier. It is a cluster-computing platform built to be fast and general-purpose, covering a wide range of workloads: batch applications, machine learning, streaming data processing, and interactive queries. Apache Spark offers features such as in-memory processing, and its powerful engine comes with tightly integrated components that make it efficient. Spark Streaming provides a high-level library for stream processing.
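To show what Spark's in-memory, RDD-based processing looks like in practice, here is a minimal sketch of the same word count using Spark's standard Java API. The application name, the `local[*]` master setting, and the input/output paths are illustrative assumptions.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // RDDs are evaluated lazily; cache() keeps this dataset in memory
    // so later actions reuse it instead of re-reading from disk.
    JavaRDD<String> lines = sc.textFile(args[0]).cache();

    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile(args[1]);
    sc.close();
  }
}
```

Compared with the Hadoop version, the whole pipeline is a short chain of transformations, and intermediate data can stay in memory rather than being written to HDFS between phases.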

Flink is also an open-source stream-processing framework, licensed under the Apache license. Apache Flink is used for distributed, high-performance data-streaming applications. It supports other kinds of processing as well, such as graph processing, batch processing, and the iterative processing used in machine learning, but it is best known for stream processing. At this point one might ask: all of this processing can also be done with Spark, so why do we need Flink? The answer is that Flink is considered a next-generation stream-processing engine, faster than both Spark and Hadoop. If Hadoop is 2G and Spark is 3G, then Flink is the 4G of big data processing. Flink also gives us applications with low latency and high throughput.
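For comparison, below is a minimal sketch of continuous stream processing with Flink's DataStream API (assuming the Flink 1.x Java API), counting words over 5-second processing-time windows. The socket source, the host/port, and the window size are illustrative assumptions.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Unbounded source: lines arriving on a local socket (e.g. `nc -lk 9999`).
    DataStream<Tuple2<String, Integer>> counts = env
        .socketTextStream("localhost", 9999)
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
              out.collect(new Tuple2<>(word, 1));
            }
          }
        })
        .keyBy(t -> t.f0)  // partition the stream by word
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .sum(1);           // per-window count for each word

    counts.print();
    env.execute("streaming word count"); // runs continuously until cancelled
  }
}
```

Unlike the batch examples above, this job never terminates on its own: records are processed one at a time as they arrive, which is the continuous operator-based model the table below refers to.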

The table below lists the differences between Hadoop, Spark, and Flink:

| Criteria | Apache Hadoop | Apache Spark | Apache Flink |
|---|---|---|---|
| Data Processing | Mainly designed for batch processing, at which it is very efficient on large datasets. | Supports batch processing as well as stream processing. | Supports both batch and stream processing, and provides a single runtime for the two. |
| Stream Engine | Takes the complete dataset as input at once and produces the output. | Processes data streams in micro-batches. | A true streaming engine that uses streams for every workload: streaming, micro-batch, SQL, and batch. |
| Data Flow | The data flow contains no loops; only linear data flow is supported. | Represents data flow as a directed acyclic graph (DAG). | Uses a controlled cyclic dependency graph at run time, which efficiently expresses ML algorithms. |
| Computation Model | MapReduce supports a batch-oriented model. | Supports a micro-batching computational model. | Supports a continuous, operator-based streaming model. |
| Performance | Slower than Spark and Flink. | Faster than Hadoop, slower than Flink. | The highest of the three. |
| Memory Management | Configurable; supports both dynamic and static memory management. | Recent releases of Spark provide automatic memory management. | Supports automatic memory management. |
| Fault Tolerance | Highly fault-tolerant via its replication mechanism. | Spark RDDs provide fault tolerance through lineage. | Based on Chandy-Lamport distributed snapshots, which allows high throughput. |
| Scalability | Highly scalable; can scale up to tens of thousands of nodes. | Highly scalable. | Also highly scalable. |
| Iterative Processing | Does not support iterative processing. | Supports iterative processing. | Supports iterative processing, iterating over data natively through its streaming architecture. |
| Supported Languages | Java, C, C++, Python, Perl, Groovy, Ruby, etc. | Java, Python, R, Scala. | Java, Python, R, Scala. |
| Cost | Uses commodity hardware, which is less expensive. | Needs a lot of RAM, so the cost is relatively high. | Also needs a lot of RAM, so the cost is relatively high. |
| Abstraction | No abstraction in MapReduce. | RDD abstraction. | Dataset abstraction for batch and DataStream abstraction for streaming. |
| SQL Support | Users can run SQL queries using Apache Hive. | Users can run SQL queries using Spark SQL; Hive is also supported. | Provides the Table API, which is similar to SQL expressions; the Apache foundation is planning to add a SQL interface in a future release. |
| Caching | MapReduce cannot cache data. | Can cache data in memory. | Can also cache data in memory. |
| Hardware Requirements | Runs well on less expensive commodity hardware. | Needs higher-end hardware. | Also needs higher-end hardware. |
| Machine Learning | Apache Mahout is used for ML. | Very powerful for implementing ML algorithms, with its own ML libraries. | The FlinkML library is used for ML. |
| Lines of Code | Hadoop 2.0 has about 120,000 lines of code. | Developed in about 20,000 lines of code. | Developed in Scala and Java, so its line count is lower than Hadoop's. |
| High Availability | Configurable in high-availability mode. | Configurable in high-availability mode. | Configurable in high-availability mode. |
| Amazon S3 Connector | Provides support for an Amazon S3 connector. | Provides support for an Amazon S3 connector. | Provides support for an Amazon S3 connector. |
| Backpressure Handling | Handles backpressure through manual configuration. | Also handles backpressure through manual configuration. | Handles backpressure implicitly through its system architecture. |
| Windowing Criteria | No windowing criteria, since it does not support streaming. | Time-based window criteria. | Record-based window criteria. |
| License | Apache License 2.0. | Apache License 2.0. | Apache License 2.0. |