📅  Last modified: 2021-01-05 03:04:43             🧑  Author: Mango
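Apache Spark is a distributed, general-purpose processing system that can handle petabytes of data at a time. It is mainly used for streaming and processing data, with the work spread across clusters of up to thousands of virtual servers. Large organizations use Spark to process very large datasets. Apache Spark offers about 80 high-level operators, which makes applications faster to build. Its query optimizer, physical execution engine, and DAG scheduler give it high performance on both streaming and batch data. As a result, Spark has been reported to run some workloads up to 100 times faster than Hadoop MapReduce.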
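Apache Spark streams large datasets through Spark Streaming. Spark Streaming is an extension of the core Spark API that lets users process live data streams. It ingests data from different sources and processes it with complex algorithms. Finally, the processed data is pushed out to live dashboards, databases, and file systems.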
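
The ingest, process, and push flow described above can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark code: it assumes a fixed batch interval and groups timestamped events into small batches before processing them, which mirrors the micro-batch model that Spark Streaming uses. The names `micro_batches` and `process_batch`, the sample events, and the list used as a sink are all hypothetical.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into batches of `interval` seconds."""
    buckets = defaultdict(list)
    for ts, value in events:
        # Events with timestamps in [k*interval, (k+1)*interval) share batch k.
        buckets[int(ts // interval)].append(value)
    # Yield the batches in time order.
    for batch_id in sorted(buckets):
        yield batch_id, buckets[batch_id]


def process_batch(values):
    """Stand-in for the processing step: count the words in one batch."""
    return Counter(values)


# Hypothetical input stream: (arrival time in seconds, word).
events = [(0.2, "spark"), (0.7, "kafka"), (1.1, "spark"), (1.9, "spark"), (2.5, "kafka")]

sink = []  # stand-in for a dashboard, database, or file system
for batch_id, values in micro_batches(events, interval=1.0):
    sink.append((batch_id, dict(process_batch(values))))

print(sink)
```

Each entry in `sink` holds the result for one batch, so a result only becomes visible once its whole batch has been processed.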
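Kafka Streams is a client library for processing and analyzing data stored in Kafka. It lets users build applications and microservices, and it writes the output back to the Kafka cluster. It has no external dependencies on systems other than Kafka. It processes one record at a time.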
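
To make "one record at a time" concrete, here is a small plain-Python sketch. It is not the Kafka Streams API (which is a Java library); it only assumes a per-key counter held in local state and emits an updated result immediately after every record. The class `RunningCount`, the `topic` list, and the `output` list are hypothetical stand-ins for a processor, an input topic, and an output topic.

```python
from collections import defaultdict


class RunningCount:
    """Keeps a per-key count that is updated one record at a time."""

    def __init__(self):
        self.state = defaultdict(int)  # stand-in for a local state store

    def process(self, key):
        # Update the state for this single record and emit the result at once.
        self.state[key] += 1
        return key, self.state[key]


topic = ["spark", "kafka", "spark"]  # hypothetical input records
counter = RunningCount()

# Every record produces its own output; nothing waits for a batch to fill.
output = [counter.process(record) for record in topic]
print(output)
```

Because each record yields an output as soon as it is processed, downstream consumers see updates continuously rather than at batch boundaries.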
Parameter | Kafka Streams | Spark Streaming |
---|---|---|
Developers | Originally developed at LinkedIn and later donated to the Apache Software Foundation. | Originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. |
Infrastructure | It is a Java client library, so it can run wherever Java is supported. | It runs on top of the Spark stack, which can be Spark standalone, YARN, or container-based. |
Data Sources | It processes data from Kafka itself, via topics and streams. | It ingests data from files, Kafka, socket sources, and others. |
Processing Model | It processes events as they arrive, so it uses an event-at-a-time (continuous) processing model. | It uses a micro-batch processing model: it splits incoming streams into small batches for further processing. |
Latency | It has lower latency than Spark Streaming. | It has higher latency than Kafka Streams. |
ETL Transformation | It is not supported in Apache Kafka. | This transformation is supported in Spark. |
Fault-tolerance | Fault-tolerance is complex in Kafka. | Fault-tolerance is easy in Spark. |
Language Support | It mainly supports Java. | It supports multiple languages, such as Java, Scala, R, and Python. |
Use Cases | The New York Times, Zalando, Trivago, and others use Kafka Streams to store and distribute data. | Booking.com and Yelp use Spark Streaming; Yelp's ad platform uses it to handle millions of ad requests per day. |
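
The Processing Model and Latency rows are related, and a rough calculation shows why. The sketch below is a simplified model in plain Python, not a benchmark: it assumes a fixed per-event processing cost and a fixed batch interval, and it ignores queueing, scheduling, and network delays. The function names and all numbers (times in milliseconds) are hypothetical.

```python
def event_at_a_time_latency(cost_ms):
    """An event is processed as soon as it arrives, so it only pays the cost."""
    return cost_ms


def micro_batch_latency(arrival_ms, interval_ms, cost_ms):
    """An event waits until its batch closes and is then processed."""
    batch_end = (arrival_ms // interval_ms + 1) * interval_ms
    return (batch_end - arrival_ms) + cost_ms


arrivals = [100, 450, 900]  # hypothetical arrival times in ms
interval = 1000             # assumed batch interval in ms
cost = 5                    # assumed processing cost per event in ms

continuous = [event_at_a_time_latency(cost) for _ in arrivals]
batched = [micro_batch_latency(t, interval, cost) for t in arrivals]

print(continuous)
print(batched)
```

Under these assumptions an event that arrives early in a batch waits for almost the full interval, so the batch interval sets a floor on the worst-case latency of the micro-batch model.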