Apache Spark is a distributed computing framework that enables fast and scalable processing of large-scale data. One of the core components of Spark is RDD (Resilient Distributed Dataset), which is a distributed collection of data that can be processed in parallel.
RDD can be created in multiple ways, including:
- Parallelizing an existing collection in your driver program: sc.parallelize(Seq(1, 2, 3))
- Referencing an external dataset, such as a file in HDFS, HBase, or any other data source supported by Hadoop: sc.textFile("hdfs://path/to/file.txt")
- Transforming an existing RDD with operations such as map, filter, or groupByKey: rdd.map(x => x * 2)
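As a minimal sketch of these three creation paths (not taken from the original article), the snippet below assumes a local Spark installation; the application name, master URL, and HDFS path are placeholders for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationExample {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration; a cluster job would use a different master URL.
    val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Parallelize an in-memory collection into an RDD.
    val fromCollection = sc.parallelize(Seq(1, 2, 3))

    // 2. Reference an external dataset (the HDFS path is a placeholder).
    val fromFile = sc.textFile("hdfs://path/to/file.txt")

    // 3. Derive a new RDD by transforming an existing one.
    val doubled = fromCollection.map(x => x * 2)

    println(doubled.collect().mkString(", ")) // prints: 2, 4, 6
    sc.stop()
  }
}
```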
RDD supports two types of operations: transformations and actions. Transformations are lazy: they do not compute a result immediately, but instead return a new RDD that records how it is derived. Examples of transformations include:
- map: applies a function to each element of the RDD and returns a new RDD.
- filter: selects the elements of the RDD that satisfy a condition and returns a new RDD.
- flatMap: applies a function to each element and flattens the results into a new RDD.
- groupByKey: groups elements that share the same key and returns a new RDD of (key, values) pairs.
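The sketch below, reusing the sc from the previous example, chains these transformations together; the sample sentences are made up for illustration, and nothing is computed until the final collect (an action) runs:

```scala
// Continues from the sc created above.
val lines = sc.parallelize(Seq("spark makes big data simple", "rdds are resilient"))

// Each step below is lazy: it only records a new RDD and its lineage.
val words     = lines.flatMap(_.split(" "))        // flatMap: one line -> many words
val longWords = words.filter(_.length > 4)         // filter: keep words longer than 4 chars
val byLength  = longWords.map(w => (w.length, w))  // map: build (key, value) pairs
val grouped   = byLength.groupByKey()              // groupByKey: (length, words of that length)

// Only this action forces the whole chain above to execute.
grouped.collect().foreach { case (len, ws) => println(s"$len -> ${ws.mkString(", ")}") }
```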
Actions are operations that trigger computation on the RDD and return a result. Examples of actions include:
- count: returns the number of elements in the RDD.
- collect: returns all the elements of the RDD to the driver program.
- reduce: applies a function to combine the elements of the RDD into a single value.
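A brief sketch of these actions, again assuming the same sc; the numeric range is arbitrary:

```scala
// Continues from the sc created above.
val nums = sc.parallelize(1 to 10)

val n     = nums.count()       // action: number of elements -> 10
val all   = nums.collect()     // action: pull every element back to the driver
val total = nums.reduce(_ + _) // action: combine elements into a single value -> 55

println(s"count = $n, sum = $total, elements = ${all.mkString(", ")}")
```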
An RDD can be cached using the cache method to persist it in memory across operations. This can improve performance by avoiding recomputation and reducing the I/O needed for subsequent operations on the RDD.
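A sketch of caching under the same assumptions; the log-file path and the "ERROR" filter are hypothetical:

```scala
// Continues from the sc created above; the path below is a placeholder.
val logs   = sc.textFile("hdfs://path/to/logs.txt")
val errors = logs.filter(_.contains("ERROR")).cache() // mark the filtered RDD for in-memory persistence

// The first action materialises and caches the RDD; later actions reuse the cached
// partitions instead of re-reading and re-filtering the source file.
println(errors.count())
println(errors.take(5).mkString("\n"))
```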
Spark RDD is a powerful abstraction that enables efficient and scalable processing of large-scale data. Its support for transformations and actions allows for flexible and expressive data processing workflows.