Apache Spark is a distributed computing framework that enables fast and scalable processing of large-scale data. One of the core components of Spark is RDD (Resilient Distributed Dataset), which is a distributed collection of data that can be processed in parallel.
RDD can be created in multiple ways, including:
- Parallelizing an existing collection in your driver program: sc.parallelize(Seq(1, 2, 3))
- Referencing an external dataset, such as a file in HDFS, HBase, or any other data source supported by Hadoop: sc.textFile("hdfs://path/to/file.txt")
- Transforming an existing RDD with operations such as map, filter, or groupByKey: rdd.map(x => x * 2)
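As a minimal sketch of these three creation paths (not taken from the original article), the snippet below assumes a local Spark installation; the application name, master URL, and HDFS path are placeholders for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationExample {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration; a cluster job would use a different master URL.
    val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Parallelize an in-memory collection into an RDD.
    val fromCollection = sc.parallelize(Seq(1, 2, 3))

    // 2. Reference an external dataset (the HDFS path is a placeholder).
    val fromFile = sc.textFile("hdfs://path/to/file.txt")

    // 3. Derive a new RDD by transforming an existing one.
    val doubled = fromCollection.map(x => x * 2)

    println(doubled.collect().mkString(", ")) // prints: 2, 4, 6
    sc.stop()
  }
}
```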
RDD supports two types of operations: transformations and actions. Transformations are lazy: they do not compute a result immediately, but instead return a new RDD that records how it is derived. Examples of transformations include:
- map: applies a function to each element of the RDD and returns a new RDD.
- filter: selects the elements of the RDD that satisfy a condition and returns a new RDD.
- flatMap: applies a function to each element and flattens the results into a new RDD.
- groupByKey: groups elements that share the same key and returns a new RDD of (key, values) pairs.
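The sketch below, reusing the sc from the previous example, chains these transformations together; the sample sentences are made up for illustration, and nothing is computed until the final collect (an action) runs:

```scala
// Continues from the sc created above.
val lines = sc.parallelize(Seq("spark makes big data simple", "rdds are resilient"))

// Each step below is lazy: it only records a new RDD and its lineage.
val words     = lines.flatMap(_.split(" "))        // flatMap: one line -> many words
val longWords = words.filter(_.length > 4)         // filter: keep words longer than 4 chars
val byLength  = longWords.map(w => (w.length, w))  // map: build (key, value) pairs
val grouped   = byLength.groupByKey()              // groupByKey: (length, words of that length)

// Only this action forces the whole chain above to execute.
grouped.collect().foreach { case (len, ws) => println(s"$len -> ${ws.mkString(", ")}") }
```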
Actions are operations that trigger computation on the RDD and return a result. Examples of actions include:
- count: returns the number of elements in the RDD.
- collect: returns all the elements of the RDD to the driver program.
- reduce: applies a function to combine the elements of the RDD into a single value.
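A brief sketch of these actions, again assuming the same sc; the numeric range is arbitrary:

```scala
// Continues from the sc created above.
val nums = sc.parallelize(1 to 10)

val n     = nums.count()       // action: number of elements -> 10
val all   = nums.collect()     // action: pull every element back to the driver
val total = nums.reduce(_ + _) // action: combine elements into a single value -> 55

println(s"count = $n, sum = $total, elements = ${all.mkString(", ")}")
```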
An RDD can be cached using the cache method to persist it in memory across operations. This can improve performance by avoiding recomputation and reducing the I/O needed for subsequent operations on the RDD.
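A sketch of caching under the same assumptions; the log-file path and the "ERROR" filter are hypothetical:

```scala
// Continues from the sc created above; the path below is a placeholder.
val logs   = sc.textFile("hdfs://path/to/logs.txt")
val errors = logs.filter(_.contains("ERROR")).cache() // mark the filtered RDD for in-memory persistence

// The first action materialises and caches the RDD; later actions reuse the cached
// partitions instead of re-reading and re-filtering the source file.
println(errors.count())
println(errors.take(5).mkString("\n"))
```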
Spark RDD is a powerful abstraction that enables efficient and scalable processing of large-scale data. Its support for transformations and actions allows for flexible and expressive data processing workflows.