📜  Apache Spark RDD Persistence

📅  Last modified: 2020-12-27 02:38:14             🧑  Author: Mango

RDD Persistence

Spark provides a convenient way to work with a dataset by persisting it in memory across operations. When an RDD is persisted, each node stores any partitions of it that it computes in memory, and those partitions can then be reused in other actions on that dataset.

We can mark an RDD to be persisted using the persist() or cache() method. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.
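
As a minimal sketch of how this looks in practice (assuming a spark-shell session where the SparkContext sc is already available, and a hypothetical input file data.txt):

```scala
// Build an RDD and mark it for caching at the default storage level.
val lines = sc.textFile("data.txt")        // hypothetical input path
val lineLengths = lines.map(_.length)

// cache() only marks the RDD; nothing is stored until an action runs.
lineLengths.cache()

// The first action computes the partitions and stores them in memory.
val total = lineLengths.reduce(_ + _)

// Later actions reuse the cached partitions instead of re-reading
// and re-mapping the file; lost partitions are recomputed as needed.
val longest = lineLengths.max()
```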

A persisted RDD can be stored using different storage levels. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method, by contrast, always uses the default storage level, StorageLevel.MEMORY_ONLY; an example of passing an explicit level follows the table below.

The available storage levels are:

| Storage Level | Description |
| --- | --- |
| MEMORY_ONLY | Stores the RDD as deserialized Java objects in the JVM. This is the default level. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed each time they are needed. |
| MEMORY_AND_DISK | Stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that do not fit are stored on disk and read from there when needed. |
| MEMORY_ONLY_SER (Java and Scala) | Stores the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects. |
| MEMORY_AND_DISK_SER (Java and Scala) | Similar to MEMORY_ONLY_SER, but spills partitions that do not fit in memory to disk instead of recomputing them. |
| DISK_ONLY | Stores the RDD partitions only on disk. |
| MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the levels above, but each partition is replicated on two cluster nodes. |
| OFF_HEAP (experimental) | Similar to MEMORY_ONLY_SER, but stores the data in off-heap memory. Off-heap memory must be enabled. |
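
To request one of these levels explicitly, pass the corresponding StorageLevel constant to persist(). A minimal sketch, again assuming an existing SparkContext sc and a hypothetical input path:

```scala
import org.apache.spark.storage.StorageLevel

val events = sc.textFile("events.txt")     // hypothetical input path

// MEMORY_AND_DISK: partitions that do not fit in memory are spilled
// to disk and read back from there, rather than being recomputed.
events.persist(StorageLevel.MEMORY_AND_DISK)

println(events.count())   // first action computes and persists the RDD
println(events.count())   // second action is served from the cache

// Release the cached data once it is no longer needed.
events.unpersist()
```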