The `spark.read.load` function is a generic method in Spark for reading data from any supported source and returning a DataFrame. It is available in Spark's Python API (PySpark).
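All of the examples below assume a running SparkSession bound to the name `spark`. As a minimal sketch of how to obtain one (the application name here is arbitrary):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; `spark` is the entry point
# used by all the read examples in this article.
spark = SparkSession.builder.appName("read-load-demo").getOrCreate()
```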
The syntax for `spark.read.load` is as follows:
```python
spark.read.load(
    path: Optional[Union[str, List[str]]] = None,
    format: Optional[str] = None,
    schema: Optional[Union[StructType, str]] = None,
    **options: str
) -> DataFrame
```
The parameters for this function are:

- `path`: The path to the data source directory or file (a list of paths is also accepted).
- `format`: The format of the data source, e.g. `csv`, `json`, or `parquet`. If not specified, Spark uses the default data source configured by `spark.sql.sources.default`, which is Parquet unless changed.
- `schema`: The optional schema of the data to be read. It can be specified as a `StructType` object or a DDL-formatted string; both forms are shown in the sketch after this list.
- `options`: Additional options as key-value pairs that customize the behavior of the data source.

The `spark.read.load` function returns a DataFrame object, which represents the structured data.
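As a minimal sketch of the two equivalent ways to pass a schema, assuming a hypothetical file `people.csv` with `name` and `age` columns:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema as a StructType object
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.load("people.csv", format="csv", schema=schema, header="true")

# The same schema expressed as a DDL-formatted string
df = spark.read.load("people.csv", format="csv", schema="name STRING, age INT", header="true")
```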
```python
df = spark.read.load('data.csv', format='csv', header='true', inferSchema='true')
df.show()
```
In this example, `spark.read.load` reads a CSV file named `data.csv`, with the format explicitly set to `'csv'`. The `header` and `inferSchema` options tell Spark that the first line contains column names and that column types should be inferred from the data. The resulting DataFrame is then displayed.
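The same read can also be expressed in the equivalent builder style, chaining `DataFrameReader` methods before the final `load` call:

```python
# Equivalent builder-style form of the CSV read above
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data.csv")
)
df.show()
```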
```python
df = spark.read.load('data.parquet', format='parquet')
df.show()
```
In this example, `spark.read.load` reads a Parquet file named `data.parquet`, with the format explicitly set to `'parquet'`. Since no additional options are provided, Spark uses its default options for Parquet.
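Because Parquet is Spark's default data source (controlled by the `spark.sql.sources.default` configuration), the `format` argument can even be omitted here:

```python
# Parquet is the default data source, so no format argument is needed
df = spark.read.load('data.parquet')
df.show()
```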
The `spark.read.load` function is a versatile entry point for reading data in Spark. It handles many formats through a single interface and allows customizing behavior with per-source options. The returned DataFrame can be further manipulated and analyzed using Spark's DataFrame API.
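For instance, a sketch of follow-up analysis on a loaded DataFrame, assuming hypothetical `country` and `amount` columns:

```python
from pyspark.sql import functions as F

# Filter, aggregate, and sort the loaded DataFrame
result = (
    df.filter(F.col("amount") > 0)
      .groupBy("country")
      .agg(F.sum("amount").alias("total_amount"))
      .orderBy(F.desc("total_amount"))
)
result.show()
```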