📜  spark.read.load - Python

📅  Last modified: 2023-12-03 14:47:31.546000             🧑  Author: Mango

spark.read.load

The spark.read.load function reads data from a supported data source and returns a DataFrame. In the Python API, spark.read is a DataFrameReader, and load is its generic method for loading data in any supported format (the format-specific readers such as csv, json, and parquet build on it).
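
The examples in this article assume that spark refers to an active SparkSession. In a standalone script it can be created (or retrieved) roughly as follows; the application name is arbitrary:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; spark.read is its DataFrameReader entry point.
spark = SparkSession.builder.appName('read-load-example').getOrCreate()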

Syntax

The syntax for spark.read.load is as follows:

spark.read.load(
    path: Optional[Union[str, List[str]]] = None,
    format: Optional[str] = None,
    schema: Optional[Union[str, StructType]] = None,
    **options
) -> DataFrame

The parameters for this function are:

  • path: The path to the data source file or directory. It can also be a list of paths, and it is optional; some sources take their location or connection information from options instead.
  • format: The format of the data source (for example 'csv', 'json', or 'parquet'). If not specified, Spark uses the default data source configured by spark.sql.sources.default, which is 'parquet' unless changed.
  • schema: The optional schema of the data to be read. It can be specified as a StructType object or a DDL-formatted string.
  • options: Additional options as key-value pairs that customize the behavior of the data source (see the sketch after this list).
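
As a minimal sketch of how a DDL schema string and extra options are passed together (the file path, delimiter, and column names below are hypothetical):

# Hypothetical semicolon-delimited CSV; schema given as a DDL string,
# extra options passed as keyword arguments.
df = spark.read.load(
    'events.csv',
    format='csv',
    schema='id INT, name STRING, ts TIMESTAMP',
    sep=';',
    header='true'
)
df.printSchema()
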
Return Value

The spark.read.load function returns a DataFrame, Spark's distributed, table-like representation of structured data, which can then be queried with the DataFrame API or Spark SQL.

Examples
Example 1: Reading CSV data
df = spark.read.load('data.csv', format='csv', header='true', inferSchema='true')
df.show()

In this example, spark.read.load reads a CSV file named 'data.csv'. The format is explicitly set to 'csv', and the header and inferSchema options tell Spark that the first line contains column names and that column types should be inferred from the data. The resulting DataFrame is then displayed.

Example 2: Reading Parquet data
df = spark.read.load('data.parquet', format='parquet')
df.show()

In this example, spark.read.load reads a Parquet file named 'data.parquet' with the format explicitly set to 'parquet'. Because Parquet files embed their own schema, no schema or extra options are needed; in fact, since Parquet is Spark's default data source, the format argument could even be omitted here.
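
The schema parameter can also be supplied explicitly instead of being inferred or read from the file. A brief sketch, assuming a hypothetical JSON file with symbol and price fields:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# An explicit schema avoids schema inference and makes the column types predictable.
schema = StructType([
    StructField('symbol', StringType(), True),
    StructField('price', DoubleType(), True),
])

df = spark.read.load('quotes.json', format='json', schema=schema)
df.show()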

Conclusion

The spark.read.load function is a versatile way to read data from many different sources in Spark. Its format, schema, and options arguments let you load different formats and fine-tune how they are read, and the returned DataFrame can be further transformed and analyzed with Spark's DataFrame and SQL APIs.
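
For example, the DataFrame returned by spark.read.load can be transformed and aggregated directly; the sketch below assumes the hypothetical CSV from Example 1 contains country and amount columns:

from pyspark.sql import functions as F

df = spark.read.load('data.csv', format='csv', header='true', inferSchema='true')

# Aggregate the loaded data with the DataFrame API.
summary = df.groupBy('country').agg(F.sum('amount').alias('total_amount'))
summary.show()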