pd.read_csv - Python

pd.read_csv is a function in Python's Pandas library that reads data from a CSV file into a Pandas DataFrame.

Syntax
pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)
Parameters
  • filepath_or_buffer (str, path object or file-like object): Path of the file or the file object itself

  • sep (str): Delimiter to use. Default is ','.

  • delimiter (str, default None): Alias for sep; if left as None, sep is used.

  • header (int, list of int, default 'infer'): Row(s) to use as the column names. If 'infer' and skiprows is not specified, header is set to 0. If a list of integers is passed, the indexes will be combined into a MultiIndex.

  • names (array-like): List of column names to use. If the file contains no header row, then you should explicitly pass header=None.

  • index_col (int, str, sequence of int/str, or False, default None): Column(s) to use as the row index; passing a sequence produces a MultiIndex.

  • usecols (list-like or callable, optional): Return a subset of the columns. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.

  • squeeze (bool, default False): If the source data file has only one column, return a pandas Series instead of a DataFrame.

  • dtype (type name or dict of column -> type, optional): Data type for the whole DataFrame or for individual columns, e.g. {'a': np.float64, 'b': np.int32} (see the selective-reading sketch after this parameter list).

  • engine ({'c', 'python'}, optional): Parser engine to use. The C engine (the default) is faster, while the Python engine is more feature-complete; Pandas falls back to the Python engine, with a warning, when an option the C engine does not support (such as a multi-character sep) is requested.

  • true_values (list of str or None, optional): Values to consider as boolean True.

  • false_values (list of str or None, optional): Values to consider as boolean False.

  • skipinitialspace (bool, default False): Skip spaces after delimiter.

  • skiprows (list-like or integer, optional): Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

  • nrows (int, optional): Number of rows of file to read. Useful for reading pieces of large files.

  • na_values (scalar, str, list-like, or dict, optional): Additional strings to recognize as NA/NaN.

  • keep_default_na (bool, default True): If True, keep the default NaN values when parsing the input.

  • na_filter (bool, default True): Detect missing (i.e. NaN or None) values. Turn off if not necessary.

  • skip_blank_lines (bool, default True): Skip empty lines instead of interpreting as NaN values.

  • parse_dates (bool, list of int or names, list of lists, or dict, default False): Controls date parsing. If True, try to parse the index as dates. If a list of ints or names, parse each listed column as a date column. If a list of lists, combine the listed columns and parse them as a single date column. If a dict such as {'foo': [1, 3]}, combine columns 1 and 3, parse them as a date, and name the result 'foo'. For example, consider a file containing the following five lines (note that the fourth line is blank):

    01/01/2010,10000,1000
    02/01/2010,20000,2000
    03/01/2010,30000,3000
         
    05/01/2010,50000,5000
    

    The following code parses everything as expected:

    >>> import pandas as pd
    >>> df = pd.read_csv(file, header=None, names=['date', 'value1', 'value2'], parse_dates=['date'])
    >>> df
           date  value1  value2
    0 2010-01-01   10000    1000
    1 2010-02-01   20000    2000
    2 2010-03-01   30000    3000
    3 2010-05-01   50000    5000
    

    If the date is instead split across several columns (for example separate year, month and day columns), pass a dictionary whose keys are the names of the new combined columns and whose values are lists of the column numbers/names to combine, e.g. parse_dates={'date': [0, 1, 2]} (see the date-parsing sketch after this parameter list).

  • infer_datetime_format (bool, default False): If True and parse_dates is enabled, Pandas will attempt to infer the format of the datetime strings in the columns and, if a single format can be inferred, switch to a faster way of parsing them, which can speed up date parsing considerably.

  • keep_date_col (bool, default False): If True and parse_dates specifies combining multiple columns then keep the original columns.

  • date_parser (function, optional): Function to use for converting a sequence of string columns to an array of datetime instances.

  • dayfirst (bool, default False): If True, parse dates in dd/mm/yyyy format.

  • iterator (bool, default False): If True, return a TextFileReader object for iterating through the file or fetching chunks with get_chunk().

  • chunksize (int, optional): If set, return a TextFileReader object that yields the file in chunks of chunksize rows instead of a single DataFrame (see the chunked-reading sketch after this parameter list).

  • compression ({'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'): For on-the-fly decompression of on-disk data. If 'infer', then use gzip, bz2, zip, xz or None according to the file extension. If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.

  • thousands (str, optional): Thousands separator.

  • decimal (str, default '.'): Character to recognize as the decimal point (e.g. use ',' for European data); see the European-format sketch after this parameter list.

  • lineterminator (str, optional): Character used to break the file into lines; only valid with the C parser.

  • quotechar (str, default '"'): Character used to quote fields.

  • quoting (int or csv.QUOTE_* instance, default 0): Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

  • escapechar (str, optional): Character used to escape quotechar when quoting is enabled.

  • comment (str, optional): Character used to denote start of line comment.

  • encoding (str or None, optional): Text encoding to use when reading the file, e.g. 'utf-8' or 'latin-1'. Defaults to 'utf-8'.

  • dialect (str or csv.Dialect instance, optional): If provided, this parameter overrides the values of delimiter, doublequote, escapechar, skipinitialspace, quotechar and quoting.

  • error_bad_lines (bool, default True): Lines with too many fields (as defined by the delimiter) will by default cause an exception to be raised, and no DataFrame will be returned. If False, these "bad lines" will be dropped from the DataFrame that is returned (see the bad-lines sketch after this parameter list).

  • warn_bad_lines (bool, default True): If error_bad_lines is False and a line has more fields than expected, warn the user instead of raising a parsing exception.

  • delim_whitespace (bool, default False): Specify True to use whitespace (tab or space) as the delimiter.
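
The sketch below pulls together several of the column-selection parameters above (header, names, index_col, usecols, dtype, nrows and na_values). The file name 'sales.csv' and the column names are hypothetical placeholders used only for illustration.

import numpy as np
import pandas as pd

# Read a headerless file: supply our own column names, use the 'id' column as
# the index, keep only three of the columns, force 'revenue' to float64, read
# only the first 1000 rows, and treat 'N/A' and '-' as missing values.
df = pd.read_csv(
    "sales.csv",                                   # hypothetical file
    header=None,
    names=["id", "region", "units", "revenue"],
    index_col="id",
    usecols=["id", "units", "revenue"],
    dtype={"revenue": np.float64},
    nrows=1000,
    na_values=["N/A", "-"],
)
print(df.dtypes)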
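
The dictionary form of parse_dates is most useful when a date is split across several columns. A minimal sketch, assuming a hypothetical 'measurements.csv' whose first three columns hold the year, month and day:

import pandas as pd

# Combine columns 0, 1 and 2 into a single datetime column named 'date'
# and keep the original year/month/day columns as well.
df = pd.read_csv(
    "measurements.csv",                 # hypothetical file
    parse_dates={"date": [0, 1, 2]},
    keep_date_col=True,
)
print(df["date"].dtype)                 # datetime64[ns] if parsing succeeded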
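
chunksize (or iterator=True) returns a TextFileReader rather than a DataFrame, which lets you process files that do not fit in memory. A rough sketch, assuming a hypothetical 'big.csv' with a numeric 'value' column:

import pandas as pd

reader = pd.read_csv("big.csv", chunksize=100_000)    # hypothetical file
total = 0.0
for chunk in reader:          # each chunk is a DataFrame of up to 100_000 rows
    total += chunk["value"].sum()
print(total)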
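
compression, encoding, thousands and decimal mainly matter for files produced outside a UTF-8, English-locale environment. A sketch, assuming a hypothetical gzip-compressed, Latin-1 encoded file that uses semicolons as separators and European number formatting such as 1.234,56:

import pandas as pd

df = pd.read_csv(
    "prices.csv.gz",    # hypothetical file; compression='infer' detects the .gz suffix
    sep=";",
    encoding="latin-1",
    thousands=".",
    decimal=",",
)
print(df.head())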
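
error_bad_lines and warn_bad_lines control what happens when a row contains too many fields. They belong to the older signature shown above (in Pandas 1.3+ they are replaced by the single on_bad_lines parameter). A sketch, assuming a hypothetical, slightly malformed 'log.csv':

import pandas as pd

# Instead of raising a ParserError on rows with extra fields,
# drop those rows and emit a warning for each one.
df = pd.read_csv("log.csv", error_bad_lines=False, warn_bad_lines=True)
print(len(df))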

Returns

A DataFrame, or a Series when squeeze=True and the data has a single column, or a TextFileReader when iterator=True or chunksize is given.

Example
import pandas as pd

# Load data from CSV file
df = pd.read_csv("data.csv")

# Display the DataFrame
print(df.head())