Drop rows from a Pandas DataFrame with missing values or NaN in columns
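Pandas provides a variety of data structures and operations for manipulating numerical data and time series. In practice, however, some data may be missing. In Pandas, missing data is represented by two values:
- None: None is a Python singleton object that is often used for missing data in Python code.
- NaN: NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To remove null values from a DataFrame, we use the dropna() function, which drops the rows or columns that contain null values in several different ways.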
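As a minimal sketch of that interchangeability, the snippet below uses isna() to flag both None and NaN. The sample values and the variable names (scores, names) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one None and one np.nan in the same column.
scores = pd.Series([100.0, None, np.nan, 95.0])

# In a numeric column, None is converted to NaN when the Series is built.
missing_mask = scores.isna()
print(scores.dtype)
print(missing_mask.tolist())

# In an object column None stays None, but isna() still flags it as missing.
names = pd.Series(["Ann", None, np.nan, "Bob"], dtype=object)
print(names.isna().tolist())
```

Both calls report the second and third entries as missing, which is why dropna() removes them regardless of which marker was used.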
Syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:
axis: Determines whether rows or columns are dropped. Accepts 0 or 'index' to drop rows, and 1 or 'columns' to drop columns.
how: Accepts one of two strings, 'any' or 'all'. 'any' drops the row/column if any value is null; 'all' drops it only if all values are null.
thresh: An integer giving the minimum number of non-null values a row/column must contain to be kept; rows/columns with fewer non-null values are dropped. In recent pandas versions, thresh cannot be combined with how.
subset: A label or list of labels along the other axis to consider. For example, when dropping rows, it is the list of columns to check for null values.
inplace: A boolean. If True, the data frame itself is modified and None is returned instead of a new data frame.
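The subset and thresh parameters are not used in the examples that follow, so here is a short sketch of both. It reuses the score data from Code #1. The variable names (scores, by_column, by_thresh) are chosen only for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, 40, 80, 98],
    "Fourth Score": [np.nan, np.nan, np.nan, 65],
})

# subset: only "First Score" is checked when deciding which rows to drop.
by_column = scores.dropna(subset=["First Score"])
print(by_column.index.tolist())  # [0, 1, 3]

# thresh: keep only the rows that have at least 3 non-null values.
by_thresh = scores.dropna(thresh=3)
print(by_thresh.index.tolist())  # [0, 3]
```

Using subset is the usual way to drop rows based on missing values in one particular column while ignoring gaps elsewhere.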
Code #1: Dropping rows with at least 1 null value.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
# (the variable is named scores so it does not shadow the built-in dict)
df = pd.DataFrame(scores)
df
Now we drop the rows that have at least one NaN (null) value.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
# (the variable is named scores so it does not shadow the built-in dict)
df = pd.DataFrame(scores)
# using dropna() function
df.dropna()
Output:
   First Score  Second Score  Third Score  Fourth Score
3         95.0          56.0           98          65.0
Code #2: Dropping rows if all values in that row are missing.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
# (the variable is named scores so it does not shadow the built-in dict)
df = pd.DataFrame(scores)
df
Now we drop the rows in which every value is missing (NaN).
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
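scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
# (the variable is named scores so it does not shadow the built-in dict)
df = pd.DataFrame(scores)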
# using dropna() function
df.dropna(how = 'all')
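Output:
   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0           NaN
2          NaN          45.0         80.0           NaN
3         95.0          56.0         98.0          65.0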
Code #3: Dropping columns with at least 1 null value.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}
# creating a dataframe from dictionary
# (the variable is named scores so it does not shadow the built-in dict)
df = pd.DataFrame(scores)
df
Now we drop the columns that have at least 1 missing value.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}
# creating a dataframe from dictionary
# (the variable is named scores so it does not shadow the built-in dict)
df = pd.DataFrame(scores)
# using dropna() function
df.dropna(axis = 1)
Output:
   Fourth Score
0            60
1            67
2            68
3            65
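The how parameter applies to columns as well. As a small sketch on made-up data, the snippet below contrasts the default how='any' with how='all' when axis is set to 'columns'. The frame and the names any_null and all_null are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column is partly empty, one is entirely empty.
df = pd.DataFrame({
    "First Score": [100, np.nan, 95],
    "Second Score": [np.nan, np.nan, np.nan],
    "Third Score": [52, 40, 98],
})

# Default how='any': drop every column that contains at least one null.
any_null = df.dropna(axis="columns")
print(any_null.columns.tolist())  # ['Third Score']

# how='all': drop only the columns in which every value is null.
all_null = df.dropna(axis="columns", how="all")
print(all_null.columns.tolist())  # ['First Score', 'Third Score']
```

The how='all' form is often the safer choice for columns, since a single missing entry would otherwise discard an entire column of usable data.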
Code #4: Dropping rows with at least 1 null value in a CSV file.
Note: This example reads its data from a CSV file named employees.csv.
# importing pandas module
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# making new data frame with dropped NA values
new_data = data.dropna(axis=0, how='any')
new_data
Now we compare the sizes of the two data frames to find out how many rows had at least 1 null value.
print("Old data frame length:", len(data))
print("New data frame length:", len(new_data))
print("Number of rows with at least 1 NA value:",
      len(data) - len(new_data))
Output:
Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value: 236
Since the difference is 236, there were 236 rows with at least 1 null value in some column.
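The same count can also be obtained without building a second data frame, by combining isna() with any(). The sketch below uses a small in-memory frame in place of employees.csv. Its column names and values are assumptions made for illustration, not the contents of the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV data.
employees = pd.DataFrame({
    "Name": ["Ann", None, "Cy", "Di"],
    "Salary": [50000.0, 62000.0, np.nan, 58000.0],
    "Team": ["Ops", "Sales", None, "Ops"],
})

# True for every row that has a null in at least one column.
rows_with_na = employees.isna().any(axis=1)
na_row_count = int(rows_with_na.sum())
print("Number of rows with at least 1 NA value:", na_row_count)

# The count matches the length difference used above.
length_difference = len(employees) - len(employees.dropna())
print("Length difference:", length_difference)

# Null counts per column show where the gaps are.
na_per_column = employees.isna().sum().to_dict()
print(na_per_column)
```

The per-column counts help decide whether to drop rows outright or to restrict dropna() to specific columns with subset.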