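How to count occurrences of a specific value in a Pandas column?
In this article, we discuss how to count the occurrences of a specific value in a column of a pandas DataFrame.
The examples use a small DataFrame of student records with name, subjects, marks, and age columns.
We can use the value_counts() method for counting. It is available on both a whole DataFrame and a single column; called on a column, it returns the number of occurrences of each distinct value, and indexing that result by a value gives the count for that value alone.
Syntax:
data['column_name'].value_counts()[value]
where
- data is the input DataFrame
- value is the string/integer value in the column to count
- column_name is the column in the DataFrame
Example: counting occurrences of a specific value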
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 8 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})
# count occurrences of 'sravan' in the name column
print(data['name'].value_counts()['sravan'])
# count occurrences of 'php' in the subjects column
print(data['subjects'].value_counts()['php'])
# count occurrences of 89 in the marks column
print(data['marks'].value_counts()[89])
Output:
3
2
1
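Indexing the result of value_counts() with a value that does not occur in the column raises a KeyError. The sketch below shows two ways to get 0 instead; the label 'nobody' is a made-up value used only for illustration, and only the name column of the example data is reproduced.

```python
import pandas as pd

# Same names as in the example above (only the column needed here)
data = pd.DataFrame({
    'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
             'gnanesh', 'sravan', 'sravan', 'ojaswi'],
})

counts = data['name'].value_counts()

# .get() returns the default instead of raising KeyError for an absent label
sravan_count = int(counts.get('sravan', 0))
missing_count = int(counts.get('nobody', 0))  # 'nobody' is a hypothetical value

# Alternative: compare the column to the value and sum the boolean mask
mask_count = int((data['name'] == 'sravan').sum())

print(sravan_count, missing_count, mask_count)
```

The boolean-mask form avoids building the full table of counts when only one value is of interest.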
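If we want the counts of all values in a particular column, we simply leave out the value.
Syntax:
data['column_name'].value_counts()
Example: counting occurrences of all values in a column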
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 8 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})
# count all values in name column
print(data['name'].value_counts())
# count all values in subjects column
print(data['subjects'].value_counts())
# count all values in marks column
print(data['marks'].value_counts())
# count all values in age column
print(data['age'].value_counts())
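Output:
If we want the result sorted by count in ascending or descending order, we specify the ascending parameter. Descending order is the default.
Syntax:
Ascending order:
data['column_name'].value_counts(ascending=True)
Descending order:
data['column_name'].value_counts(ascending=False)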
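As a minimal sketch of the two orderings, the block below checks where the most frequent name ends up in each case. It reuses only the name column of the example data; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
             'gnanesh', 'sravan', 'sravan', 'ojaswi'],
})

# Smallest counts first, so the most frequent value comes last
counts_asc = data['name'].value_counts(ascending=True)

# Largest counts first; ascending=False is the default behaviour
counts_desc = data['name'].value_counts()

print(counts_asc.index[-1], counts_asc.iloc[-1])
print(counts_desc.index[0], counts_desc.iloc[0])
```

The relative order of values that share the same count is not something to rely on, so the sketch only inspects the single most frequent value.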
Example: getting the results in sorted order
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 8 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})
# count all values in name column in ascending order
print(data['name'].value_counts(ascending=True))
# count all values in subjects column in ascending order
print(data['subjects'].value_counts(ascending=True))
# count all values in marks column in descending order
print(data['marks'].value_counts(ascending=False))
# count all values in age column in descending order
print(data['age'].value_counts(ascending=False))
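Output:
Handling missing values
Here we can count occurrences with or without NA values. The dropna parameter controls this: when it is True (the default), NA values are excluded from the counts; when it is False, NA values are counted as well.
Syntax:
Include NA values:
data['column_name'].value_counts(dropna=False)
Exclude NA values:
data['column_name'].value_counts(dropna=True)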
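The effect of dropna can be seen on a tiny Series. This is a sketch with made-up data (four entries, one of them NaN), not the article's dataset; isna().sum() is shown as a direct way to count the missing entries themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical mini column: three names and one missing entry
names = pd.Series(['sravan', 'bobby', 'sravan', np.nan])

# Default (dropna=True): the NaN entry is ignored
without_na = names.value_counts()

# dropna=False: NaN gets its own row in the result
with_na = names.value_counts(dropna=False)

# Count only the missing entries
na_count = int(names.isna().sum())

print(without_na.sum(), with_na.sum(), na_count)
```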
Example: handling missing values
Python3
# import pandas module
import pandas as pd
# import numpy
import numpy
# create a dataframe
# with 9 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith', 'gnanesh',
'sravan', 'sravan', 'ojaswi', numpy.nan],
'subjects': ['java', 'php', 'java', 'php', 'java', 'html/css',
'python', 'R', numpy.nan],
'marks': [98, 90, 78, 91, 87, 78, 89, 90, numpy.nan],
'age': [11, 23, 23, 21, 21, 21, 23, 21, numpy.nan]
})
# count all values in name column including NA
print(data['name'].value_counts(dropna=False))
# count all values in subjects column including NA
print(data['subjects'].value_counts(dropna=False))
# count all values in marks column including NA
print(data['marks'].value_counts(dropna=False))
# count all values in age column excluding NA
print(data['age'].value_counts(dropna=True))
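Output:
Counting values as relative frequencies
Setting the normalize parameter to True returns the relative frequency of each value (its count divided by the total number of counted values) instead of the raw count.
Syntax:
data['column_name'].value_counts(normalize=True)
Example: counting values as relative frequencies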
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 8 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})
# count all values in name with relative frequencies
print(data['name'].value_counts(normalize=True))
Output:
sravan 0.375
ojaswi 0.125
ojsawi 0.125
bobby 0.125
rohith 0.125
gnanesh 0.125
Name: name, dtype: float64
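Relative frequencies are often easier to read as percentages. A minimal sketch, assuming the same name column as above; the conversion with mul() and round() is an illustrative addition rather than part of the original example.

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
             'gnanesh', 'sravan', 'sravan', 'ojaswi'],
})

# Fractions of the total: 3 of 8 rows are 'sravan'
shares = data['name'].value_counts(normalize=True)

# Convert the fractions to percentages with one decimal place
percent = shares.mul(100).round(1)

print(percent['sravan'])  # 37.5
```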
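Getting summary details
If we want details such as the count, mean, standard deviation, minimum, 25%, 50%, and 75% percentiles, and maximum of a numeric column, we use the describe() method.
Syntax:
data['column_name'].describe()
Example: getting summary details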
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 8 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})
# get summary statistics for the age column
print(data['age'].describe())
Output:
count 8.000000
mean 20.500000
std 3.964125
min 11.000000
25% 21.000000
50% 21.000000
75% 23.000000
max 23.000000
Name: age, dtype: float64
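On a text column, describe() reports a different set of fields that relate directly to counting: the number of entries, the number of distinct values, the most frequent value, and how often it occurs. A short sketch on the name column:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
             'gnanesh', 'sravan', 'sravan', 'ojaswi'],
})

# For object (string) columns the fields are count, unique, top, freq
summary = data['name'].describe()

print(summary['count'], summary['unique'], summary['top'], summary['freq'])
```

Here top and freq give the most frequent name and its number of occurrences without calling value_counts() explicitly.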
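Using size() with groupby()
This returns the number of rows for each distinct value in the specified column.
Syntax:
data.groupby('column_name').size()
Example: counting all occurrences in a specific column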
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 8 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})
# get the size of name
print(data.groupby('name').size())
Output:
name
bobby 1
gnanesh 1
ojaswi 1
ojsawi 1
rohith 1
sravan 3
dtype: int64
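For a column without missing values, groupby().size() and value_counts() carry the same information and differ mainly in ordering: the grouped result is sorted by label, while value_counts() is sorted by count. The sketch below checks this on the name column; sort_index() is used only to line the two results up.

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
             'gnanesh', 'sravan', 'sravan', 'ojaswi'],
})

# Rows per name, ordered alphabetically by name
by_group = data.groupby('name').size()

# Same counts via value_counts(), re-sorted by label for comparison
by_counts = data['name'].value_counts().sort_index()

print(by_group.to_dict() == by_counts.to_dict())  # True
```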
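Using count() with groupby()
This returns, for each distinct value in the grouping column, the number of non-null entries in each of the other columns.
Syntax:
data.groupby('column_name').count()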
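The difference between size() and count() only shows up when a group contains missing values. A minimal sketch with a made-up three-row DataFrame in which one mark is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'sravan' has two rows, one with a missing mark
data = pd.DataFrame({
    'name': ['sravan', 'sravan', 'bobby'],
    'marks': [98, np.nan, 90],
})

# size() counts rows per group, regardless of missing values
sizes = data.groupby('name').size()

# count() counts non-null entries per group, column by column
counts = data.groupby('name').count()

print(sizes['sravan'], counts.loc['sravan', 'marks'])
```

For 'sravan', size() reports 2 rows while count() reports 1 non-null mark.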
Example: counting non-null entries in every column for each value of a column
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 8 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})
# get the count of name across all columns
print(data.groupby('name').count())
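Output:
Using bins
If we want counts within ranges of values rather than for individual values, we apply the bins parameter. It specifies the number of equal-width ranges (bins) into which the values are grouped.
Syntax:
data['column_name'].value_counts(bins=bins)
where
- data is the input DataFrame
- column_name is the numeric column to bin
- bins is the total number of bins
Example: getting counts within ranges of values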
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 8 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})
# get count of age column with 6 bins
print(data['age'].value_counts(bins=6))
# get count of age column with 4 bins
print(data['age'].value_counts(bins=4))
Output:
(19.0, 21.0] 4
(21.0, 23.0] 3
(10.987, 13.0] 1
(17.0, 19.0] 0
(15.0, 17.0] 0
(13.0, 15.0] 0
Name: age, dtype: int64
(20.0, 23.0] 7
(10.987, 14.0] 1
(17.0, 20.0] 0
(14.0, 17.0] 0
Name: age, dtype: int64
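When the ranges need to fall on chosen boundaries rather than equal-width bins, one option is to bin the column with pd.cut() first and then count the resulting intervals. This is a swapped-in technique, not the bins parameter itself, and the edges 10, 15, 20, 25 are arbitrary values picked for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'age': [11, 23, 23, 21, 21, 21, 23, 21],
})

# Assign each age to a right-closed interval: (10, 15], (15, 20], (20, 25]
age_ranges = pd.cut(data['age'], bins=[10, 15, 20, 25])

# Count how many ages fall into each interval, ordered by interval
range_counts = age_ranges.value_counts().sort_index()

print(range_counts)
```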
Using apply()
If we want the value counts for every column of the DataFrame at once, we can use the apply() function to run value_counts() on each column.
Syntax:
data.apply(pd.Series.value_counts)
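Because each column has its own set of values, the combined table contains NaN wherever a value does not occur in a column. The sketch below fills those gaps with 0; it is restricted to the two text columns so that the row labels are all strings, which is a simplification made for this illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['sravan', 'bobby', 'sravan', 'sravan', 'ojaswi'],
    'subjects': ['java', 'php', 'java', 'html/css', 'python'],
})

# One column of counts per original column; rows are the union of all values
all_counts = data.apply(pd.Series.value_counts)

# Replace the NaN gaps with 0 and restore integer counts
all_counts = all_counts.fillna(0).astype(int)

print(all_counts.loc['sravan', 'name'], all_counts.loc['java', 'subjects'])
```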
Example: getting the value counts for every column
Python3
# import pandas module
import pandas as pd
# create a dataframe
# with 5 rows and 4 columns
data = pd.DataFrame({
'name': ['sravan', 'bobby', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'html/css', 'python'],
'marks': [98, 90, 78, 91, 87],
'age': [11, 23, 23, 21, 21]
})
# get the value counts for every column
print(data.apply(pd.Series.value_counts))
Output: