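Systematic Sampling in Pandas
Sampling is a method of extracting a subset (a sample) from a given data set and studying that sample instead of examining every item in the data. For example, suppose someone at a university wants to know the average height of the students enrolled there. One approach is to collect data on every student and compute the average, but that is very time-consuming. Sampling offers a shortcut: select students at random from the canteen during a break, measure their heights, and compute the average height from that subset.
Types of sampling:
Systematic sampling
Systematic sampling is a type of probability sampling in which the researcher studies target data drawn from a larger data set. The target data is chosen by picking a random starting point and then selecting each subsequent element at a fixed interval. In this way a small subset (the sample) is extracted from the large data set.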
Suppose the data has size D and we want a sample of size N. Then, in systematic sampling:
Interval = D / N
Let D / N = J.
If the first element, E, is chosen at random (from the first J elements, so that N elements fit), the next element of the sample is E + J, then E + 2J, and so on.
Example: total size of data = 50 (elements 1 to 50)
Required sample size = 5
Interval = 50 / 5 = 10
This means the sampled elements are spaced 10 apart.
Suppose the randomly chosen first element of the sample is 5.
Then the next elements are:
5 + 10 = 15
15 + 10 = 25
25 + 10 = 35
35 + 10 = 45
So,
Sample = {5, 15, 25, 35, 45}
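The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper name systematic_elements and its parameters are hypothetical, and it assumes the data size is an exact multiple of the sample size.

```python
def systematic_elements(data_size, sample_size, first_element):
    """Return the elements picked by systematic sampling from 1..data_size.

    Hypothetical helper for illustration; assumes data_size is a multiple
    of sample_size and first_element lies in the first interval.
    """
    interval = data_size // sample_size          # J = D / N
    if not 1 <= first_element <= interval:
        raise ValueError("first_element must lie between 1 and the interval")
    # E, E + J, E + 2J, ... for sample_size elements
    return [first_element + j * interval for j in range(sample_size)]


sample = systematic_elements(data_size=50, sample_size=5, first_element=5)
print(sample)  # [5, 15, 25, 35, 45]
```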
Approach:
- Take the data.
- Extract a systematic sample from the large data set.
- Print the mean of the sample data.
Program:
Python3
# Import the libraries used below
import numpy as np
import pandas as pd
from IPython.display import display  # built into Jupyter; use print() in a plain script
# Define total number of students
number_of_students = 15
# Create data dictionary
data = {'Id': np.arange(1, number_of_students + 1).tolist(),
        'height': [159, 171, 158, 162, 162, 177, 160, 175,
                   168, 171, 178, 178, 173, 177, 164]}
# Transform dictionary into a data frame
df = pd.DataFrame(data)
display(df)
# Define systematic sampling function
def systematic_sampling(df, step):
    # Row positions 0, step, 2 * step, ... up to the end of the frame
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, 3)
# View sampled data frame
display(systematic_sample)
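For a fixed step starting at the first row, the same selection can also be written with pandas positional slicing. The sketch below assumes the same 15-student data as the program above and checks that df.iloc[::3] picks the same rows as the np.arange version. It is an alternative spelling, not a different method.

```python
import numpy as np
import pandas as pd

number_of_students = 15
df = pd.DataFrame({
    'Id': np.arange(1, number_of_students + 1).tolist(),
    'height': [159, 171, 158, 162, 162, 177, 160, 175,
               168, 171, 178, 178, 173, 177, 164],
})

step = 3
# Version used in the program above: explicit array of row positions
by_positions = df.iloc[np.arange(0, len(df), step=step)]
# Equivalent slice: start at row 0 and take every step-th row
by_slice = df.iloc[::step]

print(by_slice['Id'].tolist())  # [1, 4, 7, 10, 13]
print(by_positions['Id'].tolist() == by_slice['Id'].tolist())  # True
```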
Example: printing the mean of the sample data
Python3
# Import the libraries used below
import numpy as np
import pandas as pd
from IPython.display import display  # built into Jupyter; use print() in a plain script
# Define total number of students
number_of_students = 15
# Create data dictionary
data = {'Id': np.arange(1, number_of_students + 1).tolist(),
        'height': [159, 171, 158, 162, 162, 177, 160, 175,
                   168, 171, 178, 178, 173, 177, 164]}
# Transform dictionary into a data frame
df = pd.DataFrame(data)
# Define systematic sampling function
def systematic_sampling(df, step):
    # Row positions 0, step, 2 * step, ... up to the end of the frame
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, 3)
# View sampled data frame
display(systematic_sample)
# Empty print statement for a blank line
print()
# Compute the sample mean, rounded to the nearest integer
systematic_data = round(systematic_sample['height'].mean())
print("Average Height in cm: ", systematic_data)
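To judge how well the sample represents the data, the sample mean can be compared with the mean of all 15 students. The sketch below reuses the heights from the example; the variable names are illustrative only.

```python
import pandas as pd

heights = pd.Series([159, 171, 158, 162, 162, 177, 160, 175,
                     168, 171, 178, 178, 173, 177, 164])

# Systematic sample: every third student, starting from the first
sample = heights.iloc[::3]

population_mean = heights.mean()
sample_mean = sample.mean()

print("Population mean:", round(population_mean, 2))  # 168.87
print("Sample mean:", round(sample_mean, 2))          # 165.0
```

Here the sample mean falls a few centimeters below the mean of the full data; a different starting row would give a different estimate.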
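Types of systematic sampling
Systematic sampling comes in the following three types:
Systematic random sampling:
In systematic random sampling, a random starting point is chosen, and systematic sampling is then applied from that starting point.
Approach:
- Get the data
- Choose a random starting point
- Apply the systematic method to the data
- Perform the desired operation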
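Because the starting point is random, each run can produce a different sample. For repeatable results, one option is to draw the start from a seeded generator. The sketch below uses Python's random.Random with an arbitrary seed of 42 and assumes the start should fall in the first interval (positions 0 to interval - 1); both choices are assumptions made for illustration.

```python
import random

data_size = 30
sample_size = 6
interval = data_size // sample_size      # 5

rng = random.Random(42)                  # arbitrary seed, for repeatability
start = rng.randrange(interval)          # random position in 0..interval-1

# Positions start, start + interval, ... up to the end of the data
positions = list(range(start, data_size, interval))
print(start, positions)

# Re-creating the generator with the same seed gives the same start
assert random.Random(42).randrange(interval) == start
```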
Example:
Python3
# Import the libraries used below
import numpy as np
import pandas as pd
import random
from IPython.display import display  # built into Jupyter; use print() in a plain script
# Define total number of houses
number_of_house = 30
# Create data dictionary
data = {'house_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                         14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
                         24, 25, 26, 27, 28, 29, 30],
        'number_of_children': [2, 2, 1, 3, 2, 1, 4, 1, 3, 5, 4, 3, 5,
                               3, 2, 1, 2, 3, 4, 5, 3, 4, 5, 2, 2, 2,
                               2, 3, 2, 1]}
# Transform dictionary into a data frame
df = pd.DataFrame(data)
# Define the size of the systematic sample
size_of_systematic_sample = 6
# Define the interval (gap) between sampled rows
interval = (number_of_house // size_of_systematic_sample)
# Choose a random starting position within the first interval,
# so that the sample always contains size_of_systematic_sample rows
random_number = random.randint(0, interval - 1)
# Define systematic sampling function
def systematic_sampling(df, step, start):
    # Row positions start, start + step, ... up to the end of the frame
    indexes = np.arange(start, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, interval, random_number)
# View sampled data frame
display(systematic_sample)
# Empty print statement for a blank line
print()
# Compute the sample mean, rounded to the nearest integer
systematic_data = round(systematic_sample['number_of_children'].mean())
# Print the average number of children
print("Average Number Of Children in Locality: ", systematic_data)
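Linear systematic sampling:
Linear systematic sampling is a form of systematic sampling in which the sample is selected linearly: elements are picked from the data at a fixed interval in a single pass, and the desired operation is then performed on the selected sample.
Elements are selected at positions from the random start (starting_random_number) up to the last position, len(df) - 1.
Approach:
- Get the data
- Select elements from the data set at a fixed interval
- Perform the desired operation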
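One property is worth seeing concretely: when the data size is an exact multiple of the sample size, each possible start in the first interval gives a sample of the same size, and together the samples cover every position exactly once. The sketch below enumerates them for 30 items and a sample size of 5. It is an illustration with hypothetical variable names, not part of the example program.

```python
import numpy as np

data_size = 30
sample_size = 5
interval = data_size // sample_size   # 6

# One candidate sample per possible starting position 0..interval-1
samples = {start: np.arange(start, data_size, interval).tolist()
           for start in range(interval)}

for start, positions in samples.items():
    print(start, positions)

# Every candidate sample has the requested size ...
print(all(len(p) == sample_size for p in samples.values()))   # True
# ... and together they use each position exactly once
all_positions = sorted(pos for p in samples.values() for pos in p)
print(all_positions == list(range(data_size)))                # True
```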
Example:
Python3
# Import the libraries used below
import numpy as np
import pandas as pd
import random
from IPython.display import display  # built into Jupyter; use print() in a plain script
# Define total number of boxes
number_of_boxes = 30
# Create data dictionary
data = {'Box_Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                       15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
                       27, 28, 29, 30],
        'Defective_Bulbs': [2, 2, 1, 0, 2, 1, 0, 1, 3, 5, 4, 3, 5, 3,
                            0, 1, 2, 0, 4, 5, 3, 4, 5, 2, 0, 3, 2, 0,
                            5, 4]}
# Transform dictionary into a data frame
df = pd.DataFrame(data)
# Size of the systematic sample
size_systematic_sample = 5
# Interval (gap) between sampled rows
interval = (number_of_boxes // size_systematic_sample)
# Choose a random starting position within the first interval,
# so that the sample always contains size_systematic_sample rows
random_number = random.randint(0, interval - 1)
# Define systematic sampling function
def systematic_sampling(df, step, start):
    # Row positions start, start + step, ... through the last row (len(df) - 1)
    indexes = np.arange(start, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, interval, random_number)
# View sampled data frame
display(systematic_sample)
# Empty print statement for a blank line
print()
# Compute the sample mean, rounded to the nearest integer
systematic_data = round(systematic_sample['Defective_Bulbs'].mean())
# Print the average number of defective bulbs
print("Average Number Of Defective Bulbs: ", systematic_data)
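Circular systematic sampling:
In circular systematic sampling, selection wraps around: when the end of the data is reached while picking elements at the fixed interval, selection continues from the beginning until the required number of elements has been chosen. The desired operation is then performed on all the data selected this way.
Approach:
- Get the data
- Select the sample systematically
- On reaching the end, continue from the beginning
- Perform the desired operation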
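The wrap-around step can be sketched with modular arithmetic on positions. The example assumes 11 items, a sample of 4, and an interval of 3 (11 / 4 rounded to the nearest whole number); the helper circular_positions is hypothetical and the numbers are chosen only for illustration.

```python
def circular_positions(data_size, sample_size, interval, start):
    """Positions (0-based) chosen by circular systematic sampling.

    Hypothetical helper: after passing the last position, counting
    continues from position 0 until sample_size positions are chosen.
    """
    return [(start + j * interval) % data_size for j in range(sample_size)]


# Start near the end so that the selection has to wrap around
positions = circular_positions(data_size=11, sample_size=4, interval=3, start=7)
print(positions)  # [7, 10, 2, 5]
```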
Program:
Python3
# Import the libraries used below
import numpy as np
import pandas as pd
import random
from IPython.display import display  # built into Jupyter; use print() in a plain script
# Define total number of houses
number_of_house = 30
# Create data dictionary
data = {'house_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                         14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
                         24, 25, 26, 27, 28, 29, 30],
        'number_of_Adults': [2, 2, 5, 3, 2, 8, 4, 7, 8, 5, 4, 9, 5,
                             4, 2, 3, 2, 3, 4, 5, 6, 4, 5, 4, 2, 6,
                             2, 3, 2, 2]}
# Transform dictionary into a data frame
df = pd.DataFrame(data)
# Define the size of the systematic sample
size_of_systematic_sample = 6
# Define the interval (gap) between sampled rows
interval = (number_of_house // size_of_systematic_sample)
# Choose a random starting position anywhere in the data
random_number = random.randint(0, number_of_house - 1)
# Define circular systematic sampling function
def systematic_sampling(df, step, start, size):
    # Step through the rows and wrap around to the beginning after the last row
    indexes = (start + step * np.arange(size)) % len(df)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, interval, random_number,
                                        size_of_systematic_sample)
# View sampled data frame
display(systematic_sample)
# Empty print statement for a blank line
print()
# Compute the sample mean, rounded to the nearest integer
systematic_data = round(systematic_sample['number_of_Adults'].mean())
# Print the average number of adults
print("Average Number Of Adults in Locality: ", systematic_data)