Using pandas and a pivot table: how can I use the per-level (column) groupby sums in the next steps of an analysis?
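I want to know how many samples to draw from each level using the proportional allocation method.
I have 3 levels in total: [Small, Medium, Large].
First, I want to compute the sum of Workers for each of these 3 levels.
Next, I want to find the probability for each of the 3 levels.
Next, I want to multiply that probability by the number of villages in each level to get the number of samples.
And the last step is: the samples are selected as the top villages of each level.
Data: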
Village Workers Level
Aagar 10 Small
Dhagewadi 32 Small
Sherewadi 34 Small
Shindwad 42 Small
Dhokari 84 Medium
Khanapur 65 Medium
Ambikanagar 45 Medium
Takali 127 Large
Gardhani 122 Large
Pi.Khand 120 Large
Pangri 105 Large
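Let me explain; I have attached a photo of my Excel sheet.
Step 1: I want to get the sum of Workers for each level -> Small, Medium, and Large. E.g. (10+32+34+42) = 118 for the Small level.
In the next step, I want to find the probability for each level, rounded to two decimal places.
E.g. (118/786) = 0.15 for the Small level.
Then I multiply the probability by the size of each level (its number of villages) to find the number of samples (villages) to draw from that level.
E.g. for the Medium level the probability is 0.25 and there are 3 Medium villages, so 0.25*3 = 0.75 samples would be drawn from the Medium level.
This is rounded up to the next integer, 0.75 ~ 1 sample from the Medium level, and the top village(s) of that level are taken. So at the Medium level the village "Dhokari" would be selected.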
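To make the arithmetic concrete, here is a minimal plain-Python sketch of the rule described above, applied to all three levels. The names `workers` and `allocation` are made up for illustration, and the numbers are copied from the data table.

```python
import math

# Workers per village, grouped by level, copied from the data table.
workers = {
    "Small": [10, 32, 34, 42],
    "Medium": [84, 65, 45],
    "Large": [127, 122, 120, 105],
}

total = sum(sum(values) for values in workers.values())  # 786

allocation = {}
for level, values in workers.items():
    level_sum = sum(values)
    probability = round(level_sum / total, 2)
    # Round up to the next whole village, as described in the question.
    n_samples = math.ceil(probability * len(values))
    allocation[level] = (level_sum, probability, n_samples)
    print(level, level_sum, probability, n_samples)

# Small 118 0.15 1
# Medium 194 0.25 1
# Large 474 0.6 3
```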
Here is what I have done so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("/home/desktop/Desktop/t.csv")
# DataFrame.sort was removed in pandas 0.20; sort_values is the replacement
df = df.sort_values('Workers', ascending=True)
df['level'] = pd.qcut(df['Workers'], 3, labels=['Small', 'Medium', 'Large'])
df
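As a side check, the sketch below suggests that `pd.qcut` with three quantile bins reproduces the Level column shown in the data table for this particular set of Workers values. It builds the Workers column inline instead of reading the CSV, and the names `workers` and `level` are only illustrative.

```python
import pandas as pd

# Workers values in the same row order as the data table.
workers = pd.Series([10, 32, 34, 42, 84, 65, 45, 127, 122, 120, 105],
                    name="Workers")

# Split into three equal-frequency bins and label them from low to high.
level = pd.qcut(workers, q=3, labels=["Small", "Medium", "Large"])

# Count how many villages fall into each bin.
print(level.value_counts().sort_index())
```

With other data the bin edges would move, so the labels are not guaranteed to match a hand-assigned Level column.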
I am using this command to get the sum for each level. I am confused about what to do next:
df=df.groupby(['level'])['Workers'].aggregate(['sum']).unstack()
Is it possible to get the village names in Python, the way I did in Excel?
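One thing to watch for in the command above is that it assigns the grouped result back to `df`, so the village rows are no longer available afterwards. A possible way around that, sketched here under the assumption that the level column is named `Level` as in the data table, is to keep the per-level totals in a separate frame. The name `summary` and its `Probability` and `Samples` columns are made up for this example.

```python
import numpy as np
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame({
    "Village": ["Aagar", "Dhagewadi", "Sherewadi", "Shindwad",
                "Dhokari", "Khanapur", "Ambikanagar",
                "Takali", "Gardhani", "Pi.Khand", "Pangri"],
    "Workers": [10, 32, 34, 42, 84, 65, 45, 127, 122, 120, 105],
    "Level": ["Small"] * 4 + ["Medium"] * 3 + ["Large"] * 4,
})

# One row per level: total Workers and number of villages.
# sort=False keeps the levels in their order of appearance.
summary = df.groupby("Level", sort=False)["Workers"].agg(["sum", "count"])

# Share of all workers per level, rounded to two decimals.
summary["Probability"] = (summary["sum"] / summary["sum"].sum()).round(2)

# Villages to draw per level, rounded up to the next integer.
summary["Samples"] = np.ceil(summary["Probability"] * summary["count"]).astype(int)

print(summary)
```

Because `df` is left untouched, the village names stay available for the selection step.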
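You can use:
# total Workers of the level, repeated on every row of that level
df['Sum_Level_wise'] = df.groupby('Level')['Workers'].transform('sum')
# share of all workers that belongs to the level, rounded to 2 decimals
df['Probability'] = df['Sum_Level_wise'].div(df['Workers'].sum()).round(2)
# probability multiplied by the number of villages in the level
df['Sample'] = df['Probability'] * df.groupby('Level')['Workers'].transform('size')
# round up to the next whole village
df['Selected villages'] = df['Sample'].apply(np.ceil).astype(int)
# take the first n villages of each level; the parentheses allow the chained calls to span lines
df['Selected village'] = (df.groupby('Level')
                            .apply(lambda x: x['Village'].head(x['Selected villages'].iat[0]))
                            .reset_index(level=0)['Village'])
df['Selected village'] = df['Selected village'].fillna('')
print(df)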
Village Workers Level Sum_Level_wise Probability Sample \
0 Aagar 10 Small 118 0.15 0.60
1 Dhagewadi 32 Small 118 0.15 0.60
2 Sherewadi 34 Small 118 0.15 0.60
3 Shindwad 42 Small 118 0.15 0.60
4 Dhokari 84 Medium 194 0.25 0.75
5 Khanapur 65 Medium 194 0.25 0.75
6 Ambikanagar 45 Medium 194 0.25 0.75
7 Takali 127 Large 474 0.60 2.40
8 Gardhani 122 Large 474 0.60 2.40
9 Pi.Khand 120 Large 474 0.60 2.40
10 Pangri 105 Large 474 0.60 2.40
Selected villages Selected village
0 1 Aagar
1 1
2 1
3 1
4 1 Dhokari
5 1
6 1
7 3 Takali
8 3 Gardhani
9 3 Pi.Khand
10 3
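If only the selected villages are needed, a shorter variant may be enough. The sketch below is an alternative to the code above rather than part of it: it rebuilds the sample data inline and uses `cumcount` to number the rows inside each level, keeping those whose position is below the level's sample count. The names `grouped`, `prob`, `n_select` and `selected` are illustrative, and "top" is assumed to mean "first in the existing row order", as in the output above.

```python
import numpy as np
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame({
    "Village": ["Aagar", "Dhagewadi", "Sherewadi", "Shindwad",
                "Dhokari", "Khanapur", "Ambikanagar",
                "Takali", "Gardhani", "Pi.Khand", "Pangri"],
    "Workers": [10, 32, 34, 42, 84, 65, 45, 127, 122, 120, 105],
    "Level": ["Small"] * 4 + ["Medium"] * 3 + ["Large"] * 4,
})

grouped = df.groupby("Level", sort=False)

# Per-row probability of the row's level, rounded to two decimals.
prob = (grouped["Workers"].transform("sum") / df["Workers"].sum()).round(2)

# Per-row number of villages to draw from the row's level, rounded up.
n_select = np.ceil(prob * grouped["Workers"].transform("size")).astype(int)

# cumcount numbers rows 0, 1, 2, ... within each level in their existing order.
selected = df.loc[grouped.cumcount() < n_select, ["Level", "Village"]]
print(selected)
```

Sorting `df` before grouping would change which villages count as the first ones in each level.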
You can try debugging with a custom function:
def f(x):
    # number of villages requested for this level
    n = x['Selected villages'].iat[0]
    a = x['Village'].head(n)
    print(x['Village'])
    print(a)
    # head(n) silently returns fewer rows if the level has fewer than n villages
    if len(x) < n:
        print('original village cannot be filled to Selected village, because length is higher')
    return a

df['Selected village'] = df.groupby('Level').apply(f).reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')