How can I assign a value picked from a list to a df column based on a probability distribution?
import pandas as pd
d={'Country':['Algeria', 'France', 'Italy']*10, 'Input category':[1,2]*15, 'Output category':[0,0,0]*10}
df=pd.DataFrame(d)
df.sort_values(['Country', 'Input category']).reset_index(drop=True)
Country Input category Output category
0 Algeria 1 0
1 Algeria 1 0
2 Algeria 1 0
3 Algeria 1 0
4 Algeria 1 0
5 Algeria 2 0
6 Algeria 2 0
7 Algeria 2 0
8 Algeria 2 0
9 Algeria 2 0
10 France 1 0
11 France 1 0
12 France 1 0
13 France 1 0
14 France 1 0
15 France 2 0
16 France 2 0
17 France 2 0
18 France 2 0
19 France 2 0
20 Italy 1 0
21 Italy 1 0
22 Italy 1 0
23 Italy 1 0
24 Italy 1 0
25 Italy 2 0
26 Italy 2 0
27 Italy 2 0
28 Italy 2 0
29 Italy 2 0
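I have a dataset with many rows; each row is an individual from a country with an input category (1 or 2).
Each unique row appears 5 times (the same row 5 times, then the next row 5 times, and so on).
I want to create a new column in my df (say, output) and assign it another value (also 1 or 2) according to a conditional distribution.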
d={'Country': ['Algeria', 'France', 'Italy'] , 'p1_1':[2/5,1/5,1/5], 'p2_1':[3/5,4/5,4/5], 'p1_2':[2/5,3/5,5/5], 'p2_2':[3/5,2/5,0]}
cond_prob=pd.DataFrame(d)
cond_prob
Country p1_1 p2_1 p1_2 p2_2
0 Algeria 0.4 0.6 0.4 0.6
1 France 0.2 0.8 0.6 0.4
2 Italy 0.2 0.8 1.0 0.0
For example, because for Algeria p1_1 (P(output = 1 | input = 1)) = 2/5, I want to assign output 1 to 2 of my rows (and therefore output 2 to the remaining 3 rows).
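The arithmetic in this example can be sketched as a small table of row counts. This is only an illustration, assuming column names follow the pattern `p<output>_<input>` and that every (Country, Input category) group has exactly `n = 5` rows; the names `long`, `parts`, and `rows` are made up for the sketch.

```python
import pandas as pd

n = 5  # assumed rows per (Country, Input category) group, as stated in the question

cond_prob = pd.DataFrame({
    'Country': ['Algeria', 'France', 'Italy'],
    'p1_1': [2/5, 1/5, 1/5], 'p2_1': [3/5, 4/5, 4/5],
    'p1_2': [2/5, 3/5, 5/5], 'p2_2': [3/5, 2/5, 0],
})

# Long format: one row per (Country, probability column)
long = cond_prob.melt('Country', var_name='name', value_name='p')

# Parse 'p<output>_<input>' into two integer columns
parts = long['name'].str.extract(r'p(?P<output>\d+)_(?P<input>\d+)').astype(int)
long = long.join(parts)

# Number of rows that should receive each output value
long['rows'] = (long['p'] * n).round().astype(int)

rows = long.set_index(['Country', 'input', 'output'])['rows'].sort_index()
print(rows.loc['Algeria'])
```

For Algeria with input 1, this gives 2 rows for output 1 and 3 rows for output 2, matching the sentence above.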
Edited: this is the expected output:
Country Input category Output category
0 Algeria 1 1
1 Algeria 1 1
2 Algeria 1 1
3 Algeria 1 2
4 Algeria 1 2
5 Algeria 2 1
6 Algeria 2 1
7 Algeria 2 1
8 Algeria 2 2
9 Algeria 2 2
10 France 1 1
11 France 1 2
12 France 1 2
13 France 1 2
14 France 1 2
15 France 2 1
16 France 2 1
17 France 2 1
18 France 2 2
19 France 2 2
20 Italy 1 1
21 Italy 1 2
22 Italy 1 2
23 Italy 1 2
24 Italy 1 2
25 Italy 2 1
26 Italy 2 1
27 Italy 2 1
28 Italy 2 1
29 Italy 2 1
IIUC,
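n = 5  # rows per (Country, Input category) group
# Alternatively, derive n from the data:
# s = df['Country'].value_counts()
# assert s.nunique() == 1
# n = s.iloc[0] // df['Input category'].nunique()
# print(n)
# 5
df = df.sort_values(['Country', 'Input category']).reset_index(drop=True)
# Long format, keeping the p1_1, p2_1, p1_2, p2_2 order within each country
df2 = cond_prob.melt('Country').sort_values('Country', kind='stable')
# Repeat each probability row round(p * n) times (repeat needs integers)
counts = df2['value'].mul(n).round().astype(int)
df['Output category'] = (df2.reindex(df2.index.repeat(counts))['variable']
                         .str.extract(r'(\d+)')[0].astype(int).to_numpy())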
print(df)
Country Input category Output category
0 Algeria 1 1
1 Algeria 1 1
2 Algeria 1 2
3 Algeria 1 2
4 Algeria 1 2
5 Algeria 2 1
6 Algeria 2 1
7 Algeria 2 2
8 Algeria 2 2
9 Algeria 2 2
10 France 1 1
11 France 1 2
12 France 1 2
13 France 1 2
14 France 1 2
15 France 2 1
16 France 2 1
17 France 2 1
18 France 2 2
19 France 2 2
20 Italy 1 1
21 Italy 1 2
22 Italy 1 2
23 Italy 1 2
24 Italy 1 2
25 Italy 2 1
26 Italy 2 1
27 Italy 2 1
28 Italy 2 1
29 Italy 2 1
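One way to check the result is to compare the observed output frequencies per group with `cond_prob`. The sketch below rebuilds the data from the question, applies the same repeat-based assignment, and uses `pd.crosstab` with `normalize='index'`; the variable name `freq` is an assumption for illustration.

```python
import pandas as pd

n = 5
df = pd.DataFrame({
    'Country': ['Algeria', 'France', 'Italy'] * 10,
    'Input category': [1, 2] * 15,
})
cond_prob = pd.DataFrame({
    'Country': ['Algeria', 'France', 'Italy'],
    'p1_1': [2/5, 1/5, 1/5], 'p2_1': [3/5, 4/5, 4/5],
    'p1_2': [2/5, 3/5, 5/5], 'p2_2': [3/5, 2/5, 0],
})

df = df.sort_values(['Country', 'Input category']).reset_index(drop=True)
df2 = cond_prob.melt('Country').sort_values('Country', kind='stable')
counts = df2['value'].mul(n).round().astype(int)
df['Output category'] = (df2.reindex(df2.index.repeat(counts))['variable']
                         .str.extract(r'(\d+)')[0].astype(int).to_numpy())

# Share of each output value within every (Country, Input category) group
freq = pd.crosstab([df['Country'], df['Input category']],
                   df['Output category'], normalize='index')
print(freq)
```

Each row of `freq` should reproduce the matching pair of probabilities, for example 0.4 and 0.6 for Algeria with input 1.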
If you need to sort by Input category:
df2 = cond_prob.melt('Country').sort_values('Country', kind='stable')
df2 = df2.reindex(df2.index.repeat(df2['value'].mul(5).round().astype(int)))
# Split 'p<output>_<input>' into its two parts, then order by input category
parts = (df2['variable'].str.split('_', expand=True)
         .set_axis(['Output category', 'Input category'], axis=1))
values = (df2.assign(**parts)
          .sort_values(['Country', 'Input category'])['Output category']
          .str.extract(r'(\d+)')[0].astype(int).to_numpy())
df['Output category'] = values
print(df)
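Both snippets above rely on every group having the same number of rows. A possible variant that does not need a fixed `n` is sketched below: it merges P(output = 1) onto each row and uses the row's position within its group. This is not the approach from the answer, just an illustration; the names `p1`, `merged`, `pos`, `size`, and `n_ones` are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['Algeria', 'France', 'Italy'] * 10,
    'Input category': [1, 2] * 15,
})
cond_prob = pd.DataFrame({
    'Country': ['Algeria', 'France', 'Italy'],
    'p1_1': [2/5, 1/5, 1/5], 'p2_1': [3/5, 4/5, 4/5],
    'p1_2': [2/5, 3/5, 5/5], 'p2_2': [3/5, 2/5, 0],
})
keys = ['Country', 'Input category']

# P(output = 1 | Country, input) in long format; the input is the last digit
p1 = cond_prob.melt('Country', value_vars=['p1_1', 'p1_2'], value_name='p1')
p1['Input category'] = p1['variable'].str[-1].astype(int)
p1 = p1.drop(columns='variable')

df = df.sort_values(keys, kind='stable').reset_index(drop=True)
merged = df.merge(p1, on=keys, how='left')  # left merge keeps df's row order

pos = merged.groupby(keys).cumcount()                # 0, 1, 2, ... within each group
size = merged.groupby(keys)['p1'].transform('size')  # rows in each group
n_ones = (merged['p1'] * size).round().astype(int)   # rows that get output 1

# The first n_ones rows of each group get output 1, the rest get output 2
df['Output category'] = np.where(pos < n_ones, 1, 2)
print(df)
```

With the sample data this yields the same assignment as the repeat-based version, since every group has 5 rows.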