使用组和条件中的第一个值创建新列(pandas、python、groupby)
creating new column using first value within group and conditions(pandas, python, groupby)
我正在尝试创建一个新列 (tpost),其中值来自基于函数条件的列 (post)。但是,新列中的值返回为空。
目标是在排除place=1的条件下使用同组的第一个值,但是仍然保留列中place=1的值(tpost)
我对缺失值的尝试:
╔═══════════╦═════════╦═════════╦═══════╦════╦═══════╦═══════╗
║ id ║ subject ║ class ║ place ║ hf ║ post ║ tpost ║
╠═══════════╬═════════╬═════════╬═══════╬════╬═══════╬═══════╣
║ 202006106 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.048 ║ 0.048 ║
║ 202006106 ║ B_Math ║ Class_4 ║ 6 ║ 1 ║ 0.045 ║ 0.048 ║
║ 202006106 ║ B_Math ║ Class_4 ║ 7 ║ 1 ║ 0.043 ║ 0.048 ║
║ 202007153 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.042 ║ 0.042 ║
║ 202007155 ║ B_Math ║ Class_4 ║ 6 ║ 1 ║ 0.040 ║ 0.040 ║
║ 202009094 ║ B_Math ║ Class_4 ║ 2 ║ 1 ║ 0.038 ║ 0.038 ║
║ 202009094 ║ B_Math ║ Class_4 ║ 8 ║ 1 ║ 0.037 ║ 0.038 ║
║ 202009095 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.036 ║ 0.036 ║
║ 202010143 ║ B_Math ║ Class_4 ║ 2 ║ 1 ║ 0.034 ║ 0.034 ║
║ 202010143 ║ B_Math ║ Class_4 ║ 7 ║ 1 ║ 0.033 ║ 0.034 ║
║ 202010145 ║ B_Math ║ Class_4 ║ 1 ║ 1 ║ 0.065 ║ ║
╚═══════════╩═════════╩═════════╩═══════╩════╩═══════╩═══════╝
正确结果:
╔═══════════╦═════════╦═════════╦═══════╦════╦═══════╦═══════╗
║ id ║ subject ║ class ║ place ║ hf ║ post ║ tpost ║
╠═══════════╬═════════╬═════════╬═══════╬════╬═══════╬═══════╣
║ 202006106 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.048 ║ 0.048 ║
║ 202006106 ║ B_Math ║ Class_4 ║ 6 ║ 1 ║ 0.045 ║ 0.048 ║
║ 202006106 ║ B_Math ║ Class_4 ║ 7 ║ 1 ║ 0.043 ║ 0.048 ║
║ 202007153 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.042 ║ 0.042 ║
║ 202007155 ║ B_Math ║ Class_4 ║ 6 ║ 1 ║ 0.040 ║ 0.040 ║
║ 202009094 ║ B_Math ║ Class_4 ║ 2 ║ 1 ║ 0.038 ║ 0.038 ║
║ 202009094 ║ B_Math ║ Class_4 ║ 8 ║ 1 ║ 0.037 ║ 0.038 ║
║ 202009095 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.036 ║ 0.036 ║
║ 202010143 ║ B_Math ║ Class_4 ║ 2 ║ 1 ║ 0.034 ║ 0.034 ║
║ 202010143 ║ B_Math ║ Class_4 ║ 7 ║ 1 ║ 0.033 ║ 0.034 ║
║ 202010145 ║ B_Math ║ Class_4 ║ 1 ║ 1 ║ 0.065 ║ 0.065 ║
╚═══════════╩═════════╩═════════╩═══════╩════╩═══════╩═══════╝
这是我使用过的 2 个代码:
code1:
df=(df.set_index(['id', 'subject', 'class', 'hf'])
.assign(tpost=df[df['place']>1].groupby(['id', 'subject', 'class', 'hf'])['post'].first())
.reset_index())
code2:
df=(df[df['place']>1].groupby(['id', 'subject', 'class', 'hf']).post.transform('first')
.pipe(lambda x: df.assign(tpost = x))
.fillna(''))
orig df code:
data = {'id':['202006106','202006106','202006106','202007153','202007155','202009094',
'202009094','202009095','202010143','202010143','202010145'],
'subject':['B_Math','B_Math','B_Math','B_Math','B_Math','B_Math',
'B_Math','B_Math','B_Math','B_Math','B_Math'],
'class':['Class_4','Class_4','Class_4','Class_4','Class_4','Class_4',
'Class_4','Class_4','Class_4','Class_4','Class_4'],
'hf':[1,1,1,1,1,1,1,1,1,1,1],
'place':[4,6,7,4,6,2,8,4,2,7,1],
'post':[0.048,0.045,0.043,0.042,0.040,0.038,0.037,0.036,0.034,0.033,0.065]}
df = pd.DataFrame(data)
请指教。非常感谢
这是一种方法。您可以屏蔽位置 1 的值并在 groupby
+transform
:
之后恢复它们
group = df.groupby(['id', 'subject', 'class', 'hf']).ngroup()
new = df['post'].mask(df['place'].eq(1)).groupby(group).transform('first')
df['tpost'] = new.mask(new.isna(), df['post'])
输出:
id subject class hf place post tpost
0 202006106 B_Math Class_4 1 4 0.048 0.048
1 202006106 B_Math Class_4 1 6 0.045 0.048
2 202006106 B_Math Class_4 1 7 0.043 0.048
3 202007153 B_Math Class_4 1 4 0.042 0.042
4 202007155 B_Math Class_4 1 6 0.040 0.040
5 202009094 B_Math Class_4 1 2 0.038 0.038
6 202009094 B_Math Class_4 1 8 0.037 0.038
7 202009095 B_Math Class_4 1 4 0.036 0.036
8 202010143 B_Math Class_4 1 2 0.034 0.034
9 202010143 B_Math Class_4 1 7 0.033 0.034
10 202010145 B_Math Class_4 1 1 0.065 0.065
我正在尝试创建一个新列 (tpost),其中值来自基于函数条件的列 (post)。但是,新列中的值返回为空。
目标是在排除place=1的条件下使用同组的第一个值,但是仍然保留列中place=1的值(tpost)
我对缺失值的尝试:
╔═══════════╦═════════╦═════════╦═══════╦════╦═══════╦═══════╗ ║ id ║ subject ║ class ║ place ║ hf ║ post ║ tpost ║ ╠═══════════╬═════════╬═════════╬═══════╬════╬═══════╬═══════╣ ║ 202006106 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.048 ║ 0.048 ║ ║ 202006106 ║ B_Math ║ Class_4 ║ 6 ║ 1 ║ 0.045 ║ 0.048 ║ ║ 202006106 ║ B_Math ║ Class_4 ║ 7 ║ 1 ║ 0.043 ║ 0.048 ║ ║ 202007153 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.042 ║ 0.042 ║ ║ 202007155 ║ B_Math ║ Class_4 ║ 6 ║ 1 ║ 0.040 ║ 0.040 ║ ║ 202009094 ║ B_Math ║ Class_4 ║ 2 ║ 1 ║ 0.038 ║ 0.038 ║ ║ 202009094 ║ B_Math ║ Class_4 ║ 8 ║ 1 ║ 0.037 ║ 0.038 ║ ║ 202009095 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.036 ║ 0.036 ║ ║ 202010143 ║ B_Math ║ Class_4 ║ 2 ║ 1 ║ 0.034 ║ 0.034 ║ ║ 202010143 ║ B_Math ║ Class_4 ║ 7 ║ 1 ║ 0.033 ║ 0.034 ║ ║ 202010145 ║ B_Math ║ Class_4 ║ 1 ║ 1 ║ 0.065 ║ ║ ╚═══════════╩═════════╩═════════╩═══════╩════╩═══════╩═══════╝
正确结果:
╔═══════════╦═════════╦═════════╦═══════╦════╦═══════╦═══════╗ ║ id ║ subject ║ class ║ place ║ hf ║ post ║ tpost ║ ╠═══════════╬═════════╬═════════╬═══════╬════╬═══════╬═══════╣ ║ 202006106 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.048 ║ 0.048 ║ ║ 202006106 ║ B_Math ║ Class_4 ║ 6 ║ 1 ║ 0.045 ║ 0.048 ║ ║ 202006106 ║ B_Math ║ Class_4 ║ 7 ║ 1 ║ 0.043 ║ 0.048 ║ ║ 202007153 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.042 ║ 0.042 ║ ║ 202007155 ║ B_Math ║ Class_4 ║ 6 ║ 1 ║ 0.040 ║ 0.040 ║ ║ 202009094 ║ B_Math ║ Class_4 ║ 2 ║ 1 ║ 0.038 ║ 0.038 ║ ║ 202009094 ║ B_Math ║ Class_4 ║ 8 ║ 1 ║ 0.037 ║ 0.038 ║ ║ 202009095 ║ B_Math ║ Class_4 ║ 4 ║ 1 ║ 0.036 ║ 0.036 ║ ║ 202010143 ║ B_Math ║ Class_4 ║ 2 ║ 1 ║ 0.034 ║ 0.034 ║ ║ 202010143 ║ B_Math ║ Class_4 ║ 7 ║ 1 ║ 0.033 ║ 0.034 ║ ║ 202010145 ║ B_Math ║ Class_4 ║ 1 ║ 1 ║ 0.065 ║ 0.065 ║ ╚═══════════╩═════════╩═════════╩═══════╩════╩═══════╩═══════╝
这是我使用过的 2 个代码:
code1:
df=(df.set_index(['id', 'subject', 'class', 'hf'])
.assign(tpost=df[df['place']>1].groupby(['id', 'subject', 'class', 'hf'])['post'].first())
.reset_index())
code2:
df=(df[df['place']>1].groupby(['id', 'subject', 'class', 'hf']).post.transform('first')
.pipe(lambda x: df.assign(tpost = x))
.fillna(''))
orig df code:
data = {'id':['202006106','202006106','202006106','202007153','202007155','202009094',
'202009094','202009095','202010143','202010143','202010145'],
'subject':['B_Math','B_Math','B_Math','B_Math','B_Math','B_Math',
'B_Math','B_Math','B_Math','B_Math','B_Math'],
'class':['Class_4','Class_4','Class_4','Class_4','Class_4','Class_4',
'Class_4','Class_4','Class_4','Class_4','Class_4'],
'hf':[1,1,1,1,1,1,1,1,1,1,1],
'place':[4,6,7,4,6,2,8,4,2,7,1],
'post':[0.048,0.045,0.043,0.042,0.040,0.038,0.037,0.036,0.034,0.033,0.065]}
df = pd.DataFrame(data)
请指教。非常感谢
这是一种方法。您可以屏蔽位置 1 的值并在 groupby
+transform
:
group = df.groupby(['id', 'subject', 'class', 'hf']).ngroup()
new = df['post'].mask(df['place'].eq(1)).groupby(group).transform('first')
df['tpost'] = new.mask(new.isna(), df['post'])
输出:
id subject class hf place post tpost
0 202006106 B_Math Class_4 1 4 0.048 0.048
1 202006106 B_Math Class_4 1 6 0.045 0.048
2 202006106 B_Math Class_4 1 7 0.043 0.048
3 202007153 B_Math Class_4 1 4 0.042 0.042
4 202007155 B_Math Class_4 1 6 0.040 0.040
5 202009094 B_Math Class_4 1 2 0.038 0.038
6 202009094 B_Math Class_4 1 8 0.037 0.038
7 202009095 B_Math Class_4 1 4 0.036 0.036
8 202010143 B_Math Class_4 1 2 0.034 0.034
9 202010143 B_Math Class_4 1 7 0.033 0.034
10 202010145 B_Math Class_4 1 1 0.065 0.065