Can one drop rows in a dataframe based on nunique values?
I want to drop the rows whose occupation has fewer than 2 unique names:
name value occupation
a 23 mechanic
a 24 mechanic
b 30 mechanic
c 40 mechanic
c 41 mechanic
d 30 doctor
d 20 doctor
e 70 plumber
e 71 plumber
f 30 plumber
g 50 tailor
I tried:
df.groupby('occupation')['name'].nunique()
>>>>>>
occupation
mechanic 3
doctor 1
plumber 2
tailor 1
Name: name, dtype: int64
Is it possible to use something like df = df.drop(df[<some boolean condition>].index)?
Expected output:
name value occupation
a 23 mechanic
a 24 mechanic
b 30 mechanic
c 40 mechanic
c 41 mechanic
e 70 plumber
e 71 plumber
f 30 plumber
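Use GroupBy.transform with 'nunique' to broadcast each occupation's unique-name count back onto its rows, then Series.ge to keep the rows where that count is greater than or equal to 2: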
df = df[df.groupby('occupation')['name'].transform('nunique').ge(2)]
print (df)
name value occupation
0 a 23 mechanic
1 a 24 mechanic
2 b 30 mechanic
3 c 40 mechanic
4 c 41 mechanic
7 e 70 plumber
8 e 71 plumber
9 f 30 plumber
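For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same approach. The DataFrame is rebuilt by hand from the table in the question, and the variable names unique_names and filtered are only illustrative:

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "name": list("aabccddeefg"),
    "value": [23, 24, 30, 40, 41, 30, 20, 70, 71, 30, 50],
    "occupation": ["mechanic"] * 5 + ["doctor"] * 2 + ["plumber"] * 3 + ["tailor"],
})

# For every row, the number of distinct names within that row's occupation.
unique_names = df.groupby("occupation")["name"].transform("nunique")

# Keep only rows whose occupation has at least 2 distinct names.
filtered = df[unique_names.ge(2)]
print(filtered)
```

Because transform returns a Series aligned with the original index, it can be used directly as a boolean mask without any merge or lookup.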
Your solution also works if you filter the index values of the resulting Series and test the occupation column against them with Series.isin:
s = df.groupby('occupation')['name'].nunique()
df = df[df['occupation'].isin(s[s.ge(2)].index)]
print (df)
name value occupation
0 a 23 mechanic
1 a 24 mechanic
2 b 30 mechanic
3 c 40 mechanic
4 c 41 mechanic
7 e 70 plumber
8 e 71 plumber
9 f 30 plumber
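As for the df.drop(df[<some boolean condition>].index) form asked about in the question, it should work too if the condition is inverted to select the rows to remove. The sketch below shows that variant, plus GroupBy.filter as another option; the sample data is rebuilt by hand from the question, and the names to_drop, dropped and kept are only illustrative:

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "name": list("aabccddeefg"),
    "value": [23, 24, 30, 40, 41, 30, 20, 70, 71, 30, 50],
    "occupation": ["mechanic"] * 5 + ["doctor"] * 2 + ["plumber"] * 3 + ["tailor"],
})

# Rows to remove: occupations with fewer than 2 distinct names.
to_drop = df.groupby("occupation")["name"].transform("nunique").lt(2)
dropped = df.drop(df[to_drop].index)

# Alternative: keep whole groups that pass a per-group test.
kept = df.groupby("occupation").filter(lambda g: g["name"].nunique() >= 2)

print(dropped)
print(kept)
```

Both variants give the same rows as the boolean-mask solution. The filter version calls a Python function once per group, so it may be slower than transform on data with many groups.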