Pandas: Groupby and select records whose IDs appear x times, where n < x < N
In this DataFrame, I need to select the records whose UserId appears x times in the dataset, where 2 < x < 4:
import pandas as pd

d = {"UserId": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4], "review": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "k"]}
f = pd.DataFrame(d)
   UserId review
0       1      a
1       2      b
2       2      c
3       3      d
4       3      e
5       3      f
6       4      g
7       4      h
8       4      i
9       4      k
Selecting records with a single condition works:
f[f.groupby("UserId")["UserId"].transform('size') > 2]
   UserId review
3       3      d
4       3      e
5       3      f
6       4      g
7       4      h
8       4      i
9       4      k
But I can't get it to work when the number of occurrences has to fall within an interval. This does not work:
def check_size(x):
    return 2 < len(x) < 4

f['cnt'] = f.groupby('UserID')['UserID'].transform(check_size('size'))
It fails with:
...
...
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
860 in_axis, level, gpr = False, gpr, None
861 else:
--> 862 raise KeyError(gpr)
863 elif isinstance(gpr, Grouper) and gpr.key is not None:
864 # Add key to exclusions
KeyError: 'UserID'
Use between:
out = f[f.groupby("UserId")["UserId"].transform('size')
.between(2, 4, inclusive='neither')]
print(out)
# Output
   UserId review
3       3      d
4       3      e
5       3      f
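For reference, the attempt in the question fails for two separate reasons: the column is spelled UserId, not UserID (hence the KeyError), and check_size('size') calls the function immediately on the string 'size' instead of passing a callable. A minimal sketch of a callable-based alternative follows, assuming groupby(...).filter is acceptable here. It swaps in a different technique from the between approach above, and the name out_filter is only illustrative.

```python
import pandas as pd

d = {"UserId": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
     "review": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "k"]}
f = pd.DataFrame(d)


def check_size(group):
    # True when the group has strictly more than 2 and fewer than 4 rows
    return 2 < len(group) < 4


# Pass the function itself (no parentheses); filter calls it once per group
# and keeps all rows of the groups for which it returns True.
out_filter = f.groupby("UserId").filter(check_size)
print(out_filter)
```

filter runs a Python function per group, so on large frames the vectorized transform('size') approach is likely faster.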
Update:
How to add a cnt column, so that f['cnt'] holds the number of occurrences of each Id?
out = f.assign(cnt=f.groupby("UserId")["UserId"].transform('size')) \
.loc[lambda x: x['cnt'].between(2, 4, inclusive='neither')]
# OR
out = f.assign(cnt=f.groupby("UserId")["UserId"].transform('size')) \
.query("cnt.between(2, 4, inclusive='neither')")
Output:
>>> out
   UserId review  cnt
3       3      d    3
4       3      e    3
5       3      f    3
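The title asks about the general case n < x < N, so the same idea can be wrapped in a small helper. This is only a sketch: select_by_count and its parameter names are hypothetical, and it uses plain comparisons instead of between so that it does not depend on the string form of the inclusive argument.

```python
import pandas as pd


def select_by_count(df, col, n, N):
    """Keep rows whose value in `col` occurs x times, with n < x < N."""
    # Per-row group size, aligned with the original index
    cnt = df.groupby(col)[col].transform("size")
    # Strict bounds on both sides
    mask = (cnt > n) & (cnt < N)
    return df.assign(cnt=cnt)[mask]


d = {"UserId": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
     "review": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "k"]}
f = pd.DataFrame(d)

out = select_by_count(f, "UserId", 2, 4)   # only IDs seen exactly 3 times
wide = select_by_count(f, "UserId", 1, 4)  # IDs seen 2 or 3 times
print(out)
print(wide)
```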
Use between to select between two values (both bounds are inclusive by default):
f[f.groupby('UserId')['UserId'].transform('size').between(3, 5)]
Output:
   UserId review
3       3      d
4       3      e
5       3      f
6       4      g
7       4      h
8       4      i
9       4      k
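The two answers differ only in how the bounds are treated. The sketch below compares the four string values that Series.between accepts for inclusive, assuming pandas 1.3 or later (older versions take a boolean instead). The results dictionary is just an illustrative way to collect which group sizes survive each setting.

```python
import pandas as pd

d = {"UserId": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
     "review": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "k"]}
f = pd.DataFrame(d)

# Number of rows per UserId, repeated on every row of that UserId
cnt = f.groupby("UserId")["UserId"].transform("size")

results = {}
for option in ("both", "neither", "left", "right"):
    mask = cnt.between(2, 4, inclusive=option)
    # Record which group sizes pass the mask for this option
    results[option] = sorted(cnt[mask].unique().tolist())

print(results)
# {'both': [2, 3, 4], 'neither': [3], 'left': [2, 3], 'right': [3, 4]}
```

For the strict condition 2 < x < 4 in the question, 'neither' is the setting that matches.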