Find all values that satisfy multiple keys in a Pandas groupby

Question

I built a DataFrame from a spaCy document created from some text, like this:

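import pandas as pd
import spacy

# Assumption: the question does not show which model was loaded; any English pipeline would do
nlp = spacy.load('en_core_web_sm')

test='We walked the walk and still walk it today. Walking brings us great joy.'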
tokens=[]
lemma=[]
pos=[]

df=pd.DataFrame()

doc=nlp(test)
for t in doc:
    tokens.append(t.text)
    lemma.append(t.lemma_)
    pos.append(t.pos_)
df['tokens']=tokens
df['lemma']=lemma
df['pos']=pos

df

     tokens   lemma    pos
0        We  -PRON-   PRON
1    walked    walk   VERB
2       the     the    DET
3      walk    walk   NOUN
4       and     and  CCONJ
5     still   still    ADV
6      walk    walk   VERB
7        it  -PRON-   PRON
8     today   today   NOUN
9         .       .  PUNCT
10  Walking    walk   VERB
11   brings   bring   VERB
12       us  -PRON-   PRON
13    great   great    ADJ
14      joy     joy   NOUN
15        .       .  PUNCT

I grouped by ('lemma', 'pos'):

groups_multiple=df.groupby(['lemma','pos'])

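I want to find all lemmas that have both 'VERB' and 'NOUN' as pos. I tried using .apply() and .filter(), but I am not good with them.

For example, the lemma 'walk' qualifies because it has both 'VERB' and 'NOUN' in the 'pos' column.

How can I do this?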

Addendum:

In the end I did it the clumsy way: intersecting the set of verb lemmas with the set of noun lemmas.

Here is my code:

lemma_v=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='VERB')
lemma_n=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='NOUN')

lemma_vn=list(lemma_v & lemma_n)

This is very inefficient, but I don't know a better way. Does anyone have ideas for improving it?

Answer 1

Use groupby with transform to build a boolean mask and select the matching rows:

# custom function: True when a lemma's set of 'pos' tags is exactly {'VERB', 'NOUN'}
is_verb_and_noun = lambda x: set(x) == set(['VERB', 'NOUN'])

out = df.loc[df.groupby('lemma')['pos'].transform(is_verb_and_noun), 'lemma']
print(out)

# Output:
1     walk
3     walk
6     walk
10    walk
Name: lemma, dtype: object

Final output:

>>> out.unique().tolist()
['walk']
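
Note that the check above uses set equality, so a lemma that also carries a third tag would not match. If "has both" should mean "has at least VERB and NOUN", a subset test is one possible variant. The following is only a sketch under assumptions: the frame is built by hand instead of from spaCy, the 'run' rows are hypothetical and exist only to show a lemma with an extra tag, and the names `required`, `at_least_both`, and `exactly_both` are illustrative.

```python
import pandas as pd

# Hand-built stand-in for the spaCy-derived frame.
# The 'run' rows are hypothetical: they add a lemma with a third tag.
df = pd.DataFrame({
    "lemma": ["walk", "walk", "walk", "joy", "run", "run", "run"],
    "pos":   ["VERB", "NOUN", "VERB", "NOUN", "VERB", "NOUN", "ADJ"],
})

required = {"VERB", "NOUN"}

# Keep groups whose tags include every required tag (other tags allowed).
at_least_both = (
    df.groupby("lemma")
      .filter(lambda g: required.issubset(g["pos"]))["lemma"]
      .unique()
      .tolist()
)

# Keep groups whose tags are exactly the required set (same test as the answer).
exactly_both = (
    df.groupby("lemma")
      .filter(lambda g: set(g["pos"]) == required)["lemma"]
      .unique()
      .tolist()
)

print(at_least_both)
print(exactly_both)
```

Here `groupby(...).filter(...)` drops whole groups for which the function returns False, so it avoids iterating over the groups by hand as in the set-intersection approach from the question.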
