Keep only matched words in pandas column

I want to keep only the words that appear in a list; all other words should be removed. (pandas DataFrame)

cuisine_list = ['breakfast', 'american', 'tea', 'chicken']
name           cuisine
dominos pizza  breakfast american tea dine in
kfc            american chicken play area

The result should look like this:

name           cuisine
dominos pizza  breakfast american tea
kfc            american chicken

I am using the following code, but it takes a long time.

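# Work on a copy so the assignment below does not hit a slice of file1
file1_cuisine = file1[["Cuisine"]].copy()

for index, row in file1_cuisine.iterrows():
    words_to_keep = []
    for word in row["Cuisine"].split(' '):
        # keep the word only if it is in the list defined above
        if word in cuisine_list:
            words_to_keep.append(word + ' ')
    file1_cuisine.loc[index, 'final_input_text'] = ''.join(words_to_keep)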

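Use set intersection with the & operator, together with Series.str.split and Series.apply. Note that sets are unordered, so the kept words may not stay in their original order: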

In [760]: y = set(cuisine_list)
In [766]: df['cuisine'] = df['cuisine'].str.split().apply(lambda x: list(set(x) & y)).str.join(',')
    
In [767]: df
Out[767]: 
            name                 cuisine
0  dominos pizza  tea,american,breakfast
1            kfc        chicken,american

Use a lambda function with split and set intersection, then join the values with a comma:

cuisine_list = ['breakfast', 'american', 'tea', 'chicken']
df['cuisine'] = df['cuisine'].apply(lambda x: ','.join(set(x.split()).intersection(cuisine_list)))

print (df)
            name                 cuisine
0  dominos pizza  tea,breakfast,american
1            kfc        chicken,american
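
The set-based versions above drop duplicates and do not keep the original word order, and they join with commas rather than spaces. If the output has to match the expected result in the question exactly, a rough sketch along these lines might work. The helper name keep_matched and the sample frame are assumptions made for illustration, not part of the original answers:

```python
import pandas as pd

cuisine_list = ['breakfast', 'american', 'tea', 'chicken']
keep = set(cuisine_list)  # a set makes each membership test cheap

# Sample data rebuilt from the question (column names assumed lower-case)
df = pd.DataFrame({
    'name': ['dominos pizza', 'kfc'],
    'cuisine': ['breakfast american tea dine in',
                'american chicken play area'],
})

def keep_matched(text):
    # Hypothetical helper: walk the words in their original order
    # and keep only those present in the list.
    return ' '.join(word for word in text.split() if word in keep)

df['cuisine'] = df['cuisine'].apply(keep_matched)
print(df)
```

This still runs one Python call per row, but it avoids iterrows and the repeated .loc assignment, which are likely the slow parts of the original loop.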

Or use Series.str.findall:

cuisine_list = ['breakfast', 'american', 'tea', 'chicken']

pat = '|'.join(r"\b{}\b".format(x) for x in cuisine_list)
df['cuisine'] = df['cuisine'].str.findall(pat).str.join(',')

print (df)
            name                 cuisine
0  dominos pizza  breakfast,american,tea
1            kfc        american,chicken
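
If the words in the list could contain regex metacharacters (for example "+" or "."), the pattern may need escaping. The following is a minimal sketch of the same findall idea, assuming re.escape is applied to each word and the matches are joined with spaces to reproduce the expected output from the question; the sample frame is rebuilt here only so the snippet runs on its own:

```python
import re
import pandas as pd

cuisine_list = ['breakfast', 'american', 'tea', 'chicken']

# Sample data rebuilt from the question (column names assumed lower-case)
df = pd.DataFrame({
    'name': ['dominos pizza', 'kfc'],
    'cuisine': ['breakfast american tea dine in',
                'american chicken play area'],
})

# Escape every word, then wrap the alternation in a single
# non-capturing group bounded by \b so only whole words match.
pat = r'\b(?:' + '|'.join(re.escape(w) for w in cuisine_list) + r')\b'

# findall returns the matches in the order they occur in each string
df['cuisine'] = df['cuisine'].str.findall(pat).str.join(' ')
print(df)
```

Because findall scans each string left to right, the kept words stay in their original order, unlike the set-based approaches.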