Removing Custom-Defined Words from List (Part II) - Python

Question

This is a continuation of my previous thread.

I have a df like this:

import pandas as pd

df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})

<OUT>
PageNumber   new_tags
   175       flower architecture people...
   162       hair red bobbles...
   576       sweets chocolate shop...

And another df, which will serve as the reference df (see below):

top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})

<OUT>
   ID      tag
   1       flower
   2       people
   3       chocolate

I am trying to remove values from the lists in df based on the values of the other df. The output I wish to obtain is:

<OUT> df
PageNumber   new_tags
   175       flower people
   576       chocolate

I tried an inner join, but unfortunately that did not work.

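So I resorted to tokenizing all the tags in both df columns, then tried to loop through each tag and keep only the values present in the reference df. Currently, it returns empty lists...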

df['tokenised_new_tags'] = filtered_new["new_tags"].astype(str).apply(nltk.word_tokenize)
topic_words['tokenised_top_words']= topic_words['tag'].astype(str).apply(nltk.word_tokenize)
df['top_word_tokens'] = [[t for t in tok_sent if t in topic_words['tokenised_top_words']] for tok_sent in df['tokenised_new_tags']]

Any help is much appreciated. Thank you!
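A likely cause of the empty lists, sketched under assumptions: the `in` operator applied to a pandas Series tests the index labels, not the stored values, so a string token never matches a default integer index. The Series `tokenised` and the set `vocabulary` below are illustrative names, not part of the original code.

```python
import pandas as pd

# Hypothetical miniature of the tokenised reference column:
# each cell holds a list of tokens.
tokenised = pd.Series([["flower"], ["people"], ["chocolate"]])

# `in` on a Series checks the index labels (0, 1, 2), not the values.
label_hit = 0 in tokenised          # True: 0 is an index label
value_hit = "flower" in tokenised   # False: "flower" is not an index label

# Flattening the token lists into a set gives a value-based lookup.
vocabulary = {tok for tokens in tokenised for tok in tokens}

tags = ["flower", "architecture", "people"]
kept = [t for t in tags if t in vocabulary]
print(label_hit, value_hit, kept)
```

Testing membership against a plain set (or list) of words, rather than against the Series itself, avoids the index-label comparison.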

Answer 1

How about this:

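def remove_custom_words(phrase, words_to_remove_list):
    # Split the phrase on spaces and drop every word found in words_to_remove_list
    return [elem for elem in phrase.split(' ') if elem not in words_to_remove_list]


# Each cell holds a one-element list, so x[0] is the phrase string
df['new_tags'] = df['new_tags'].apply(lambda x: remove_custom_words(x[0], top_words['tag'].to_list()))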

Basically, I apply the remove_custom_words function to each row of the dataset, which filters out the words contained in top_words['tag'].
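Note that the function in this answer drops the words found in top_words['tag'], whereas the desired output in the question keeps only those words and omits rows with no match. A minimal sketch of that keeping variant follows. It assumes the reference frame holds one tag per row, and the helper name `keep_reference_words` and the variables `allowed` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "PageNumber": [175, 162, 576],
    "new_tags": [["flower architecture people"],
                 ["hair red bobbles"],
                 ["sweets chocolate shop"]],
})
# Assumption: one tag per row in the reference frame.
top_words = pd.DataFrame({"ID": [1, 2, 3],
                          "tag": ["flower", "people", "chocolate"]})


def keep_reference_words(phrase, allowed):
    """Return the words of `phrase` that appear in `allowed`, joined by spaces."""
    return " ".join(word for word in phrase.split() if word in allowed)


# A set makes each membership test cheap and compares by value.
allowed = set(top_words["tag"])

result = df.copy()
# Each cell is a one-element list, so cell[0] is the phrase string.
result["new_tags"] = result["new_tags"].apply(
    lambda cell: keep_reference_words(cell[0], allowed)
)
# Drop rows where no reference word survived.
result = result[result["new_tags"] != ""].reset_index(drop=True)
print(result)
```

Swapping `word in allowed` for `word not in allowed` gives back the removing behaviour of the answer's function.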