How to check if a list item within a nested list exists in a set?

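I have a nested list containing each sentence of a corpus, and a set of all the words that appear more than once. How can I check whether each word in the nested list is in that set? I then need to replace every word that appears more than once with the string '<UNK>'.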

I tried:

for sent in tokenized_sents:
    for word in sent:
        if word in set:
           word = '<UNK>'

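You can use collections.Counter. Note that assigning to word inside the loop only rebinds the loop variable and never modifies the list, so the replacement has to be written back by index.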

Create a dictionary that tracks how many times each word occurs in the corpus:
from collections import Counter

corpus = [['Hello', ',', 'my', 'name', 'is', 'Walter'], ['I', 'like', 'my', 'cats']]

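# Flatten the nested list so every word can be counted in one pass
corpus_unnested = []
for sentence in corpus:
    corpus_unnested += sentence
my_dict = Counter(corpus_unnested)

# Write the replacement back by index; rebinding `word` would not change the list
for i, sentence in enumerate(corpus):
    for j, word in enumerate(sentence):
        if my_dict[word] > 1:
            corpus[i][j] = '<UNK>'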
print(corpus)
# [['Hello', ',', '<UNK>', 'name', 'is', 'Walter'], ['I', 'like', '<UNK>', 'cats']]
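If you would rather keep the set-membership test from the question, here is a minimal sketch of the same idea that builds the set of repeated words first and then returns a new nested list instead of modifying the original in place. The function name replace_repeated and the unk parameter are made up for illustration; tokenized_sents is the variable name from the question.

```python
from collections import Counter
from itertools import chain


def replace_repeated(sentences, unk="<UNK>"):
    """Return a new nested list where every word that occurs
    more than once in the whole corpus is replaced by `unk`."""
    # Count every word across all sentences without building a flat list
    counts = Counter(chain.from_iterable(sentences))
    # The set of words that appear more than once
    repeated = {word for word, n in counts.items() if n > 1}
    # Rebuild each sentence; membership tests on a set are O(1) on average
    return [[unk if word in repeated else word for word in sent]
            for sent in sentences]


tokenized_sents = [['Hello', ',', 'my', 'name', 'is', 'Walter'],
                   ['I', 'like', 'my', 'cats']]
result = replace_repeated(tokenized_sents)
print(result)
```

Because this builds new lists, tokenized_sents is left untouched, which is handy if you still need the original tokens afterwards.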