Removing a custom list of stopwords for an NLP task
I wrote a function to clean my text corpus, which has the following form:
["wild things is a suspenseful .. twists . ",
"i know it already.. film goers . ",
.....,
"touchstone pictures..about it . okay ? "]
It is a list of sentences, separated by commas.
My function is:
def clean_sentences(sentences):
    sentences = (re.sub(r'\d+','£', s) for s in sentences)
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is' , 'it']
    sentences = ' '.join(w for w in sentences if w not in stopwords)
    return sentences
It replaces the digits with '£', but it does not remove the stopwords.
Output:
'wild things is a suspenseful thriller...
and a £ . £ rating , it\'s still watchable , just don\'t think about it . okay ? '
I don't understand why.
Thanks.
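I believe this is because you are using a regex to replace numbers with the symbol £ in your code. To clarify:
    sentences = (re.sub(r'\d+','£', s) for s in sentences)
This line replaces any number with that symbol. I see that you define a list of stopwords and then build a new list without them. However, the symbol £ that you substitute for the numbers is not in the stopword list, so it is not excluded from the new list. You can try adding it to the stopword list, like this: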
def clean_sentences(sentences):
    sentences = (re.sub(r'\d+','£', s) for s in sentences)
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is' , 'it', '£']
    sentences = ' '.join(w for w in sentences if w not in stopwords)
    return sentences
Hope this helps!
Edit:
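I also believe there may be a problem with your original code. It looks like you are trying to use sentences = ' '.join(w for w in sentences if w not in stopwords) to join your sentences and remove all the stopwords. However, that is not how the not in operator works. not in only checks whether one specific item equals an element of the list, and here each item w is a whole sentence rather than a word. So it never removes anything, because it cannot detect a stopword inside a whole sentence. What you want to do is first split each sentence into its words, filter those words, and then build the new sentence with the same .join approach you already use. That lets not in check each word and drop it when it is a stopword.
You are comparing whole sentences to the stopwords, when you actually want to compare the words within each sentence to the stopwords.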
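To make the difference concrete, here is a minimal sketch. The two sample sentences and the shortened stopword list are made up for illustration and are not taken from the question's corpus:

```python
stopwords = ['a', 'is', 'it']
sentences = ["wild things is a thriller", "i know it already"]

# Iterating over the list yields whole sentences. No sentence is equal
# to a stopword, so the filter keeps everything.
kept_sentences = [s for s in sentences if s not in stopwords]
print(kept_sentences == sentences)  # True

# Splitting first yields individual words, which can match a stopword.
kept_words = [w for w in sentences[0].split() if w not in stopwords]
print(kept_words)  # ['wild', 'things', 'thriller']
```

The first filter is what the original function effectively does; the second is the per-word check it needs.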
import re
sentences = ["wild things is a suspenseful .. twists . ",
"i know it already.. film goers . ",
"touchstone pictures..about it . okay ? "]
stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is', 'it']
As a loop:
def clean_sentences(sentences):
    new_sentences = []
    for sentence in sentences:
        # Split the sentence into words so each word can be checked on its own
        new_sentence = sentence.split()
        # Replace any run of digits with '£'
        new_sentence = [re.sub(r'\d+', '£', word) for word in new_sentence]
        # Drop the words that are stopwords
        new_sentence = [word for word in new_sentence if word not in stopwords]
        new_sentence = " ".join(new_sentence)
        new_sentences.append(new_sentence)
    return new_sentences
Or, more compactly, as a list comprehension:
def clean_sentences(sentences):
    return [" ".join([re.sub(r'\d+', '£', word) for word in sentence.split() if word not in stopwords]) for sentence in sentences]
Both return:
print(clean_sentences(sentences))
> ['wild things suspenseful .. twists .', 'i know already.. film goers .', 'touchstone pictures..about . okay ?']
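One possible refinement, which is not part of the answers above, is to keep the stopwords in a set for faster membership tests and to lowercase each word before comparing it. The sketch below assumes whitespace tokenization is good enough; the names STOPWORDS and clean_sentences_set and the sample input are hypothetical:

```python
import re

# Same stopwords as in the question, stored as a set (hypothetical name)
STOPWORDS = {'a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself',
             'him', 'himself', 'is', 'it'}

def clean_sentences_set(sentences):
    cleaned = []
    for sentence in sentences:
        # Compare the lowercased word so that 'It' is treated like 'it'
        words = [w for w in sentence.split() if w.lower() not in STOPWORDS]
        # Replace any run of digits with '£' after the stopwords are gone
        cleaned.append(re.sub(r'\d+', '£', " ".join(words)))
    return cleaned

result = clean_sentences_set(["It is a 10 out of 10 .", "i know it already"])
print(result)  # ['£ out of £ .', 'i know already']
```

Tokens with punctuation attached, such as "it's" or "already..", still will not match a stopword under this approach; handling them would need a real tokenizer or extra punctuation stripping.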