How to treat certain words as delimiters in nltk Python?
I am trying to tokenize the following text using the stopwords ('is', 'the', 'was') as delimiters.
The expected output is this:
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'best person I know']
This is the code I wrote to try to produce the output above:
import nltk
stopwords = ['is', 'the', 'was']
sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")
sents_rm_stopwords = []
for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))
The output of my code is this:
['Walter feeling anxious .',
'He diagnosed today .',
'He probably best person I know .']
How can I get the expected output?
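So this problem involves both stopwords and sentence delimiters. Assuming a sentence ends with the symbol . you can use re.split() to split on several delimiters at once: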
import re
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
result = re.split(r" was | is | the |\. |\.", s)
result
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know',
'']
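Because the text ends with a delimiter (the final .), re.split() returns an extra empty string '' as the last item. Note also that the last phrase still begins with 'the': the space in front of it was already consumed by the ' is ' match, so ' the ' can no longer match there. Assuming the sentence structure is consistent, you can slice the result to drop the trailing empty string: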
result[:-1]
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know']
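The result above still keeps 'the' in the last item, which differs from the expected output in the question. A minimal sketch of one way to handle that, assuming it is acceptable to treat a run of consecutive stopwords as a single delimiter and to drop empty pieces instead of slicing: the function name split_on_stopwords and the STOPWORDS constant are hypothetical names introduced here, not part of nltk or of the original answer, and matching is case-sensitive.

```python
import re

# Hypothetical helper names; the stopword list is the one from the question.
STOPWORDS = ["is", "the", "was"]

def split_on_stopwords(text, stopwords=STOPWORDS):
    """Split text on whole-word stopwords and on periods."""
    words = "|".join(re.escape(w) for w in stopwords)
    # First alternative: one or more consecutive stopwords, matched as whole
    # words (\b) together with the whitespace around them.
    # Second alternative: a period with any surrounding whitespace.
    pattern = rf"\s*(?:\b(?:{words})\b\s*)+|\s*\.\s*"
    # re.split() leaves an empty string when the text starts or ends with a
    # delimiter, so empty pieces are filtered out.
    return [piece for piece in re.split(pattern, text) if piece]

s = ("Walter was feeling anxious. He was diagnosed today. "
     "He probably is the best person I know.")
print(split_on_stopwords(s))
```

Using \b instead of literal spaces means that "is the" is consumed as one delimiter, and a stopword embedded in a longer word (for example the "is" in "this") is not treated as a delimiter.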