nltk only processing last string in txt file
I have a .txt file containing four strings, all separated by newlines.
When I tokenize the file, it processes every line of data, which is perfect.
However, when I try to remove the stop words from the file, it only removes them from the last string.
I want to process everything in the file, not just the last sentence.
My code:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

with open('example.txt') as fin:
    for tkn in fin:
        print(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))
words = word_tokenize(tkn)
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)
Output:
['this', 'is', 'an', 'example', 'of', 'how', 'stop', 'words', 'are', 'utilized', 'in', 'natural', 'language', 'processing', '.']
[]
['drive', 'driver', 'driving', 'driven']
[]
['smile', 'smiling', 'smiled']
[]
['there', 'are', 'multiple', 'words', 'here', 'that', 'you', 'should', 'be', 'able', 'to', 'use', 'for', 'lemmas/synonyms', '.']
STOP WORDS REMOVED: ['multiple', 'words', 'able', 'use', 'lemmas/synonyms', '.']
As shown above, it only removes stop words from the last line.
Edit:
My txt file contents:
this is an example of how stop words are utilized in natural language processing.
A driver goes on a drive while being driven mad. He is sick of driving.
smile smiling smiled
there are multiple words here that you should be able to use for lemmas/synonyms.
Consider merging the stop-word removal into the readline loop, like so:
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

with open("d:/example.txt") as the_file:
    for each_line in the_file:
        print(nltk.word_tokenize(each_line))
        words = nltk.word_tokenize(each_line)
        stp_words_removed = []
        for word in words:
            if word not in stop_words:
                stp_words_removed.append(word)
        print("STOP WORDS REMOVED: ", stp_words_removed)
From your description, it sounds like only the last line is being fed into the stop-word remover. What I don't understand is that, if that were the case, you shouldn't be getting all those empty lists.
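One plausible source of those empty lists (an assumption based on the posted output): if the four strings are separated by blank lines, tokenizing each whitespace-only separator line yields an empty token list, e.g.:

from nltk.tokenize import word_tokenize

print(word_tokenize("smile smiling smiled\n"))  # ['smile', 'smiling', 'smiled']
print(word_tokenize("\n"))                      # [] -- a blank separator line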
You need to collect the results of word_tokenize into a single list and then process that list. In your example, you are only grabbing the last line of the file, after the iteration has finished.
Try:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

words = []
with open('example.txt') as fin:
    for tkn in fin:
        if tkn.strip():  # skip blank separator lines
            # extend, not append: collect individual tokens, not per-line lists
            words.extend(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)