Remove lines from a text file if the line consists solely of a stopword
I want to remove lines from Myfile.txt only when the line consists of nothing but one of the stopwords.
For example, a sample of Myfile.txt is:
Adh Dhayd
Abu Dhabi is # "is" is a stopword, but this line should not be removed because it contains more than the stopword: Abu Dhabi is
Zaranj
of # this line contains only a stopword, so it should be removed
on # this line contains only a stopword, so it should be removed
Taloqan
Shnan of # "of" is a stopword, but this line should not be removed because it contains more than the stopword: Shnan of
is # this line contains only a stopword, so it should be removed
Shibirghn
Shahrak
from # this line contains only a stopword, so it should be removed
I am using this code as an example:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
# List-comprehension form
filtered_sentence = [w for w in word_tokens if w not in stop_words]
# Equivalent explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
So what would the solution code be for the Myfile.txt described above?
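You can check whether the whole line matches any of the stopwords and, if it does not, append it to the filtered content. That covers the case where you want to filter out every line that consists of exactly one stop word. If lines made up of several stopwords should be filtered as well, tokenize the line and take the intersection with stop_words: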
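from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# Keep every line that is not, on its own, exactly one stopword.
with open("test.txt", "r") as f:
    filtered_content = [line for line in f.read().splitlines()
                        if line.strip() not in stop_words]
# "w" overwrites any previous output, so repeated runs do not accumulate lines.
with open("test_filter.txt", "w") as g:
    g.write("\n".join(filtered_content))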
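If you also want to remove lines made up of several stopwords, use this if statement instead. It drops lines that contain only stopwords; if any word on the line is not a stopword, the line is kept:
# requires: from nltk.tokenize import word_tokenize
# The subset test also handles a stopword repeated on one line, e.g. "of of".
if not set(word_tokenize(line)).issubset(stop_words):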
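To see the stopword-only check applied to the sample from the question, here is a minimal sketch. It makes a few assumptions that are not in the answer above: the function name drop_stopword_only_lines is hypothetical, a plain whitespace split stands in for word_tokenize, tokens are lowercased before comparison, and a tiny hand-written stopword set replaces the NLTK list so the snippet runs without downloading any corpus.

```python
def drop_stopword_only_lines(lines, stop_words):
    """Return the lines that contain at least one non-stopword token."""
    kept = []
    for line in lines:
        # Assumption: a whitespace split stands in for word_tokenize here.
        tokens = line.lower().split()
        # Keep blank lines and any line with a token outside stop_words.
        if not tokens or not set(tokens) <= stop_words:
            kept.append(line)
    return kept


# Hypothetical miniature stopword set; with NLTK installed you would pass
# set(stopwords.words('english')) instead.
demo_stop_words = {"is", "of", "on", "from"}

sample = ["Adh Dhayd", "Abu Dhabi is", "Zaranj", "of", "on", "Taloqan",
          "Shnan of", "is", "Shibirghn", "Shahrak", "from"]

result = drop_stopword_only_lines(sample, demo_stop_words)
print(result)
# ['Adh Dhayd', 'Abu Dhabi is', 'Zaranj', 'Taloqan', 'Shnan of', 'Shibirghn', 'Shahrak']
```

Unlike the if statement above, this sketch keeps blank lines; drop the "not tokens" branch if empty lines should be removed too.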