Removing custom stop words from a phrase in Python
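I'm trying to remove certain phrases and words from user input before processing it further, but I keep running into an "index out of range" error and I'm completely stuck. How can I fix this?
I take the input phrase as a string and convert it to a list so I can compare each word, and I keep the stop words in a predefined list.
Example inputs: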
["well","you","know","the","weather","is","awful"]
["you", "know", "what", "i", "mean", "so", "just", "turn", "the", "lights", "on"]
# Gets user input, removes the selected stop words from it and returns the filtered phrase.
def stop_word_remover(phrase_list):
    stop_words_lst = ["yo", "so", "well", "um", "a", "the", "you know", "i mean"]
    # initialize clean phrase string
    clean_input_phrase = ""
    # copying phrase_list into a new variable for stop word removal.
    Copy_phrase_list = list(phrase_list)
    # Cleanup loop
    for i in range(1, len(phrase_list)):
        has_stop_words = False
        for x in range(len(stop_words_lst)):
            has_stop_words = False
            # if one of the stop words matches the word passed by the first main loop the flag is raised.
            if (phrase_list[i-1] + " " + phrase_list[i]) == stop_words_lst[x].strip():
                has_stop_words = True
            # this if statement adds the word of the phrase only if the flag is not raised thus making sure all the stop words are filtered out
            if has_stop_words == True:
                Copy_phrase_list.remove(Copy_phrase_list[i-1])
                Copy_phrase_list.remove(Copy_phrase_list[i-1])
    # first for loop takes the individual words of the phrase given and loops until the whole phrase goes through one word at a time
    for i in range(len(Copy_phrase_list)):
        # flag initialized for marking stop words
        has_stop_words = False
        # second loop takes all the stop words and compares them to the word passed on by the first loop to check for a stop word
        for x in range(len(stop_words_lst)):
            # if one of the stop words matches the word passed by the first main loop the flag is raised.
            if Copy_phrase_list[i] == stop_words_lst[x].strip():
                has_stop_words = True
        # this if statement adds the word of the phrase only if the flag is not raised thus making sure all the stop words are filtered out
        if has_stop_words == False:
            clean_input_phrase += str(Copy_phrase_list[i]) + " "
    return clean_input_phrase
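You need to split the stop-word list in two: one list for single words and another for phrases.
single_word_lst = ["yo", "so", "well", "um", "a", "the"]
stop_phrase_lst = ["you know", "i mean"]

def stop_word_remover(phrase_list):
    filtered = []
    index = 0
    while index < len(phrase_list):
        word = phrase_list[index]
        next_word = phrase_list[index + 1] if index + 1 < len(phrase_list) else None
        if next_word is not None and word + " " + next_word in stop_phrase_lst:
            index += 2  # skip both words of a two-word phrase
        elif word in single_word_lst:
            index += 1  # skip a single stop word
        else:
            filtered.append(word)
            index += 1
    return " ".join(filtered)
Skipping matched words while building a new list avoids deleting from a list that is being iterated over. Then you join the filtered list into a string and return it.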
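For context, the "index out of range" error in the question most likely comes from indexing the shrinking copy with positions taken from the original list. Below is a minimal sketch of that effect, using a hypothetical four-word input and the hypothetical names `words` and `copy_words`; it mirrors the question's cleanup loop but is not the asker's exact code.

```python
# Hypothetical input containing two stop phrases back to back.
words = ["you", "know", "i", "mean"]
copy_words = list(words)
error = None

try:
    # Indices run over the ORIGINAL length, but copy_words gets shorter.
    for i in range(1, len(words)):
        pair = words[i - 1] + " " + words[i]
        if pair in ("you know", "i mean"):
            # After the first phrase is removed, position i - 1 no longer
            # refers to the same word in copy_words (or to any word at all).
            copy_words.remove(copy_words[i - 1])
            copy_words.remove(copy_words[i - 1])
except IndexError as exc:
    error = exc

print(type(error).__name__, copy_words)
```

Once the first phrase has been removed, the copy holds only two items, so looking up position 2 for the second phrase fails. Building a new list of kept words, rather than deleting by position, sidesteps the problem.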
Use the regular-expression substitution function.
Replace every match with an empty string.
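import re

stop_words_lst = ['yo', 'so', 'well', 'um', 'a', 'the', 'you know', 'i mean']
s = "you know what i mean so just turn the lights on"

for w in stop_words_lst:
    # \b anchors keep 'a' or 'so' from matching inside longer words.
    pattern = r'\b' + re.escape(w) + r'\b'
    s = re.sub(pattern, '', s)

# Collapse the extra spaces left behind by the removals.
s = ' '.join(s.split())
print(s)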
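The same idea can also be written as a single compiled pattern instead of one substitution per stop word. This is a sketch rather than part of the original answer: the function name `remove_stop_words` is hypothetical, and sorting the entries longest-first is an added assumption so that a phrase such as "you know" is tried before any shorter alternative.

```python
import re

stop_words_lst = ['yo', 'so', 'well', 'um', 'a', 'the', 'you know', 'i mean']


def remove_stop_words(text, stop_words=stop_words_lst):
    """Remove every stop word or stop phrase from text (hypothetical helper)."""
    # Longest entries first so multi-word phrases win over shorter alternatives.
    ordered = sorted(stop_words, key=len, reverse=True)
    # re.escape guards against stop words that contain regex metacharacters.
    alternation = '|'.join(re.escape(w) for w in ordered)
    pattern = re.compile(r'\b(?:' + alternation + r')\b')
    # Replace matches with a space, then collapse runs of whitespace.
    return ' '.join(pattern.sub(' ', text).split())


print(remove_stop_words("you know what i mean so just turn the lights on"))
# prints: what just turn lights on
```

With the list above the ordering makes no practical difference, but it would matter if a single word such as "you" were later added alongside "you know".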