根据列表从 pandas 系列中删除停用词
Removing stopwords from a pandas series based on list
我有以下名为句子的数据框
data = ["Home of the Jacksons"], ["Is it the real thing?"], ["What is it with you?"], [ "Tomatoes are the best"] [ "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
还有一个名为停用词的数据框:
data = [["the"], ["it"], ["best"], [ "is"]]
stopwords = pd.DataFrame(data, columns = ['word'])
我想从句子 [“句子”] 中删除所有停用词。我尝试了下面的代码,但它不起作用。我认为我的 if 语句有问题。有人可以帮忙吗?
Def remove_stopwords(input_string, stopwords_list):
stopwords_list = list(stopwords_list)
my_string_split = input_string.split(' ')
my_string = []
for word in my_string_split:
if word not in stopwords_list:
my_string.append(word)
my_string = " ".join(my_string)
return my_string
sentence['cut_string']= sentence.apply(lambda row: remove_stopwords(row['sentence'], stopwords['word']), axis=1)
当我应用该函数时,它只是 returns 句子中的第一个或前几个字符串,但根本没有删除停用词。有点卡在这里
您可以将停用词转换为列表,并使用列表理解从句子中删除这些词,
stopword_list = stopwords['word'].tolist()
sentences['filtered] = sentences['sentence'].apply(lambda x: ' '.join([i for i in x.split() if i not in stopword_list]))
你得到
0 Home of Jacksons
1 Is real thing?
2 What with you?
3 Tomatoes are
4 I think it's to path ways now
或者您可以将代码包装在一个函数中,
def remove_stopwords(input_string, stopwords_list):
my_string = []
for word in input_string.split():
if word not in stopwords_list:
my_string.append(word)
return " ".join(my_string)
stopword_list = stopwords['word'].tolist()
sentences['sentence'].apply(lambda row: remove_stopwords(row, stopword_list))
您上面的代码中有很多语法错误。如果您将停用词保留为列表(或集合)而不是 DataFrame,则以下内容将起作用 -
data = ["Home of the Jacksons", "Is it the real thing?", "What is it with you?", "Tomatoes are the best", "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
stopwords = ["the", "it", "best", "is"]
sentences.sentence.str.split().apply(lambda x: " ".join([y for y in x if y not in stopwords]))
成功的关键是将停用词列表转换为 set()
:集的查找时间为 O(1),而列表的时间为 O(N)。
stop_set = set(stopwords.word.tolist())
sentences.sentence.str.split()\
.apply(lambda x: ' '.join(w for w in x if w not in stop_set))
我有以下名为句子的数据框
data = ["Home of the Jacksons"], ["Is it the real thing?"], ["What is it with you?"], [ "Tomatoes are the best"] [ "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
还有一个名为停用词的数据框:
data = [["the"], ["it"], ["best"], [ "is"]]
stopwords = pd.DataFrame(data, columns = ['word'])
我想从句子 [“句子”] 中删除所有停用词。我尝试了下面的代码,但它不起作用。我认为我的 if 语句有问题。有人可以帮忙吗?
Def remove_stopwords(input_string, stopwords_list):
stopwords_list = list(stopwords_list)
my_string_split = input_string.split(' ')
my_string = []
for word in my_string_split:
if word not in stopwords_list:
my_string.append(word)
my_string = " ".join(my_string)
return my_string
sentence['cut_string']= sentence.apply(lambda row: remove_stopwords(row['sentence'], stopwords['word']), axis=1)
当我应用该函数时,它只是 returns 句子中的第一个或前几个字符串,但根本没有删除停用词。有点卡在这里
您可以将停用词转换为列表,并使用列表理解从句子中删除这些词,
stopword_list = stopwords['word'].tolist()
sentences['filtered] = sentences['sentence'].apply(lambda x: ' '.join([i for i in x.split() if i not in stopword_list]))
你得到
0 Home of Jacksons
1 Is real thing?
2 What with you?
3 Tomatoes are
4 I think it's to path ways now
或者您可以将代码包装在一个函数中,
def remove_stopwords(input_string, stopwords_list):
my_string = []
for word in input_string.split():
if word not in stopwords_list:
my_string.append(word)
return " ".join(my_string)
stopword_list = stopwords['word'].tolist()
sentences['sentence'].apply(lambda row: remove_stopwords(row, stopword_list))
您上面的代码中有很多语法错误。如果您将停用词保留为列表(或集合)而不是 DataFrame,则以下内容将起作用 -
data = ["Home of the Jacksons", "Is it the real thing?", "What is it with you?", "Tomatoes are the best", "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
stopwords = ["the", "it", "best", "is"]
sentences.sentence.str.split().apply(lambda x: " ".join([y for y in x if y not in stopwords]))
成功的关键是将停用词列表转换为 set()
:集的查找时间为 O(1),而列表的时间为 O(N)。
stop_set = set(stopwords.word.tolist())
sentences.sentence.str.split()\
.apply(lambda x: ' '.join(w for w in x if w not in stop_set))