Using replace() to remove full sentences from text data in Python

Question

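I am trying to remove three sentences from paragraphs of text data. I have a pandas DataFrame containing several rows of paragraphs, and I want to remove the same three sentences from each of them. For example,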

import pandas as pd

df_1 = pd.DataFrame({"text": ["the dog is red. He goes outside and runs.", 
                              "i like dogs because they are fun. i don't like that dogs bark at mailmen", 
                              "dogs bark at mailmen and i think its funny."]})
    
custom_stopwords = ["the dog is red", "i like dogs", "dogs bark at mailmen"]
 
for i in custom_stopwords: 
    df_1['text'] = df_1['text'].str.replace(i, '')

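This approach works on the example I've provided, but it doesn't work on my actual data. My data is very large, but I don't see why that would matter here. What happens is that some of my sentences are removed while others are not. For example, I can't remove the word "installation(s)" without escaping the parentheses with "/".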

Answer 1

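In pandas versions before 2.0, pandas.Series.str.replace has a default keyword argument regex=True, which means the pattern is treated as a regular expression (so the parentheses in your "installation(s)" are read as regex syntax rather than literal characters). You are trying to replace a string literal (or at least a non-regex). Adding regex=False should work: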

for i in custom_stopwords: 
    df_1['text'] = df_1['text'].str.replace(i, '', regex=False)
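If you would rather remove all phrases in a single pass instead of looping, one possible alternative is to escape each phrase with re.escape and join them into one regex. This is only a sketch: the second row of sample text and the "installation(s)" entry are made up for illustration, and sorting by length is an assumption to keep a shorter phrase from shadowing a longer one that starts the same way.

```python
import re
import pandas as pd

# Illustrative data; the second row is hypothetical, not from the question.
df_1 = pd.DataFrame({"text": ["the dog is red. He goes outside and runs.",
                              "Check the installation(s) first. Then reboot."]})

custom_stopwords = ["the dog is red", "installation(s)"]

# Escape each phrase so characters such as ( ) . are matched literally,
# then join the phrases into one alternation pattern (longest first).
ordered = sorted(custom_stopwords, key=len, reverse=True)
pattern = "|".join(re.escape(s) for s in ordered)

# regex=True is passed explicitly because the default differs between pandas versions.
df_1["text"] = df_1["text"].str.replace(pattern, "", regex=True)
print(df_1["text"].tolist())
```

Note that removing a phrase leaves the surrounding punctuation and spaces in place, exactly as the loop version does.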

Answer 2

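Use str.replace with the parameter regex=False. As a regular expression, (s) is interpreted as a capture group, which in this particular case matches just the character s, so the pattern matches "installations" rather than the literal text "installation(s)".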
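The group behaviour can be checked with the standard re module on its own. This is a small sketch with a made-up sample string:

```python
import re

pattern = "installation(s)"
text = "installation(s) and installations"  # hypothetical sample text

# As a regex, the parentheses form a capture group, so the pattern
# matches the text "installations", not the literal "installation(s)".
as_regex = re.sub(pattern, "", text)

# Escaping the pattern makes the parentheses literal characters.
as_literal = re.sub(re.escape(pattern), "", text)

print(repr(as_regex))    # 'installation(s) and '
print(repr(as_literal))  # ' and installations'
```

This mirrors what happens inside str.replace when regex=True: the literal "installation(s)" is left untouched while "installations" is removed.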

python

text

replace

str-replace

pandas