Regex to remove specific parts of a string in a pandas DataFrame column (Python)
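I'm working with a dataframe that contains addresses, and I want to remove a specific part of each string. For example:
I want to remove everything from the word "REFERENCE:" or "reference:" to the end of the sentence. I also want to create a new column that looks like this (without REFERENCE:/reference: and the words that follow it). Can you help me do this with a regex?
I would like the new column to look like this:
You can use a regular expression to get the desired result.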
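import pandas as pd
df = pd.DataFrame({"address": ["Street Pases de la Reforma #200 REFERENCE: Green house", "Street Carranza #300 12 & 13 REFERENCE: There is a tree"]})
# Lazily capture everything before the literal uppercase "REFERENCE" (case-sensitive)
df.address.str.findall(r".+?(?=REFERENCE)").explode()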
0 Street Pases de la Reforma #200
1 Street Carranza #300 12 & 13
Explanation of the regex pattern:
. matches any character (except for line terminators)
+? Quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?=REFERENCE) Positive lookahead: asserts that the literal text REFERENCE follows, without including it in the match
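The lazy quantifier and lookahead described above can be tried with the standard `re` module alone. This is a minimal sketch, not the answer's exact pattern: the `strip_reference` helper, the lowercase second sample, the third sample without a marker, and the added `\s*` and `|$` parts are all assumptions made for illustration.

```python
import re

# Sample addresses; the lowercase marker and the marker-free row are illustrative assumptions.
addresses = [
    "Street Pases de la Reforma #200 REFERENCE: Green house",
    "Street Carranza #300 12 & 13 reference: There is a tree",
    "Street Without Marker #1",
]

# (.*?)  lazily captures as little text as possible
# \s*    swallows the whitespace just before the marker
# (?=reference:|$)  lookahead: stop at the marker or, if there is none, at end of string
pattern = re.compile(r"^(.*?)\s*(?=reference:|$)", re.IGNORECASE)


def strip_reference(text):
    """Return the part of text that precedes the reference marker (hypothetical helper)."""
    return pattern.match(text).group(1)


cleaned = [strip_reference(a) for a in addresses]
for row in cleaned:
    print(row)
```

Because the lookahead also accepts the end of the string, rows that contain no marker are returned unchanged rather than dropped.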
The regex should look like this:
import re
# Match "reference:" in any letter case, plus the whitespace before it and everything after it
discard_re = re.compile(r'\s*reference:.*', re.IGNORECASE)
Then you can add the new column:
df['address_new'] = df['address'].map(lambda x: discard_re.sub('', x))
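As an alternative to `map` with a compiled pattern, the same substitution can probably be done with pandas' vectorized string methods. This is a sketch under assumptions: the column name `address` follows the sample dataframe from the first answer, the second row's lowercase marker is made up for illustration, and the leading `\s*` is added so no trailing space is left behind.

```python
import pandas as pd

# Sample data; the column name "address" and the lowercase marker are assumptions.
df = pd.DataFrame({"address": [
    "Street Pases de la Reforma #200 REFERENCE: Green house",
    "Street Carranza #300 12 & 13 reference: There is a tree",
]})

# Remove the marker (any letter case), the whitespace before it, and everything after it.
df["address_new"] = df["address"].str.replace(
    r"\s*reference:.*$", "", case=False, regex=True
)

print(df["address_new"].tolist())
```

Passing `regex=True` explicitly matters here, since recent pandas versions treat the pattern as a literal string by default.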