A regular expression to exclude a word and specific characters

Question

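I need to use a regular expression in Python to filter specific characters out of the strings in my dataset.

My data contains business and employee information. I want to keep the employee count and skip/exclude any cell that contains the word 'Self-employed'. What regular expression can I use to remove everything except those characters?
Examples: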
String: "Health, Wellness and Fitness, 501-1000 employees"
Desired Outcome: 501-1000

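Or:
String: "Retail, 10,000+ employees"
Desired Outcome: 10,000+

Or, if the cell contains 'Self-employed', that word should be skipped, left as it is, and processing should move on to the next cell:
String: "Self-employed"
Desired Outcome: Self-employed

I want a pattern that eliminates everything other than what the desired outcome requires. This is the code I am using, but it does not seem to change anything. What am I doing wrong?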

if 'employee' in row.keys():
    row['employee'] = re.sub("([0-9]+[,\-]*[0-9]*[+]?|Self-employed)", '', str(row['employee']))

Answer 1

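re.sub matches the regular expression pattern and replaces the matches. That is not what you want. You want the opposite: match the pattern and use the match. So re.sub does not seem to be the right approach here. You can use re.search instead to find the part of the string that matches your regex, then assign the result to row['employee']. Here is an example based on the code you have provided so far: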

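import re

if 'employee' in row.keys():
    # Find the first token that starts with a digit, or the literal 'Self-employed'.
    match = re.search(r"(\d[^\s]*|Self-employed)", str(row['employee']))
    if match:
        # Keep only the matched text; the cell is left unchanged when nothing matches.
        row['employee'] = match.group()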

Credit to Porsche9II for the regex optimization.
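To check the approach against the three sample strings from the question, here is a minimal, self-contained sketch. The helper name extract_employee_count and the samples list are made up for illustration and are not part of the original code; the pattern uses \S*, which is equivalent to the [^\s]* in the answer. It also shows what re.sub does with the same pattern, which is why the original attempt deleted the wanted text instead of keeping it.

```python
import re

# Same idea as the answer's pattern: a digit followed by any run of
# non-whitespace characters, or the literal word "Self-employed".
PATTERN = re.compile(r"\d\S*|Self-employed")


def extract_employee_count(value):
    """Hypothetical helper: return the first match, or the input unchanged."""
    match = PATTERN.search(str(value))
    return match.group() if match else value


# Sample strings taken from the question.
samples = [
    "Health, Wellness and Fitness, 501-1000 employees",
    "Retail, 10,000+ employees",
    "Self-employed",
]

results = [extract_employee_count(s) for s in samples]
print(results)

# For contrast: re.sub replaces the matched text, so the employee count
# is removed and everything else is kept.
removed = PATTERN.sub("", "Retail, 10,000+ employees")
print(repr(removed))
```

One caveat with this pattern: re.search returns the first match, so a digit that appears earlier in the string (for example in a company name) would be picked up before the employee count.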

Tags: python, regex, csv, data-cleaning