Regular expression to find a sequence of numbers before multiple patterns, into a new column (Python, Pandas)

Question

Here is my sample data:

import pandas as pd
import re
  
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
          1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
          2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
          3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
          4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
          5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
         'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})

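This is my desired output:

I created a new column named 'HP', and I want to fill it with the horsepower figure extracted from the original column ('Engine Information').

Here is the code I tried: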

cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\d+(?=\shp|hp)', str(x)))

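I want the regex to match this pattern: a sequence of digits before ' hp' or 'hp'. That is because some cells have no space between the number and 'hp', as my sample shows.

I'm sure the regex is correct, because I have successfully done a similar thing in R. However, I have tried functions such as str.extract, re.findall, re.search and re.match. They either raise errors or return 'None' values (as shown in the sample). So I'm a bit lost here.

Thanks!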

Answer 1

This will get the number before hp, whether it is followed by no whitespace, one whitespace character, or several.

r'\d+(?=\s+hp|hp)'

You can verify the regex here: https://regex101.com/r/pXySxm/1
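
As a rough illustration of how this lookahead pattern behaves, the sketch below runs it with re.search over two of the sample strings from the question. The names HP_PATTERN and samples are made up for this example and are not part of the original answer.

```python
import re

# Lookahead pattern from the answer: digits that are followed by
# optional whitespace and then "hp". The lookahead itself consumes nothing.
HP_PATTERN = re.compile(r'\d+(?=\s+hp|hp)')

# Two rows from the question: one with a space before "hp", one without.
samples = [
    'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
    'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
]

# re.search scans the whole string, so the digits can appear anywhere.
found = [HP_PATTERN.search(s).group() for s in samples]
print(found)  # ['190', '390']
```

Because the pattern has no capture group, it fits re.search or re.findall as written; str.extract would need the digits wrapped in parentheses.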

Answer 2

You can use str.extract:

cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)

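Details

(\d+)\s*hp\b - matches one or more digits and captures them into Group 1, then matches zero or more whitespace characters (\s*) and hp (case-insensitively, because of flags=re.I) followed by a word boundary (\b), so hp is not matched as the start of a longer word
str.extract returns only the captured value, so hp and the whitespace are not part of the result.

Python demo results: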

>>> cars
                               Engine Information   HP
0         Honda 2.4L 4 cylinder 190 hp 162 ft-lbs  190
1  Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs  420
2          Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs  390
3          MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs  118
4       Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV  360
5           GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs  352
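
One detail the demo above does not show is that str.extract yields strings, and rows without a match come back as missing values. Below is a minimal sketch, assuming you want a nullable integer column. The third row is a hypothetical entry with no horsepower figure, and expand=False together with the Int64 conversion are additions that the original answer does not use.

```python
import re
import pandas as pd

cars = pd.DataFrame({'Engine Information': [
    'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
    'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
    'Unknown engine, no power figure',  # hypothetical row without a match
]})

# expand=False returns a Series instead of a one-column DataFrame.
hp_text = cars['Engine Information'].str.extract(
    r'(\d+)\s*hp\b', flags=re.I, expand=False
)

# The captured values are strings; convert them to numbers and use the
# nullable Int64 dtype so the unmatched row stays missing.
cars['HP'] = pd.to_numeric(hp_text).astype('Int64')
print(cars)
```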

Answer 3

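There are several problems:

re.match only looks at the beginning of the string; if your pattern can appear anywhere, use re.search
Either use a raw string or escape the backslash, i.e. '\\d hp' or r'\d hp' - raw strings exist precisely to spare you the escaping
Return the matched group. You only search, but you never return the group that was found. re.search(rex, string) gives you a complex object (a match object) from which you can extract all groups, e.g. re.search(rex, string)[0]
You have to wrap the access in a separate function, because you must check whether there was a match at all before accessing the group. If you don't, an exception may abort the apply partway through
apply is slow; use pandas' vectorized functions such as extract: cars['Engine Information'].str.extract(r'(\d+) ?hp')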
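
The first and third points can be seen in a small sketch on one of the sample strings. The variable names here (s, pattern, anchored, scanned) are chosen only for illustration.

```python
import re

s = 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs'
pattern = r'(\d+) ?hp'

# re.match only tries position 0, where the string starts with "Honda",
# so it returns None for this input.
anchored = re.match(pattern, s)

# re.search scans forward and finds the first occurrence anywhere.
scanned = re.search(pattern, s)

print(anchored)    # None
print(scanned[0])  # 190 hp  (the whole match)
print(scanned[1])  # 190     (the first capture group)
```

This is why the lambda in the question fills the column with None: re.match never gets past the manufacturer name.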

Your approach should work like this:

def match_horsepower(s):
    m = re.search(r'(\d+) ?hp', s)
    return int(m[1]) if m else None

cars['HP'] = cars['Engine Information'].apply(match_horsepower)
