Extract numbers from strings in a column of a pandas DataFrame

Question

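I have a DataFrame named data, and I am trying to clean one of its columns so that the prices can be converted to numeric values. This is how I filter the column to find the incorrect values:

data[data['Incorrect_Price'].astype(str).str.contains('[A-Za-z]')]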

    Incorrect_Price    Occurences   errors
23  99 cents                732       1
50  3 dollars and 49 cents  211       1
72  the price is 625        128       3
86  new price is 4.39       19        2
138 4 bucks                 3         1
199 new price 429           13        1
225 price is 9.99           5         1
240 new price is 499        8         2

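I have tried data['Incorrect_Price'][20:51].str.findall(r"(\d+) dollars") and data['Incorrect_Price'][20:51].str.findall(r"(\d+) cents") to find the rows that contain "cents" and "dollars", so that I could extract the dollar and cent amounts, but I have not been able to combine the two while iterating over all rows of the DataFrame.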

I would like the results to look like this:

    Incorrect_Price        Desired    Occurences    errors
23  99 cents                .99           732         1
50  3 dollars and 49 cents  3.49          211         1
72  the price is 625        625           128         3
86  new price is 4.39       4.39           19         2
138 4 bucks                 4.00           3          1
199 new price 429           429            13         1
225 price is 9.99           9.99           5          1
240 new price is 499        499            8          2

Answer 1

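As long as the Incorrect_Price strings keep the structure shown in your example (the numbers are not spelled out in words), this task is relatively easy to solve.

With regular expressions, you can extract the numeric part and the optional "cent"/"cents" or "dollar"/"dollars" using the approach from a similar SO question. The two main differences are that you are looking for pairs of a numeric value and "cent[s]" or "dollar[s]", and that such pairs may occur more than once in a string.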

import re


def extract_number_currency(value):
    # Find every (number, currency word) pair; the raw string keeps \d and \s intact.
    prices = re.findall(r'(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?', value)

    result = 0.0
    for amount, currency in prices:
        partial = float(amount)
        if currency == 'cent':
            # Cents are hundredths of a dollar.
            result += partial / 100
        else:
            result += partial

    return result


print(extract_number_currency('3 dollars and 49 cent'))

3.49
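Note that the pattern only matches a number that is followed by "cent" or "dollar", so rows such as "the price is 625", "new price is 4.39" or "4 bucks" would come back as 0.0. One possible way to cover them is sketched below. It is only a sketch under assumptions: extract_price, CURRENCY_RE and NUMBER_RE are hypothetical names, "buck" is treated as a synonym for "dollar", and a string with no currency word is assumed to carry its price as the first bare number.

```python
import re

# Same idea as the pattern above, with "buck" added as an assumed synonym for "dollar".
CURRENCY_RE = re.compile(r'(\d*\.?\d{1,2})\s*(cent|dollar|buck)s?')
# Fallback for strings without a currency word: the first bare number.
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')


def extract_price(text):
    """Hypothetical helper: sum the currency amounts, else take the first number."""
    pairs = CURRENCY_RE.findall(text)
    if pairs:
        total = sum(float(n) / 100 if unit == 'cent' else float(n) for n, unit in pairs)
        return round(total, 2)
    match = NUMBER_RE.search(text)
    # None signals that no number was found at all.
    return float(match.group()) if match else None


for text in ['99 cents', '3 dollars and 49 cents', 'the price is 625',
             'new price is 4.39', '4 bucks']:
    print(text, '->', extract_price(text))
```

Whether taking the first bare number is the right fallback depends on your data, so it is worth checking the result against a few rows by hand.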

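Now all that remains is to apply extract_number_currency to every incorrect value in the column where the price is written in words. For simplicity, I apply it to all values here (but I am sure you will be able to handle the subset):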

data['Desired'] = data['Incorrect_Price'].apply(extract_number_currency)

Voilà!
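If you would rather touch only the affected rows, a boolean mask with .loc is one way to do it. The following is a minimal sketch: the sample rows are made up for illustration, the function is repeated in compact form so the block runs on its own, and the mask pattern (cents?|dollars?) is an assumption about which rows count as affected.

```python
import re

import pandas as pd


def extract_number_currency(text):
    # Same logic as the function above, repeated so this block is self-contained.
    pairs = re.findall(r'(\d*\.?\d{1,2})\s*(cent|dollar)s?', text)
    return sum(float(n) / 100 if unit == 'cent' else float(n) for n, unit in pairs)


# Made-up sample rows; the real `data` frame comes from the question.
data = pd.DataFrame({'Incorrect_Price': ['99 cents', '3 dollars and 49 cents', '12.50', 7]})

# Select only the rows whose text contains a currency word.
as_text = data['Incorrect_Price'].astype(str)
mask = as_text.str.contains(r'cents?|dollars?', regex=True)

# Only the masked rows are filled in; the other rows get NaN in 'Desired'.
data.loc[mask, 'Desired'] = as_text[mask].apply(extract_number_currency)
print(data)
```

The rows that already hold clean numbers are left alone, so they can be converted separately, for example with pd.to_numeric.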

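Breakdown of the regular expression '(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?'

There are two named capture groups, written as (?P<name_of_the_capture_group> .... )

The first capture group, (?P<value>[\d]*[.]?[\d]{1,2}), captures:

[\d] - a digit

[\d]* - repeated 0 or more times

[.]? - followed by an optional (?) dot

[\d]{1,2} - followed by a digit repeated 1 to 2 times

\s* - stands for 0 or more whitespace characters

Now the second capture group, which is simpler: (?P<currency>cent|dollar)

cent|dollar - this boils down to a choice between the strings cent and dollar

s? - the optional plural "s", so that "cents" and "dollars" match as well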
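To see the pieces in action, the same pattern can be written with re.VERBOSE, which lets each part carry its own comment, and the named groups can be read back from every match. This is just an illustrative sketch; PRICE_RE is a name chosen here, not something from the code above.

```python
import re

# The same pattern, spread over several lines with re.VERBOSE.
PRICE_RE = re.compile(r'''
    (?P<value>\d*[.]?\d{1,2})      # digits, an optional dot, then 1 to 2 digits
    \s*                            # optional whitespace
    (?P<currency>cent|dollar)      # the currency word
    s?                             # optional plural "s"
''', re.VERBOSE)

# finditer yields one match object per (value, currency) pair in the string.
for match in PRICE_RE.finditer('3 dollars and 49 cents'):
    print(match.group('value'), match.group('currency'))
```

Each match exposes its groups by name, which is easier to read than indexing into the tuples that re.findall returns.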
