How can I make this Python string-to-float function more efficient?

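I wrote a small Python script that takes a list of strings, each holding either a plain number or a number followed by a word (million, billion, trillion), converts it into a list of floats, and prints the result.

Assume that 'million', 'billion' and 'trillion' are the only words that can appear, and that, when present, they are always separated from the number by a space.

The code is below. Is there any way to make the script more concise and efficient?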


a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"]

for i in range(len(a)):

  num_phrase=''

  if ' ' in a[i]:
    num_phrase=a[i].split(" ")[1]

  if num_phrase=="million":
    a[i]=float(a[i].split(" ")[0])*1000000
  elif num_phrase=="billion":
    a[i]=float(a[i].split(" ")[0])*1000000000
  elif num_phrase=="trillion":
    a[i]=float(a[i].split(" ")[0])*1000000000000
  else:
    a[i]=float(a[i].split(" ")[0])


print(list(a))

You could use a dictionary:

d = {'': 1, 'm': 1e6, 'b': 1e9, 't': 1e12}
a = [float(number) * d[unit[:1]]
     for s in a
     for number, _, unit in [s.partition(' ')]]

Or replace the words with scientific-notation exponents:

a = [float(s.replace(' million', 'e6')
            .replace(' billion', 'e9')
            .replace(' trillion', 'e12'))
     for s in a]
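As a quick illustration of why the replacement works (a small sketch, not part of the benchmark): float() parses exponent notation directly, so once the word is swapped for an exponent the whole string is already a valid float literal. The helper name to_float is made up for this example.

```python
def to_float(s):
    # Hypothetical helper name; same replace chain as above, applied to one string.
    return float(s.replace(' million', 'e6')
                  .replace(' billion', 'e9')
                  .replace(' trillion', 'e12'))

# The intermediate string is plain exponent notation that float() understands.
print("1.684 million".replace(' million', 'e6'))
print(to_float("2.5 billion"))
print(to_float("356852"))
```

Strings without a word pass through the replace calls unchanged, so no separate branch is needed for plain numbers.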

Benchmark results with your list multiplied by 1000:

Round 1  Round 2  Round 3
3640 us  3618 us  3555 us  original
2747 us  2738 us  2706 us  Kelly1
2258 us  2272 us  2214 us  Kelly2
3759 us  3841 us  3802 us  dim_an
3495 us  3542 us  3562 us  motyzk

Benchmark code (Try it online!):

from timeit import timeit

def baseline(a):
    pass

def original(a):
 for i in range(len(a)):
  num_phrase=''
  if ' ' in a[i]:
    num_phrase=a[i].split(" ")[1]
  if num_phrase=="million":
    a[i]=float(a[i].split(" ")[0])*1000000
  elif num_phrase=="billion":
    a[i]=float(a[i].split(" ")[0])*1000000000
  elif num_phrase=="trillion":
    a[i]=float(a[i].split(" ")[0])*1000000000000
  else:
    a[i]=float(a[i].split(" ")[0])
 return a

def Kelly1(a):
    d = {'': 1, 'm': 1e6, 'b': 1e9, 't': 1e12}
    return [float(number) * d[unit[:1]]
            for s in a
            for number, _, unit in [s.partition(' ')]]

def Kelly2(a):
    return [float(s.replace(' million', 'e6')
                   .replace(' billion', 'e9')
                   .replace(' trillion', 'e12'))
            for s in a]

def dim_an(a):
 multipliers = {
    "million":  10 ** 6,
    "billion":  10 ** 9,
    "trillion": 10 ** 12,
 }
 for i in range(len(a)):
    words = a[i].split()
    if len(words) == 0 or len(words) > 2:
        raise ValueError("Bad string: " + a[i])

    result = float(words[0])
    if len(words) == 2:
        result *= multipliers[words[1]]
    a[i] = result
 return a

def motyzk(a):
 str_to_num = {
    "": 1,
    "million": 1000000,
    "billion": 1000000000,
    "trillion": 1000000000000,
 }
 for i in range(len(a)):
  num_phrase=''
  if ' ' in a[i]:
    num_phrase=a[i].split(" ")[1]
  a[i]=float(a[i].split(" ")[0])*str_to_num[num_phrase]
 return a

# config
funcs = original, Kelly1, Kelly2, dim_an, motyzk, baseline
a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"] * 1000
number = 100

# correctness
expect = original(a.copy())
for func in funcs:
    result = func(a.copy())
    print(result == expect, func.__name__)

# speed
tss = [[] for _ in funcs]
for _ in range(3):
    print('Round 1  Round 2  Round 3')
    for func, ts in zip(funcs, tss):
        t = timeit(lambda: func(a.copy()), number=number) / number
        ts.append(t)
        print(*('%4d us ' % (t * 1e6) for t in ts), func.__name__)
    print()

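If we are talking about readability and performance, I would change two things here:

  1. I would not call str.split multiple times; it is expensive.
  2. I would perhaps replace the chain of ifs with a dict. With only a few branches this will not necessarily be faster, but in my view it makes the code more readable, and it helps once you have more strings.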
a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"]

multipliers = {
    "million":  10 ** 6,
    "billion":  10 ** 9,
    "trillion": 10 ** 12,
}

for i in range(len(a)):
    words = a[i].split()
    if len(words) == 0 or len(words) > 2:
        raise ValueError("Bad string: " + a[i])

    result = float(words[0])
    if len(words) == 2:
        result *= multipliers[words[1]]
    a[i] = result

print(a)
a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"]

str_to_num = {
    "": 1,
    "million": 1000000,
    "billion": 1000000000,
    "trillion": 1000000000000,
}

for i in range(len(a)):
  
  num_phrase=''

  if ' ' in a[i]:
    num_phrase=a[i].split(" ")[1]

  a[i]=float(a[i].split(" ")[0])*str_to_num[num_phrase]



print(a)

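The multipliers version above is written close to how I would solve the problem in production: it tries to be clear and does some extra checks so that invalid strings raise an error, which costs some performance.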
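To make the "extra checks" concrete, here is a minimal sketch of the same idea wrapped in a function. The names MULTIPLIERS and parse_checked are made up for this illustration, and the explicit check for an unknown word is an addition of mine (the loop above would raise KeyError in that case instead).

```python
MULTIPLIERS = {"million": 10 ** 6, "billion": 10 ** 9, "trillion": 10 ** 12}

def parse_checked(s):
    # Hypothetical helper: split once, validate the shape, then convert.
    words = s.split()
    if len(words) == 0 or len(words) > 2:
        raise ValueError("Bad string: " + repr(s))
    result = float(words[0])  # float() itself raises ValueError for a non-numeric word
    if len(words) == 2:
        if words[1] not in MULTIPLIERS:
            # Assumption: report unknown words as ValueError rather than KeyError.
            raise ValueError("Unknown multiplier: " + repr(words[1]))
        result *= MULTIPLIERS[words[1]]
    return result

# Valid input converts; malformed input is rejected instead of silently passing.
for text in ["2.5 billion", "", "1 2 3", "7 zillion"]:
    try:
        print(repr(text), "->", parse_checked(text))
    except ValueError as exc:
        print(repr(text), "-> rejected:", exc)
```

Each check is one more operation per string, which is where the extra time in the benchmark comes from.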

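If we are trying to solve the problem as fast as possible (and do not care much about validating the input), we can use the following approach, which keeps string manipulation to a minimum.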

a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"]
def convert(input):
    if " million" in input:
        return float(input[:-8]) * 1000000.0
    elif " billion" in input:
        return float(input[:-8]) * 1000000000.0
    elif " trillion" in input:
        return float(input[:-9]) * 1000000000000.0
    else:
        return float(input)
print([convert(e) for e in a])
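The slice lengths in convert come from the suffix lengths, counting the leading space: " million" and " billion" are 8 characters long and " trillion" is 9. A small sketch that derives the slice from the suffix instead of hard-coding it; the SUFFIXES table and the name convert_table are made up for this illustration, and it uses endswith rather than in, which is a slightly stricter test than the code above. I have not benchmarked this variant.

```python
SUFFIXES = {" million": 1e6, " billion": 1e9, " trillion": 1e12}

def convert_table(text):
    # Hypothetical variant of convert(): same idea, slice length taken from the suffix.
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            # Drop exactly len(suffix) characters from the end: 8, 8, or 9.
            return float(text[:-len(suffix)]) * factor
    return float(text)

print({suffix: len(suffix) for suffix in SUFFIXES})
print("1.684 million"[:-8])
print(convert_table("3 trillion"))
```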

Benchmark results (thanks to Kelly Bundy):

Round 1  Round 2  Round 3
4224 us  4164 us  4170 us  original
3129 us  3121 us  3180 us  Kelly1
3043 us  3176 us  3100 us  Kelly2
4381 us  4425 us  4345 us  dim_an
4053 us  4089 us  4119 us  motyzk
2160 us  2187 us  2169 us  dim_an2
  12 us    12 us    12 us  baseline
