How to make this python string-to-float function more efficient?
I created a small Python script that converts a list of strings, each holding either a plain number or a number followed by a word (million, billion, trillion), into a list of floats and prints it.
Assume that 'million', 'billion' and 'trillion' are the only words that can appear, and that they are always separated from the number by a space (when there is a word at all).
The code is below. Is there any way to make the script more concise and efficient?
    a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"]
    for i in range(len(a)):
        num_phrase=''
        if ' ' in a[i]:
            num_phrase=a[i].split(" ")[1]
        if num_phrase=="million":
            a[i]=float(a[i].split(" ")[0])*1000000
        elif num_phrase=="billion":
            a[i]=float(a[i].split(" ")[0])*1000000000
        elif num_phrase=="trillion":
            a[i]=float(a[i].split(" ")[0])*1000000000000
        else:
            a[i]=float(a[i].split(" ")[0])
    print(list(a))
You could use a dictionary:
    d = {'': 1, 'm': 1e6, 'b': 1e9, 't': 1e12}
    a = [float(number) * d[unit[:1]]
         for s in a
         for number, _, unit in [s.partition(' ')]]
Or replace the words with scientific notation:
    a = [float(s.replace(' million', 'e6')
                .replace(' billion', 'e9')
                .replace(' trillion', 'e12'))
         for s in a]
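As a quick sanity check, the two ideas above can be wrapped in functions and compared on the sample list. This is only a sketch: the names `with_dict`, `with_exponent` and `SAMPLE` are made up for illustration, and the comparison uses `math.isclose` because multiplying a float by `1e6` and parsing `'...e6'` directly are not guaranteed to round identically for every input.

```python
import math

SAMPLE = ["10", "1000", "1.684 million", "356852", "2.5 billion", "3 trillion"]

def with_dict(strings):
    # Multiplier is looked up by the first letter of the unit ('' if no unit).
    d = {'': 1, 'm': 1e6, 'b': 1e9, 't': 1e12}
    result = []
    for s in strings:
        number, _, unit = s.partition(' ')
        result.append(float(number) * d[unit[:1]])
    return result

def with_exponent(strings):
    # Rewrite the unit as an exponent so float() does all the parsing.
    return [float(s.replace(' million', 'e6')
                   .replace(' billion', 'e9')
                   .replace(' trillion', 'e12'))
            for s in strings]

pairs = zip(with_dict(SAMPLE), with_exponent(SAMPLE))
print(all(math.isclose(x, y) for x, y in pairs))
```

Both functions return a new list instead of overwriting `a`, which makes them easier to compare side by side.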
Benchmark results with your list multiplied by 1000:
    Round 1   Round 2   Round 3
    3640 us   3618 us   3555 us   original
    2747 us   2738 us   2706 us   Kelly1
    2258 us   2272 us   2214 us   Kelly2
    3759 us   3841 us   3802 us   dim_an
    3495 us   3542 us   3562 us   motyzk
Benchmark code (Try it online!):
    from timeit import timeit

    def baseline(a):
        # Does no work; measures the overhead of the call and of a.copy().
        pass

    def original(a):
        for i in range(len(a)):
            num_phrase=''
            if ' ' in a[i]:
                num_phrase=a[i].split(" ")[1]
            if num_phrase=="million":
                a[i]=float(a[i].split(" ")[0])*1000000
            elif num_phrase=="billion":
                a[i]=float(a[i].split(" ")[0])*1000000000
            elif num_phrase=="trillion":
                a[i]=float(a[i].split(" ")[0])*1000000000000
            else:
                a[i]=float(a[i].split(" ")[0])
        return a

    def Kelly1(a):
        d = {'': 1, 'm': 1e6, 'b': 1e9, 't': 1e12}
        return [float(number) * d[unit[:1]]
                for s in a
                for number, _, unit in [s.partition(' ')]]

    def Kelly2(a):
        return [float(s.replace(' million', 'e6')
                       .replace(' billion', 'e9')
                       .replace(' trillion', 'e12'))
                for s in a]

    def dim_an(a):
        multipliers = {
            "million": 10 ** 6,
            "billion": 10 ** 9,
            "trillion": 10 ** 12,
        }
        for i in range(len(a)):
            words = a[i].split()
            if len(words) == 0 or len(words) > 2:
                # Report the offending string (the original referenced an undefined name e).
                raise ValueError("Bad string: " + a[i])
            result = float(words[0])
            if len(words) == 2:
                result *= multipliers[words[1]]
            a[i] = result
        return a

    def motyzk(a):
        str_to_num = {
            "": 1,
            "million": 1000000,
            "billion": 1000000000,
            "trillion": 1000000000000,
        }
        for i in range(len(a)):
            num_phrase=''
            if ' ' in a[i]:
                num_phrase=a[i].split(" ")[1]
            a[i]=float(a[i].split(" ")[0])*str_to_num[num_phrase]
        return a

    # config
    funcs = original, Kelly1, Kelly2, dim_an, motyzk, baseline
    a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"] * 1000
    number = 100

    # correctness (each function gets its own copy, since some modify the list in place)
    expect = original(a.copy())
    for func in funcs:
        result = func(a.copy())
        print(result == expect, func.__name__)

    # speed
    tss = [[] for _ in funcs]
    for _ in range(3):
        print('Round 1 Round 2 Round 3')
        for func, ts in zip(funcs, tss):
            t = timeit(lambda: func(a.copy()), number=number) / number
            ts.append(t)
            print(*('%4d us ' % (t * 1e6) for t in ts), func.__name__)
        print()
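If you only want a single number per function rather than a table of rounds, the same measurement can be sketched with `timeit.repeat`, taking the minimum over the rounds (the `timeit` documentation suggests the minimum is usually the figure of interest). The helper name `best_time_us` and the stand-in workload `parse_all` are assumptions made for this example, not part of the benchmark above.

```python
from timeit import repeat

def best_time_us(func, data, number=100, rounds=3):
    """Return the fastest per-call time in microseconds over several rounds."""
    # Each round calls func on a fresh copy `number` times, as in the benchmark above.
    totals = repeat(lambda: func(data.copy()), number=number, repeat=rounds)
    return min(totals) / number * 1e6

def parse_all(strings):
    # Stand-in workload: the scientific-notation approach.
    return [float(s.replace(' million', 'e6')
                   .replace(' billion', 'e9')
                   .replace(' trillion', 'e12'))
            for s in strings]

data = ["10", "1.684 million", "3 trillion"] * 1000
print('%.0f us' % best_time_us(parse_all, data))
```

The absolute numbers depend on the machine and Python version, so only compare results measured in the same run.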
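If we are talking about both readability and performance, I would change two things here:
- I would not call str.split multiple times; it is expensive.
- I would probably replace the chain of ifs with a dict. With only a few branches that will not necessarily be faster, but in my view it makes the code more readable, and it helps once you have more strings to handle.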
    a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"]
    multipliers = {
        "million": 10 ** 6,
        "billion": 10 ** 9,
        "trillion": 10 ** 12,
    }
    for i in range(len(a)):
        # Split once and reuse the pieces.
        words = a[i].split()
        if len(words) == 0 or len(words) > 2:
            # Report the offending string (the original referenced an undefined name e).
            raise ValueError("Bad string: " + a[i])
        result = float(words[0])
        if len(words) == 2:
            result *= multipliers[words[1]]
        a[i] = result
    print(a)
    a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"]
    str_to_num = {
        "": 1,
        "million": 1000000,
        "billion": 1000000000,
        "trillion": 1000000000000,
    }
    for i in range(len(a)):
        num_phrase=''
        if ' ' in a[i]:
            num_phrase=a[i].split(" ")[1]
        a[i]=float(a[i].split(" ")[0])*str_to_num[num_phrase]
    print(a)
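The version above that uses the multipliers dict is written close to how I would solve the problem in production: it tries to be clear and does some extra checks so that invalid strings raise an error, which costs some performance.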
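The validating approach with the `multipliers` dict can also be packaged as a reusable function. The following is a minimal sketch under the question's assumptions; the function name `parse_amount`, the constant `MULTIPLIERS` and the wording of the error messages are illustrative choices, not part of the original answer.

```python
MULTIPLIERS = {
    "million": 10 ** 6,
    "billion": 10 ** 9,
    "trillion": 10 ** 12,
}

def parse_amount(text):
    """Convert '2.5 billion' or '356852' to a float, rejecting anything else."""
    words = text.split()
    if len(words) == 0 or len(words) > 2:
        raise ValueError("Bad string: " + repr(text))
    value = float(words[0])  # float() raises ValueError for a non-numeric first word
    if len(words) == 2:
        try:
            value *= MULTIPLIERS[words[1]]
        except KeyError:
            # Turn the dict miss into the same error type as the other checks.
            raise ValueError("Unknown unit in: " + repr(text)) from None
    return value

a = ["10", "1000", "1.684 million", "356852", "2.5 billion", "3 trillion"]
print([parse_amount(s) for s in a])
```

Raising `ValueError` for every kind of bad input means a caller has only one exception type to handle.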
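If we are trying to solve the problem as fast as possible (and do not care much about validating the input), we can use the following approach, which keeps string manipulation to a minimum.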
    a = ["10", "1000" , "1.684 million", "356852", "2.5 billion", "3 trillion"]
    def convert(s):
        # The slices drop the trailing " million" / " billion" (8 chars) or " trillion" (9 chars).
        if " million" in s:
            return float(s[:-8]) * 1000000.0
        elif " billion" in s:
            return float(s[:-8]) * 1000000000.0
        elif " trillion" in s:
            return float(s[:-9]) * 1000000000000.0
        else:
            return float(s)
    print([convert(e) for e in a])
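The slice lengths 8 and 9 above are tied by hand to the lengths of the unit words. A variant that derives the slice from the matched suffix might look like the sketch below. It is an illustration only: `convert_by_suffix` and `SUFFIXES` are hypothetical names, it relies on the same assumption that the unit is always at the end of the string, and it has not been benchmarked against the version above, so it may well be slower.

```python
# Suffix -> multiplier; assumed to be the only units that can appear.
SUFFIXES = (
    (" million", 1e6),
    (" billion", 1e9),
    (" trillion", 1e12),
)

def convert_by_suffix(text):
    for suffix, factor in SUFFIXES:
        if text.endswith(suffix):
            # Slice off exactly the suffix that matched.
            return float(text[:-len(suffix)]) * factor
    # No known unit: treat the whole string as a number.
    return float(text)

a = ["10", "1000", "1.684 million", "356852", "2.5 billion", "3 trillion"]
print([convert_by_suffix(s) for s in a])
```

Adding a new unit then only requires one more entry in `SUFFIXES`.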
Benchmark results (thanks to Kelly Bundy):
    Round 1   Round 2   Round 3
    4224 us   4164 us   4170 us   original
    3129 us   3121 us   3180 us   Kelly1
    3043 us   3176 us   3100 us   Kelly2
    4381 us   4425 us   4345 us   dim_an
    4053 us   4089 us   4119 us   motyzk
    2160 us   2187 us   2169 us   dim_an2
      12 us     12 us     12 us   baseline