Self-inlining Python code. (Now with MCVE!)
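I have a program written in Python where the user provides command-line arguments to say which statistics, in which combinations, should be computed on some data.
Originally I wrote code that would take N statistics in X combinations and calculate the result. However, I found that if I wrote out the code myself for a specific combination of statistics, it was always much faster. I then wrote code that writes the Python I would have written by hand and runs it with exec(), and this works really well. Ideally, though, I would like a way to get the same performance as having Python rewrite the loop, without requiring all my functions to be strings!
The following code is a minimal, complete, verifiable example (MCVE) that illustrates the problem.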

import time
import argparse
import collections

parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter,
    description="Demonstration that it is sometimes much faster to use exec() than to not.")
parser.add_argument("--stat", nargs='+', metavar='', action='append',
    help='Supply a list of stats to run here. You can use --stat more than once to make multiple groups.')
args = parser.parse_args()

allStats = {}

class stat1:
    def __init__(self):
        def process(someValue):
            return someValue**3
        self.calculate = process
allStats['STAT1'] = stat1()

class stat2:
    def __init__(self):
        def process(someValue):
            return someValue*someValue
        self.calculate = process
allStats['STAT2'] = stat2()

class stat3:
    def __init__(self):
        def process(someValue):
            return someValue+someValue
        self.calculate = process
allStats['STAT3'] = stat3()

allStatsString = {}
allStatsString['STAT1'] = 'STAT1 = someValue**3'
allStatsString['STAT2'] = 'STAT2 = someValue*someValue'
allStatsString['STAT3'] = 'STAT3 = someValue+someValue'

stats_to_run = set() # stats_to_run is a set of the stats the user wants to run, irrespective of grouping.
data = [collections.defaultdict(int) for x in range(0,len(args.stat))] # data is a list of dictionaries. One dictionary for each --stat group.
for group in args.stat:
    stats_to_run.update(group)
    for stat in group:
        if stat not in allStats.keys():
            print "I'm sorry Dave, I'm afraid I can't do that."; exit()

loops = 9000000
option = 1 # 1 = generic loop, 2 = hand-written loop, anything else = generated loop run with exec()
startTime = time.time()
if option == 1:
    results = dict.fromkeys(stats_to_run)
    for someValue in xrange(0,loops):
        for analysis in stats_to_run:
            results[analysis] = allStats[analysis].calculate(someValue)
        for a, analysis in enumerate(args.stat):
            data[a][tuple([ results[stat] for stat in analysis ])] += 1
elif option == 2:
    # Hand-written for the arguments: --stat STAT1 STAT2 --stat STAT3
    for someValue in xrange(0,loops):
        STAT1 = someValue**3
        STAT2 = someValue*someValue
        STAT3 = someValue+someValue
        data[0][(STAT1,STAT2)] += 1 # Store the first result group
        data[1][(STAT3,)] += 1 # Store the second result group
else:
    # Build the source of the loop from option 2 as a string, then run it.
    execute = 'for someValue in xrange(0,loops):'
    for analysis in stats_to_run:
        execute += '\n ' + allStatsString[analysis]
    for a, analysis in enumerate(args.stat):
        if len(analysis) == 1:
            execute += '\n data[' + str(a) + '][('+ analysis[0] + ',)] += 1'
        else:
            execute += '\n data[' + str(a) + '][('+ ','.join(analysis) + ')] += 1'
    print execute
    exec(execute)

## This bottom bit just adds all these numbers up so we get a single value to compare the different methods with (to make sure they are the same)
total = 0
for group in data:
    for stats in group:
        total += sum(stats)
print total
print time.time() - startTime

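If the script is run with the arguments python test.py --stat STAT1 STAT2 --stat STAT3, then on average:
Option 1 takes 92 seconds
Option 2 takes 56 seconds
Option 3 takes 54 seconds (basically the same as option 2, which is not surprising)
If the arguments become more complicated, for example "--stat STAT1 --stat STAT2 --stat STAT3 --stat STAT1 STAT2 STAT3", or the number of loops increases, the gap between the self-inlining code and the regular Python code grows wider and wider:
Option 1 takes 393 seconds
Option 3 takes 190 seconds
Typically my users run 50 to 100 million loops, with perhaps 3 groups of 2 to 5 stats each. The real stats are non-trivial, but the difference in calculation time is still hours.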
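
One way to see where the extra time in option 1 might come from is to time a single iteration's worth of work in isolation. The following is a minimal sketch, not part of the original script: it is written for Python 3 (the script above is Python 2), and the names `STATS`, `GROUP`, `inline_step`, and `dispatch_step` are hypothetical. `dispatch_step` only approximates the generic loop body (dictionary lookups, function calls, and a list comprehension to build the key), and the absolute numbers will vary by machine and interpreter.

```python
import timeit

# Hypothetical stand-ins for two of the stats in the example above.
STATS = {
    'STAT1': lambda v: v ** 3,
    'STAT2': lambda v: v * v,
}
GROUP = ('STAT1', 'STAT2')


def inline_step(value):
    """What a hand-written loop body does: plain expressions, one tuple."""
    return (value ** 3, value * value)


def dispatch_step(value):
    """Approximation of the generic loop body: lookups, calls, temporary dict."""
    results = {name: STATS[name](value) for name in GROUP}
    return tuple([results[name] for name in GROUP])


# Both steps must agree before their timings are worth comparing.
assert inline_step(7) == dispatch_step(7) == (343, 49)

inline_time = timeit.timeit(lambda: inline_step(7), number=100000)
dispatch_time = timeit.timeit(lambda: dispatch_step(7), number=100000)
print("inline:   {:.3f}s".format(inline_time))
print("dispatch: {:.3f}s".format(dispatch_time))
```

If the dispatch version turns out consistently slower on a given machine, that would suggest the overhead sits in the per-iteration bookkeeping rather than in the statistics themselves.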
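I think you are just trying to avoid computing the same statistic more than once. Try this. Note that I am using docopt, so I take the stats as comma-separated lists. You have already solved the argument parsing somehow but did not show us how, so don't worry about that part; it isn't important. The code in parse_args where I build a set of unique stat names is probably the key.

"""
Usage: calcstats (--analyses <STAT>,...) ... <file> ...

Options:
  <file>                    One or more input filenames
  -a,--analyses <STAT> ...  One or more stat names to compute
"""
import docopt
import time

_Sequence = 0
_Results = {}

def compute_stat(name):
    # Placeholder for the real computation: sleep to simulate work,
    # then record a sequence number as the "result".
    global _Sequence, _Results
    print("Performing analysis: {}".format(name))
    time.sleep(1)
    _Sequence += 1
    _Results[name] = _Sequence

def display_results(groups):
    # Results are looked up by name, so a stat shared by several
    # groups is shown in each group without being recomputed.
    for groupnum, grp in enumerate(groups, 1):
        print("*** Group {}:".format(groupnum))
        for stat in grp:
            print("\t{}: {}".format(stat, _Results[stat]))
        print("\n")

def parse_args():
    args = docopt.docopt(__doc__)
    # Each --analyses value is a comma-separated list of stat names.
    args['--analyses'] = [stat.split(',') for stat in args['--analyses']]
    # Collect every requested stat exactly once, regardless of grouping.
    stat_set = set()
    stat_set.update(*args['--analyses'])
    args['STATS.unique'] = stat_set
    return args

def perform_analyses(stat_set):
    for stat in stat_set:
        compute_stat(stat)

if __name__ == '__main__':
    args = parse_args()
    perform_analyses(args['STATS.unique'])
    display_results(args['--analyses'])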
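
If the goal is specifically to keep the speed of a generated loop while leaving the stats as ordinary functions, one possible direction is to generate only the loop as source text and hand the real functions to it through the exec() namespace. The following is a minimal Python 3 sketch under that assumption, not a measured solution: `ALL_STATS`, `build_loop`, and the `_fn_` prefix are hypothetical names, and the three stat functions simply mirror the ones in the question.

```python
import collections


def stat1(value):
    return value ** 3


def stat2(value):
    return value * value


def stat3(value):
    return value + value


# Hypothetical registry: the stats stay ordinary functions, not strings.
ALL_STATS = {'STAT1': stat1, 'STAT2': stat2, 'STAT3': stat3}


def build_loop(groups):
    """Generate a loop specialised for `groups`; only the loop is source text."""
    needed = sorted(set().union(*groups))
    # Names are pasted into source code, so reject anything not in the registry.
    unknown = [name for name in needed if name not in ALL_STATS]
    if unknown:
        raise KeyError("unknown stats: {}".format(unknown))

    lines = ["def run(loops, data):",
             "    for value in range(loops):"]
    # One local variable per unique stat, computed once per iteration.
    for name in needed:
        lines.append("        {0} = _fn_{0}(value)".format(name))
    # One counter update per group; the trailing comma keeps 1-tuples valid.
    for position, group in enumerate(groups):
        key = "".join(name + ", " for name in group)
        lines.append("        data[{}][({})] += 1".format(position, key))

    # The generated code sees the real functions under the _fn_ names.
    namespace = {"_fn_" + name: ALL_STATS[name] for name in needed}
    exec("\n".join(lines), namespace)
    return namespace["run"]


groups = [['STAT1', 'STAT2'], ['STAT3']]
data = [collections.defaultdict(int) for _ in groups]
run = build_loop(groups)
run(5, data)
print(dict(data[0]))
print(dict(data[1]))
```

Each stat still costs one function call per iteration, so this may not fully match a hand-inlined loop. What it removes is the per-iteration dictionary lookups and the list comprehension that builds each key, and because the loop lives inside a function, CPython can use fast local variables for the intermediate results. It also assumes every stat name is a valid Python identifier.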