Slow Python code seems like a suitable fit for itertools: how to optimize?
I have this:
entity_key = 'pid'
data = [ { ... }, { ... } ]
entities = list(set([ row[entity_key] for row in data ]))
parsed = []
total_keys = ['a','b','c']
for entity_id in entities:
    entity_rows = [ row for row in data if row[entity_key] == entity_id ]
    totals = { key: sum(filter(None, [ row.get(key) for row in entity_rows ])) for key in total_keys }
    totals[entity_key] = entity_id
    parsed.append(totals)
return parsed
In my scenario, data has around 30,000 items, so it is large. Each item is a dict, and every dict contains the identifier pid plus a numeric value for each of the keys defined in total_keys, e.g. { 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 }
As you can see, the code returns a list with one row per unique pid, with total columns for each key defined in the total_keys list. There are roughly 800-1000 unique pid values, so parsed ends up with about 800-1000 items.
It is slow. I tried rewriting it with itertools.groupby, but it didn't seem like the right fit. Is there some magic I'm missing?
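Roughly, my groupby attempt had this shape (a sketch, not my exact code; it has to sort first because groupby only merges consecutive equal keys):
from itertools import groupby
from operator import itemgetter

data.sort(key=itemgetter(entity_key))            # groupby needs equal keys to be adjacent
parsed = []
for entity_id, rows in groupby(data, key=itemgetter(entity_key)):
    totals = {key: 0 for key in total_keys}
    for row in rows:
        for key in total_keys:
            totals[key] += row.get(key) or 0     # like filter(None, ...): skip missing/None values
    totals[entity_key] = entity_id
    parsed.append(totals)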
Have you tried Pandas? With pid as a column, it looks like a good fit:
import pandas as pd
df = pd.DataFrame(data)  # build the frame from your list of dicts
df.groupby(['pid']).sum()
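If you need the result back in the same list-of-dicts shape as parsed, the grouped frame flattens out again; a sketch assuming df is built as above:
parsed = df.groupby('pid', as_index=False).sum().to_dict('records')
# -> [{'pid': 5011, 'a': ..., 'b': ..., 'c': ...}, ...]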
Because the loop rescans all of data for every entity, you end up with an O(n^2) algorithm. If you build an indexed data structure instead, you can improve performance dramatically:
entity_key = 'pid'
data = [ { ... }, { ... } ]
totals_keys = ['a','b','c']
parsed = []
indexed = {}
for row in data:                        # construct a map of data rows, indexed by id
    entity_id = row[entity_key]
    indexed.setdefault(entity_id, [])   # start with an empty list
    indexed[entity_id].append(row)
for entity_id in indexed:               # iterate over the index we just built
    entity_rows = indexed[entity_id]    # fast lookup of matching ids
    totals = { key: sum(row[key] for row in entity_rows if key in row) for key in totals_keys }
    totals[entity_key] = entity_id
    parsed.append(totals)
return parsed
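As a side note, the indexing pass can be written more compactly with collections.defaultdict; a minimal sketch of the same idea:
from collections import defaultdict

indexed = defaultdict(list)    # missing keys start out as empty lists automatically
for row in data:
    indexed[row[entity_key]].append(row)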
Create a dict using the pids as the outer keys:
entity_key = 'pid'
data = [ { 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 },{ 'pid': 5012, 'a': 3, 'b': 20, 'c': 33 },
{ 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 },{ 'pid': 5012, 'a': 3, 'b': 20, 'c': 33 }]
from collections import defaultdict
totals = ["a", "b", "c"]
dfd = defaultdict(lambda: {"a": 0, "b": 0, "c": 0})
for d in data:
    for k in d.keys() & totals:      # intersect each row's keys with the totals keys
        dfd[d["pid"]][k] += d[k]
The output groups by pid, summing the values of whichever a, b or c keys are present:
defaultdict(<function <lambda> at 0x7f2cf93ed2f0>,
{5011: {'a': 6, 'c': 66, 'b': 40}, 5012: {'a': 6, 'c': 66, 'b': 40}})
For python2 you need to use uni = d.viewkeys() & totals instead.
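If you want the question's parsed format back (a flat list of dicts including the pid), the grouped totals flatten easily; a sketch:
parsed = [dict(v, pid=k) for k, v in dfd.items()]   # dict(v, pid=k) copies v and adds the pid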
If your data is actually already grouped, you can yield one group at a time:
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
def yield_d(data, group_key, keys):
    for k, v in groupby(data, key=itemgetter(group_key)):
        d = defaultdict(lambda: dict.fromkeys(keys, 0))   # every value starts at 0
        for dct in v:
            for _k in dct.keys() & keys:                  # only sum the keys we care about
                d[k][_k] += dct[_k]
        yield d
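Since groupby only merges consecutive equal keys, rows must already be contiguous by pid (sort first otherwise). A usage sketch:
from operator import itemgetter

data.sort(key=itemgetter('pid'))    # skip this if the rows are already grouped
for group in yield_d(data, 'pid', ['a', 'b', 'c']):
    print(dict(group))              # e.g. {5011: {'a': 6, 'b': 40, 'c': 66}}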