Efficient way to convert a dictionary of lists to paired lists of keys and values
I have a dictionary of lists like the one below (it can have more than 1M elements; also assume the dictionary is sorted by key):
import scipy.sparse as sp
d = {0: [0,1], 1: [1,2,3],
2: [3,4,5], 3: [4,5,6],
4: [5,6,7], 5: [7],
6: [7,8,9]}
I would like to know the most efficient way (the fastest for a large dictionary) to convert it into lists of row and column indices:
r_index = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6]
c_index = [0, 1, 1, 2, 3, 3, 4, 5, 4, 5, 6, 5, 6, 7, 7, 7, 8, 9]
Here are the solutions I have so far:
Using iteration
row_ind = [k for k, v in d.iteritems() for _ in range(len(v))] # or d.items() in Python 3
col_ind = [i for ids in d.values() for i in ids]
Using the pandas library
import pandas as pd
df = pd.DataFrame.from_dict(d, orient='index')
df = df.stack().reset_index()
row_ind = list(df['level_0'])
col_ind = list(df[0])
Using itertools
import itertools
import numpy as np
indices = [(x,y) for x, y in itertools.chain.from_iterable([itertools.product((k,), v) for k, v in d.items()])]
indices = np.array(indices)
row_ind = indices[:, 0]
col_ind = indices[:, 1]
I'm not sure which of these is the fastest when the dictionary has a lot of elements. Thanks!
You can change the input size of this benchmark:
import time

# The original was written for Python 2 (xrange, iteritems, print statement);
# this is the Python 3 equivalent.
l = range(10000)
x = dict([(k, list(l)) for k in range(1000)])

def f(d):
    # plain list comprehensions
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = [i for ids in d.values() for i in ids]

def ff(d):
    # pandas: stack the frame and read the indices back
    import pandas as pd
    df = pd.DataFrame.from_dict(d, orient='index')
    df = df.stack().reset_index()
    row_ind = list(df['level_0'])
    col_ind = list(df[0])

def fff(d):
    # itertools: build (key, value) pairs, then split the columns with numpy
    import itertools
    import numpy as np
    indices = [(x, y) for x, y in itertools.chain.from_iterable(
        [itertools.product((k,), v) for k, v in d.items()])]
    indices = np.array(indices)
    row_ind = indices[:, 0]
    col_ind = indices[:, 1]

alternatives = [f, ff, fff]
for func in alternatives:
    begin = time.time()
    func(x)
    print(time.time() - begin)
Output (seconds, in the order f, ff, fff):
0.977538108826
5.26920008659
6.98472499847
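With this sample size, the first method looks best. However, if you have time to pick other sample sizes and wait for the runs to finish, you may see different results. It may still be best to use a library.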
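To try other input sizes more systematically, something like the following might help. It is only a minimal sketch: `make_dict`, `comprehension` and `best_time` are names I made up for illustration, and it times only the list-comprehension approach from the question. It uses `timeit.repeat` from the standard library and keeps the minimum of a few runs to reduce noise.

```python
import timeit


def make_dict(n_keys, n_values):
    # Hypothetical helper: n_keys keys, each mapped to a list of n_values ints.
    values = list(range(n_values))
    return {k: list(values) for k in range(n_keys)}


def comprehension(d):
    # The plain list-comprehension approach from the question.
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = [i for ids in d.values() for i in ids]
    return row_ind, col_ind


def best_time(func, d, repeats=3):
    # Run func(d) once per repeat and keep the fastest run.
    return min(timeit.repeat(lambda: func(d), number=1, repeat=repeats))


# Sizes are arbitrary; adjust them to match your real data.
for n_keys, n_values in [(10, 100), (100, 1000), (1000, 1000)]:
    d = make_dict(n_keys, n_values)
    print(n_keys, n_values, best_time(comprehension, d))
```

Any of the other candidates can be passed to `best_time` in place of `comprehension`.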
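The first rule of thumb for optimization in Python is to make sure the innermost loop is handed off to some library function. This applies only to CPython; PyPy is a completely different story.
In your case, using extend gives a significant speedup.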
import time

l = range(10000)
x = dict([(k, list(l)) for k in range(1000)])

def org(d):
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = [i for ids in d.values() for i in ids]

def ext(d):
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = []
    for ids in d.values():
        col_ind.extend(ids)

def ext_both(d):
    row_ind = []
    for k, v in d.items():
        row_ind.extend([k] * len(v))
    col_ind = []
    for ids in d.values():
        col_ind.extend(ids)

functions = [org, ext, ext_both]
for func in functions:
    begin = time.time()
    func(x)
    elapsed = time.time() - begin
    print(func.__name__ + ": " + str(elapsed))
Output when using Python 2:
org: 0.512559890747
ext: 0.340406894684
ext_both: 0.149670124054
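Along the same lines, the standard library's `itertools` can take over both inner loops. This is just a sketch of the idea, not something I have benchmarked against `extend`; `chain_both` is a name made up for this example, and it relies on the dictionary keeping its key order (true for regular dicts in Python 3.7+).

```python
from itertools import chain, repeat


def chain_both(d):
    # repeat(k, len(v)) yields the key once per value, so there is no
    # Python-level inner loop over the values.
    row_ind = list(chain.from_iterable(repeat(k, len(v)) for k, v in d.items()))
    # chain.from_iterable flattens the value lists into a single sequence.
    col_ind = list(chain.from_iterable(d.values()))
    return row_ind, col_ind


d = {0: [0, 1], 1: [1, 2, 3], 2: [3, 4, 5], 3: [4, 5, 6],
     4: [5, 6, 7], 5: [7], 6: [7, 8, 9]}
row_ind, col_ind = chain_both(d)
print(row_ind)
print(col_ind)
```

On the small dictionary from the question this gives the same `r_index` and `c_index` lists as shown there. Whether it beats `ext_both` on your data is something to measure.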
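There is a feature called a decorator. A decorator always goes directly above a def or class definition. In your code, use import timer and @timer.Timer(), or something similar, to time each function. You can Google for more, or go to this link: https://wiki.python.org/moin/PythonDecorators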
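As far as I know the `timer` module mentioned above is not part of the standard library, so here is a rough sketch of a timing decorator you could write yourself. The name `timed` is hypothetical; it uses only `functools.wraps` and `time.perf_counter`, and it is applied to a variant of `ext_both` that returns its results.

```python
import functools
import time


def timed(func):
    # Hypothetical decorator (not a library API): prints how long each call takes.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        begin = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - begin
        print(func.__name__ + ": " + str(elapsed))
        return result
    return wrapper


@timed
def ext_both(d):
    # Same idea as ext_both above, but done in one pass and returning the lists.
    row_ind = []
    col_ind = []
    for k, v in d.items():
        row_ind.extend([k] * len(v))
        col_ind.extend(v)
    return row_ind, col_ind


rows, cols = ext_both({0: [0, 1], 1: [1, 2, 3]})
```

Each call to the decorated function prints its name and elapsed time, so the manual `begin`/`elapsed` bookkeeping in the benchmark loop is no longer needed.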