Joining two lists of lists into one
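The problem I'm facing is a SQL-like JOIN of two arrays whose "key" consists of two columns, YEAR and MONTH. The two arrays represent income (per year and per month) and, likewise, expenses. I want to join them on the key to produce another array with four columns: YEAR, MONTH, INCOME, EXPENSE.
My two arrays are: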
income = [["2019", "Jan.", 2000],
          ["2019", "Feb.", 1500],
          [ ---- , --- , --- ],
          ["2019", "Dec.", 1200],
          ["2020", "Jan.", 1400],
          [ ---- , --- , --- ],
          ["2020", "Dec.", 1300]]
Expenses = [["2019", "Jan.", 1800],
            ["2019", "Feb.", 1400],
            [ ---- , --- , --- ],
            ["2019", "Dec.", 1100],
            ["2020", "Jan.", 1300],
            [ ---- , --- , --- ],
            ["2020", "Dec.", 1200]]
And the desired result is:
Joined = [["2019", "Jan.", 2000, 1800],
          ["2019", "Feb.", 1500, 1400],
          [ ---- , --- , --- , ---- ],
          ["2019", "Dec.", 1200, 1100],
          ["2020", "Jan.", 1400, 1300],
          [ ---- , --- , --- , ---- ],
          ["2020", "Dec.", 1300, 1200]]
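How should I go about it? A list comprehension? Loops? What is the Pythonic way?
Just use pandas: convert your lists (income and Expenses) to DataFrames, merge them (in this case it is basically an inner join on Year and Month), then convert the resulting DataFrame back to a list of lists.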
import pandas as pd

df1 = pd.DataFrame(income, columns=["Year", "Month", "X"])
df2 = pd.DataFrame(Expenses, columns=["Year", "Month", "Y"])
# merge() defaults to an inner join on the given key columns
joined = df1.merge(df2, on=["Year", "Month"]).values.tolist()
Output:
[['2019', 'Jan.', 2000, 1800], ['2019', 'Feb.', 1500, 1400], ['2019', 'Dec.', 1200, 1100], ['2020', 'Jan.', 1400, 1300], ['2020', 'Dec.', 1300, 1200]]
PS: In case you are wondering why they are not in the output, I removed all the [ ---- , --- , --- ] rows from both lists.
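For comparison, the same inner join on (Year, Month) can also be sketched with a plain dictionary instead of pandas. This is only an illustration, not part of the answer above: the names `inner_join`, `left`, `right` and `key_len` are made up here, and the sketch assumes each key appears at most once in the right-hand list.

```python
def inner_join(left, right, key_len=2):
    """Inner-join two lists of rows on their first key_len columns.

    Hypothetical helper: assumes each key occurs at most once in `right`.
    """
    # Index the right-hand rows by their key columns.
    lookup = {tuple(row[:key_len]): row[key_len:] for row in right}
    # Keep only the left rows whose key also exists on the right.
    return [
        list(row) + list(lookup[tuple(row[:key_len])])
        for row in left
        if tuple(row[:key_len]) in lookup
    ]


# Sample data: the second list is in a different order and lacks one month.
sample_income = [["2019", "Jan.", 2000], ["2019", "Feb.", 1500], ["2020", "Jan.", 1400]]
sample_expenses = [["2020", "Jan.", 1300], ["2019", "Jan.", 1800]]

joined_sample = inner_join(sample_income, sample_expenses)
print(joined_sample)
```

Because rows are matched by key, the order of the second list does not matter, and rows without a partner are dropped, as in an inner join.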
Without pandas:
import operator
import itertools

def join(*lists, exclude_positions=()):
    """Join list rows

    Example:
        >>> list_a = [["2019","Jan.", 2000],
        ...           ["2019","Feb.", 1500]]
        >>> list_b = [["2019","Jan.", 1800],
        ...           ["2019","Feb.", 1400]]
        >>> join(list_a, list_b, exclude_positions=(0,1))
        [["2019","Jan.", 2000, 1800],
         ["2019","Feb.", 1500, 1400]]

    Args:
        *lists: lists to join
        exclude_positions: positions to exclude from merging. The equivalent
            positions from the first list will be used.
    """
    lists_length = len(lists[0])
    if lists_length == 0:
        return []
    if not all(len(l) == lists_length for l in lists):
        raise ValueError("Lists must have the same length")

    # The first list contributes all of its columns.
    iterators = []
    iterators.append(iter(lists[0]))
    for l in lists[1:]:
        # Later lists contribute only the non-excluded columns.
        columns_taken = [i for i in range(len(l[0])) if i not in exclude_positions]
        if len(columns_taken) == 0:
            continue
        iterator = map(operator.itemgetter(*columns_taken), l)
        if len(columns_taken) == 1:
            # itemgetter with a single index returns a bare value; wrap it in a tuple.
            iterator = ((i,) for i in iterator)
        iterators.append(iterator)
    return [list(itertools.chain.from_iterable(row)) for row in zip(*iterators)]
For the 4, 120 and 1200 rows I checked, my solution is 100 to 500 times faster than the pandas solution:
py -m timeit -s "import temp2" "temp2.join(temp2.income, temp2.Expenses, exclude_positions=(0,1))"
10000 loops, best of 5: 24.2 usec per loop
py -m timeit -s "import pandas as pd; import temp2" "pd.DataFrame(temp2.income, columns=['Year', 'Month', 'X']).merge(pd.DataFrame(temp2.Expenses, columns=['Year', 'Month', 'Y']), on=['Year', 'Month']).values.tolist()"
50 loops, best of 5: 8.42 msec per loop
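This is because I use efficient C-level iterators and functions, create no intermediate data types, and match by row number rather than by key, as per your comment. I also don't need any modules outside the standard library.
A better solution using pandas is to combine the rows positionally, without any lookup: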
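For exactly two lists whose rows are already in the same order, matching by row number can be reduced to a short loop over `zip`. This is a minimal sketch, not the `join` function itself: `join_by_position` and `n_keys` are illustrative names, and the key comparison is an added safeguard that assumes the key columns come first in every row.

```python
def join_by_position(left, right, n_keys=2):
    """Pair rows by index and append the non-key columns of `right`.

    Hypothetical helper: assumes the first n_keys columns are the key.
    """
    if len(left) != len(right):
        raise ValueError("Lists must have the same length")
    joined = []
    for a, b in zip(left, right):
        # Positional matching is only valid if the keys line up row by row.
        if a[:n_keys] != b[:n_keys]:
            raise ValueError(f"Key mismatch: {a[:n_keys]} vs {b[:n_keys]}")
        joined.append(a + b[n_keys:])
    return joined


rows_a = [["2019", "Jan.", 2000], ["2019", "Feb.", 1500]]
rows_b = [["2019", "Jan.", 1800], ["2019", "Feb.", 1400]]
print(join_by_position(rows_a, rows_b))
```

The key check costs a little time per row; it can be dropped when the inputs are known to be aligned.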
a = pd.DataFrame(temp2.income, columns=['Year', 'Month', 'X'])
b = pd.DataFrame(temp2.Expenses, columns=['Year', 'Month', 'Y'])
result = pd.DataFrame({'Year': a['Year'], 'Month': a['Month'], 'X': a['X'], 'Y': b['Y']}).values.tolist()
My solution is still 3 times faster than that, but the pandas version is short and clear.