Joining two lists of lists into one

Question

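The problem in front of me is an SQL-like JOIN of two arrays whose "key" consists of two columns, YEAR and MONTH. The two arrays hold income (per year and month) and, likewise, expenses. I want to join them on the key to produce another array with four columns: YEAR, MONTH, INCOME, EXPENSE.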

My two arrays are:

income = [["2019","Jan.", 2000],
          ["2019","Feb.", 1500],
          [ ---- , ---  , --- ],
          ["2019","Dec.", 1200],
          ["2020","Jan.", 1400],
          [ ---- , ---  , --- ],
          ["2020","Dec.", 1300]]

Expenses = [["2019","Jan.", 1800],
            ["2019","Feb.", 1400],
            [ ---- , ---  , --- ],
            ["2019","Dec.", 1100],
            ["2020","Jan.", 1300],
            [ ---- , ---  , --- ],
            ["2020","Dec.", 1200]]

And the desired result is:

Joined =   [["2019","Jan.", 2000, 1800],
            ["2019","Feb.", 1500, 1400],
            [ ---- , ---  , --- , ----],
            ["2019","Dec.", 1200, 1100],
            ["2020","Jan.", 1400, 1300],
            [ ---- , ---  , --- , ----],
            ["2020","Dec.", 1300, 1200]]

How should I do this? A list comprehension? A loop? What is the pythonic way?

Answer 1

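Just use pandas: convert your lists (income and Expenses) to DataFrames, merge them (in this case it is essentially an inner join on Year and Month), and then convert the resulting DataFrame back to a list of lists.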

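import pandas as pd

# Name the key columns identically in both frames so merge can match on them.
df1 = pd.DataFrame(income, columns=["Year", "Month", "X"])
df2 = pd.DataFrame(Expenses, columns=["Year", "Month", "Y"])
# merge defaults to an inner join; .values.tolist() converts back to a list of lists.
joined = df1.merge(df2, on=["Year", "Month"]).values.tolist()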

Output:

[['2019', 'Jan.', 2000, 1800], ['2019', 'Feb.', 1500, 1400], ['2019', 'Dec.', 1200, 1100], ['2020', 'Jan.', 1400, 1300], ['2020', 'Dec.', 1300, 1200]]

PS: In case you are wondering why they are missing from the output, I removed all the [ ---- , --- , --- ] placeholder rows from both lists.

Answer 2

Without pandas:

import operator
import itertools

def join(*lists, exclude_positions=()):
    """Join list rows
    
    Example:
        >>> list_a = [["2019", "Jan.", 2000],
        ...           ["2019", "Feb.", 1500]]
        >>> list_b = [["2019", "Jan.", 1800],
        ...           ["2019", "Feb.", 1400]]
        >>> join(list_a, list_b, exclude_positions=(0, 1))
        [['2019', 'Jan.', 2000, 1800], ['2019', 'Feb.', 1500, 1400]]

    Args:
        *lists: lists to join
        exclude_positions: positions to exclude from merging. The equivalent
        positions from the first list will be used.
    """
    # Nothing to join: no lists were passed, or the first list has no rows.
    if not lists or len(lists[0]) == 0:
        return []

    lists_length = len(lists[0])

    if not all(len(l) == lists_length for l in lists):
        raise ValueError("Lists must have the same length")

    iterators = []
    iterators.append(iter(lists[0]))
    
    for l in lists[1:]:
        columns_taken = [i for i in range(len(l[0])) if i not in exclude_positions]
        if len(columns_taken) == 0:
            continue
        iterator = map(operator.itemgetter(*columns_taken), l)
        if len(columns_taken) == 1:
            iterator = ((i,) for i in iterator)
    
        iterators.append(iterator)


    return [list(itertools.chain.from_iterable(row)) for row in zip(*iterators)]

For the 4, 120, and 1200 rows I tested, my solution is 100 to 500 times faster than the pandas solution:

py -m timeit -s "import temp2" "temp2.join(temp2.income, temp2.Expenses, exclude_positions=(0,1))"
10000 loops, best of 5: 24.2 usec per loop

py -m timeit -s "import pandas as pd; import temp2" "pd.DataFrame(temp2.income, columns=['Year', 'Month', 'X']).merge(pd.DataFrame(temp2.Expenses, columns=['Year', 'Month', 'Y']), on=['Year', 'Month']).values.tolist()"
50 loops, best of 5: 8.42 msec per loop
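If you would rather run this kind of comparison from a script than from the command line, the standard-library timeit module can be called directly. This is only a rough sketch: the 1200 rows are made-up data, and zip_join is a hypothetical stand-in for whichever join function you want to measure.

```python
import timeit

# Made-up data for illustration: 1200 rows of [year, month, amount].
income = [[str(2019 + n // 12), f"m{n % 12}", 1000 + n] for n in range(1200)]
expenses = [[y, m, v - 100] for y, m, v in income]

def zip_join():
    # Hypothetical function under test: positional join of aligned rows.
    return [i + e[2:] for i, e in zip(income, expenses)]

# Best of 5 repeats of 200 calls each, reported per call in microseconds.
best = min(timeit.repeat(zip_join, number=200, repeat=5)) / 200
print(f"{best * 1e6:.1f} usec per call")
```

Taking the minimum over the repeats mirrors the "best of 5" figure that the command-line tool prints.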

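This is because I use efficient C-level iterators and functions, create no intermediate data types, and, following your comment, match rows by row number rather than by key. I also need no modules outside the standard library.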
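If the rows are not guaranteed to line up by position, matching by key is still possible with the standard library alone. Here is a minimal sketch of a dict-based inner join. The name join_on_key and its key_len parameter are made up for this illustration, and it assumes each key appears at most once in the right-hand list.

```python
def join_on_key(left, right, key_len=2):
    """Inner-join two lists of rows on their first key_len columns."""
    # Index the right-hand rows by their key columns (e.g. year, month).
    right_by_key = {tuple(row[:key_len]): row[key_len:] for row in right}
    # Keep left-hand order; skip rows whose key has no match on the right.
    return [
        list(row) + list(right_by_key[tuple(row[:key_len])])
        for row in left
        if tuple(row[:key_len]) in right_by_key
    ]

income = [["2019", "Jan.", 2000], ["2019", "Feb.", 1500], ["2020", "Jan.", 1400]]
expenses = [["2020", "Jan.", 1300], ["2019", "Jan.", 1800]]  # shuffled, Feb. missing

joined = join_on_key(income, expenses)
print(joined)
# [['2019', 'Jan.', 2000, 1800], ['2020', 'Jan.', 1400, 1300]]
```

The dict lookup makes this roughly linear in the number of rows, and unmatched rows are dropped, which is how an inner join behaves.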

A better solution using pandas is to combine the rows positionally, without any key lookup:

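import pandas as pd
import temp2  # the module holding income and Expenses, as in the timings above

a = pd.DataFrame(temp2.income, columns=['Year', 'Month', 'X'])
b = pd.DataFrame(temp2.Expenses, columns=['Year', 'Month', 'Y'])
# Columns are aligned by row index, so no key matching takes place.
result = pd.DataFrame({'Year': a['Year'], 'Month': a['Month'], 'X': a['X'], 'Y': b['Y']}).values.tolist()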

My solution is still 3 times faster than that one, but the pandas version is short and clear.
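For the two-list case in the question, row-number matching can also be written as a single list comprehension over zip. This is a sketch, not a replacement for join above: join_by_position is a name invented here, and it assumes both lists are already in the same order, which it checks before joining.

```python
def join_by_position(left, right, key_len=2):
    """Append the non-key columns of right to the rows of left, row by row."""
    if len(left) != len(right):
        raise ValueError("Lists must have the same length")
    # Positional matching is only valid if the key columns line up.
    if any(l[:key_len] != r[:key_len] for l, r in zip(left, right)):
        raise ValueError("Key columns differ between the two lists")
    return [l + r[key_len:] for l, r in zip(left, right)]

income = [["2019", "Jan.", 2000], ["2019", "Feb.", 1500]]
expenses = [["2019", "Jan.", 1800], ["2019", "Feb.", 1400]]

joined = join_by_position(income, expenses)
print(joined)
# [['2019', 'Jan.', 2000, 1800], ['2019', 'Feb.', 1500, 1400]]
```

The key check costs an extra pass over the data, so drop it if you already know the rows are aligned.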


python

inner-join

python-3.x