Joining two lists of lists into one

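The problem I'm facing is a SQL-like JOIN of two arrays whose "key" consists of two columns, YEAR and MONTH. The two arrays hold income (per year and month) and, likewise, expenses. I want to join them on that key to produce another array with four columns: YEAR, MONTH, INCOME, EXPENSE.

My two arrays are: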

income = [["2019","Jan.", 2000],
          ["2019","Feb.", 1500],
          [ ---- , ---  , --- ],
          ["2019","Dec.", 1200],
          ["2020","Jan.", 1400],
          [ ---- , ---  , --- ],
          ["2020","Dec.", 1300]]

Expenses = [["2019","Jan.", 1800],
            ["2019","Feb.", 1400],
            [ ---- , ---  , --- ],
            ["2019","Dec.", 1100],
            ["2020","Jan.", 1300],
            [ ---- , ---  , --- ],
            ["2020","Dec.", 1200]]

And the desired result is:

Joined =   [["2019","Jan.", 2000, 1800],
            ["2019","Feb.", 1500, 1400],
            [ ---- , ---  , --- , ----],
            ["2019","Dec.", 1200, 1100],
            ["2020","Jan.", 1400, 1300],
            [ ---- , ---  , --- , ----],
            ["2020","Dec.", 1300, 1200]]

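How should I go about this? A list comprehension? A loop? What is the Pythonic way?

Just use pandas: convert your lists (income and Expenses) to DataFrames, merge them (in this case that is essentially an inner join on Year and Month), then convert the resulting DataFrame back to a list of lists.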

import pandas as pd

df1 = pd.DataFrame(income, columns=["Year", "Month", "X"])
df2 = pd.DataFrame(Expenses, columns=["Year", "Month", "Y"])
# merge() defaults to an inner join on the given key columns
joined = df1.merge(df2, on=["Year", "Month"]).values.tolist()

Output:

[['2019', 'Jan.', 2000, 1800], ['2019', 'Feb.', 1500, 1400], ['2019', 'Dec.', 1200, 1100], ['2020', 'Jan.', 1400, 1300], ['2020', 'Dec.', 1300, 1200]]

PS: In case you're wondering why they are missing from the output, I removed all the [ ---- , --- , --- ] placeholder rows from both lists.

Without pandas:

import operator
import itertools

def join(*lists, exclude_positions=()):
    """Join list rows
    
    Example:
        >>> list_a = [["2019", "Jan.", 2000],
        ...           ["2019", "Feb.", 1500]]
        >>> list_b = [["2019", "Jan.", 1800],
        ...           ["2019", "Feb.", 1400]]
        >>> join(list_a, list_b, exclude_positions=(0, 1))
        [['2019', 'Jan.', 2000, 1800], ['2019', 'Feb.', 1500, 1400]]

    Args:
        *lists: lists to join
        exclude_positions: positions to exclude from merging. The equivalent
        positions from the first list will be used.
    """
    lists_length = len(lists[0])

    if lists_length == 0:
        return []

    if not all(len(l) == lists_length for l in lists):
        raise ValueError("Lists must have the same length")

    iterators = []
    iterators.append(iter(lists[0]))
    
    for l in lists[1:]:
        columns_taken = [i for i in range(len(l[0])) if i not in exclude_positions]
        if len(columns_taken) == 0:
            continue
        iterator = map(operator.itemgetter(*columns_taken), l)
        if len(columns_taken) == 1:
            iterator = ((i,) for i in iterator)
    
        iterators.append(iterator)


    return [list(itertools.chain.from_iterable(row)) for row in zip(*iterators)]
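
For just the two lists in the question, the same row-by-row idea can be sketched more briefly as a list comprehension over zip(). This is a minimal sketch, not a replacement for the function above: it assumes both lists have the same length, list the same year/month keys in the same order, and keep the key in the first two columns. The lowercase name expenses is used here only for illustration.

```python
# Sample rows in the shape used in the question (values copied from it).
income = [["2019", "Jan.", 2000],
          ["2019", "Feb.", 1500]]
expenses = [["2019", "Jan.", 1800],
            ["2019", "Feb.", 1400]]

# Pair rows by position and append the non-key columns of the second row.
# Assumption: both lists line up row for row; no key comparison is done.
joined = [inc + exp[2:] for inc, exp in zip(income, expenses)]

print(joined)
# [['2019', 'Jan.', 2000, 1800], ['2019', 'Feb.', 1500, 1400]]
```

Note that zip() silently stops at the shorter list, so a length check like the one in join() is still worth doing first if the inputs might differ.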

For the sizes I checked (4, 120 and 1200 rows), my solution is 100 to 500 times faster than the pandas solution:

py -m timeit -s "import temp2" "temp2.join(temp2.income, temp2.Expenses, exclude_positions=(0,1))"
10000 loops, best of 5: 24.2 usec per loop

py -m timeit -s "import pandas as pd; import temp2" "pd.DataFrame(temp2.income, columns=['Year', 'Month', 'X']).merge(pd.DataFrame(temp2.Expenses, columns=['Year', 'Month', 'Y']), on=['Year', 'Month']).values.tolist()"
50 loops, best of 5: 8.42 msec per loop

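This is because I use efficient C-level iterators and functions, create no intermediate data types, and match by row number rather than by key, as per your comment. That only works if both lists contain the same rows in the same order. I also don't need any module outside the standard library.

A better solution using pandas is to combine the rows positionally, without any key lookup: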
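
If the rows are not guaranteed to line up, matching by key is still possible with only the standard library. The following is a rough sketch of an inner join using a dict lookup; the function name join_by_key and its key_len parameter are hypothetical, and the "Mar." row is made-up sample data added only to show an unmatched key.

```python
def join_by_key(left, right, key_len=2):
    """Inner-join two lists of rows on their first key_len columns."""
    # Index the right-hand rows by their key tuple for constant-time lookups.
    # If a key occurs more than once in right, the last occurrence wins.
    lookup = {tuple(row[:key_len]): row[key_len:] for row in right}
    # Keep only left rows whose key also exists on the right.
    return [row + lookup[tuple(row[:key_len])]
            for row in left
            if tuple(row[:key_len]) in lookup]


income = [["2019", "Jan.", 2000],
          ["2019", "Feb.", 1500],
          ["2019", "Mar.", 1700]]   # illustrative row with no matching expense
# Deliberately in a different order, and without a "Mar." row.
expenses = [["2019", "Feb.", 1400],
            ["2019", "Jan.", 1800]]

joined = join_by_key(income, expenses)
print(joined)
# [['2019', 'Jan.', 2000, 1800], ['2019', 'Feb.', 1500, 1400]]
```

The result follows the order of the left list, and rows without a partner are dropped, which mirrors what an inner join does. This costs a dict build plus one lookup per row, so it will likely be slower than the purely positional version.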

a = pd.DataFrame(temp2.income, columns=['Year', 'Month', 'X'])
b = pd.DataFrame(temp2.Expenses, columns=['Year', 'Month', 'Y'])
result = pd.DataFrame({'Year': a['Year'], 'Month': a['Month'], 'X': a['X'], 'Y': b['Y']}).values.tolist()

My solution is still about 3 times faster than that, but the pandas version is short and clear.