Matching, then grouping list elements

I have parsed a text file and extracted the relevant data. I then combined the variables (dlOrbit2, imageId3, imageStart4, imageEnd4) to create a series of lists, each holding 4 strings.

combined = str(','.join([dlOrbit2, imageId3, imageStart4, imageEnd4]))
strSplit = combined.split(',')

print(strSplit)

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53']
['46290', '514628', '2016-10-26 13:12:54', '2016-10-26 13:13:13']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

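I want to match and group the elements on the first column. So, 46284 x 4, 46288 x 6, 46290 x 2, 46291 x 4. Within each of those groups, I want the earliest time from element 2 and the latest time from element 3. The desired output would therefore be: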

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:54:57']
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:13:13']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

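Each list will always contain 4 elements, but the number of rows in each group (first column) will vary.

I am going to export these results to a CSV file. However, I only need help with the part above.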

Using pandas:

import pandas as pd

dat = [['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

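# Drop the exact duplicate rows first
df = pd.DataFrame(dat).drop_duplicates()
# Earliest start (column 2) and latest end (column 3) per orbit (column 0)
df_times = df.groupby([0]).agg({2: 'min', 3: 'max'}).reset_index()
# Merge back on orbit and start time to recover the image id (column 1)
df_times.merge(df, on=[0, 2])[[0, 1, 2, '3_x']]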

Output:

0   46284   514607  2016-10-26 02:43:46 2016-10-26 02:48:39
1   46288   514626  2016-10-26 09:48:26 2016-10-26 09:54:57
2   46290   514628  2016-10-26 13:12:34 2016-10-26 13:13:13
3   46291   514738  2016-10-26 14:56:39 2016-10-26 14:59:06

You can make use of groupby and tee from itertools:

data = [
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
]


from itertools import groupby, tee
import pprint

res = []
for k, g in groupby(data, key=lambda x: x[0]):
    it1, it2, it3 = tee(g, 3)
    res.append(next(it1)[:2] + [min(x[2] for x in it2), max(x[3] for x in it3)])

pprint.pprint(res)

Output:

[['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
 ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:54:57'],
 ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:13:13'],
 ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

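for k, g in groupby(data, key=lambda x: x[0]) groups consecutive rows by the first column. On each iteration it yields a tuple whose first item is the key used for grouping and whose second item is an iterator over the items in that group.

it1, it2, it3 = tee(g, 3) splits the group iterator into three independent iterators, each of which returns exactly the same items. Finally, the result is constructed by taking the first two columns from the first item of the group and running min and max over the other two iterators.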
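One caveat with the approach above: groupby only merges rows that are adjacent, so if the rows were not already ordered by the first column, the same key could end up in more than one group. Here is a minimal sketch, assuming the input may be unsorted, that sorts first and materializes each group as a list instead of using tee. The helper name summarize and the shuffled sample rows are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def summarize(rows):
    """Collapse rows to one per orbit: first image id, earliest start, latest end."""
    result = []
    # sorted() is stable, so rows keep their original relative order within each key.
    ordered = sorted(rows, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        rows_in_group = list(group)  # a list can be traversed more than once
        result.append([
            key,
            rows_in_group[0][1],
            min(r[2] for r in rows_in_group),
            max(r[3] for r in rows_in_group),
        ])
    return result


# Hypothetical, deliberately unsorted sample
sample = [
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
]
print(summarize(sample))
```

Note that the image id kept for each group is simply the one from the first row seen for that key, which here is 514663 for orbit 46288 because of the shuffled input order.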

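I'm new to Python myself, and before reaching for the Big Hammers I like to see examples that use basic Python features.

If it can be done in under a dozen lines of code without any module imports, that is the version I would want to learn first.

Perhaps the sticking point is manipulating a list of lists with double indexing?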

combined = [['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'], ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

combined[0][0]    # double index
Out[28]: '46284'

combined[2][2:]   # slice
Out[29]: ['2016-10-26 02:43:46', '2016-10-26 02:48:39']

max(combined[2][2:])    # duck type order comparison
Out[30]: '2016-10-26 02:48:39'
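The string comparison works here because the timestamps are zero-padded and written in 'YYYY-MM-DD HH:MM:SS' order, so alphabetical order matches chronological order. A small sketch to check that assumption against real datetime parsing (the variable names are illustrative only):

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'
stamps = ['2016-10-26 09:53:46', '2016-10-26 02:48:39', '2016-10-26 13:12:34']

# Plain text comparison
latest_as_text = max(stamps)
# Comparison after parsing each string into a datetime
latest_as_time = max(stamps, key=lambda s: datetime.strptime(s, FMT))

print(latest_as_text == latest_as_time)
```

With a format such as '26/10/2016' or with unpadded hours, the text comparison would likely stop being reliable, and parsing first would be the safer choice.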

Why not define a function that applies these basic Python tools to the input list before grouping?
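As a rough sketch of what such a function might look like, using only a dict, indexing, and the built-in min and max: the name group_rows is hypothetical, and the code assumes every row has exactly 4 elements, as stated in the question.

```python
def group_rows(rows):
    """Return one row per first-column value: [key, first id, earliest start, latest end]."""
    grouped = {}  # key -> summary row built so far
    order = []    # keys in order of first appearance
    for key, image_id, start, end in rows:
        if key not in grouped:
            grouped[key] = [key, image_id, start, end]
            order.append(key)
        else:
            summary = grouped[key]
            summary[2] = min(summary[2], start)  # earliest start so far
            summary[3] = max(summary[3], end)    # latest end so far
    return [grouped[key] for key in order]


rows = [
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
]
for row in group_rows(rows):
    print(row)
```

Because the dict is keyed on the first column, this version does not depend on rows with the same key being adjacent, and the input list is left unmodified.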