Pandas Dataframe - How to calculate the difference between the first row and last row and sum it in reoccurring groups?

I have a series of data to process, as follows:

Here is a simplified version of tableList:

Category      CategoryName     time(s)      Power       Vapor
   1               A          1625448301   593233.36    3353.92
   1               A          1625449552   595156.24    3286.8
   1               A          1625450802   593833.36    3855.42
   2               B          1625452051   595233.37    3353.95
   2               B          1625453301   593535.86    3252.92
   2               B          1625454552   593473.36    3364.15
   3               C          1625455802   593754.32    3233.92
   3               C          1625457052   593153.46    3563.52
   3               C          1625458301   593854.56    3334.94
   4               D          1625459552   593345.75    3353.36
   4               D          1625460802   592313.24    3674.95
   4               D          1625460802   592313.24    3673.35
   1               A          1625463301   597313.23    3658.46
   1               A          1625464552   595913.68    3789.45
   ....

The data is divided by category, and the occurrences of a category are not always contiguous.
Note: the time(s) column contains datetimes in Unix format.

This is the planned result I want to achieve:

Category      CategoryName    TotalTime(s)           Power           Vapor
       1           A          (Total time diff 1) (Power SUM 1)    (Vapor SUM 1)
       2           B          (Total time diff 2) (Power SUM 2)    (Vapor SUM 2)
       3           C          (Total time diff 3) (Power SUM 3)    (Vapor SUM 3)
       4           D          (Total time diff 4) (Power SUM 4)    (Vapor SUM 4)

The data is grouped by category, and the sums of Power and Vapor can simply be obtained by using the SUM function on the grouped categories. What I am stuck on is calculating the total time.

For example, in the first occurrence of category 1, the difference between the last and the first row is 2501 (1625450802 - 1625448301).

In the next occurrence of category 1, the difference between the last and the first row is 2600. All the difference values combined make up Total time diff 1.

I have tried using pd.diff() as well as the following from another question:

tableDf['TotalTime(s)'] = tableDf.groupby('Category')['time(s)'].transform(lambda x: x.iat[-1] - x.iat[0])

But all of these methods only compute the difference between the very last and the very first row of category 1, which leads to a wrong total time.
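
As a quick check on the sample rows shown above, that transform spans all category-1 rows at once instead of each occurrence separately (a minimal sketch, assuming the sample is loaded into tableDf):

tableDf.groupby('Category')['time(s)'].transform(lambda x: x.iat[-1] - x.iat[0])
# every category-1 row gets 1625464552 - 1625448301 = 16251,
# instead of 2501 and 1251 for the two separate occurrences shown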

Any solution or suggestion to calculate the difference between the last and the first row of each occurrence of a category?

Just to provide an alternative option, based on convtools:

from convtools import conversion as c
from convtools.contrib.tables import Table


# this is an ad hoc converter function; consider generating it once and reusing
# further
converter = (
    c.chunk_by(c.item("Category"))
    .aggregate(
        {
            "Category": c.ReduceFuncs.First(c.this).item("Category"),
            "CategoryName": c.ReduceFuncs.First(c.this).item("CategoryName"),
            "TotalTime(s)": (
                c.ReduceFuncs.Last(c.this).item("time(s)")
                - c.ReduceFuncs.First(c.this).item("time(s)")
            ),
            "Power": c.ReduceFuncs.Sum(c.item("Power")),
            "Vapor": c.ReduceFuncs.Sum(c.item("Vapor")),
        }
    )
    .gen_converter()
)

column_types = {
    "time(s)": int,
    "Power": float,
    "Vapor": float,
}

# this is iterable, so can be consumed only once
prepared_rows_iter = (
    Table.from_csv("tmp4.csv", header=True)
    # casting column types
    .update(
        **{
            column_name: c.col(column_name).as_type(column_type)
            for column_name, column_type in column_types.items()
        }
    ).into_iter_rows(dict)
)

# if list of dicts is needed
result = list(converter(prepared_rows_iter))
assert result == [
    { "Category": "1", "CategoryName": "A", "TotalTime(s)": 2501, "Power": 1782222.96, "Vapor": 10496.14, },
    { "Category": "2", "CategoryName": "B", "TotalTime(s)": 2501, "Power": 1782242.5899999999, "Vapor": 9971.02, },
    { "Category": "3", "CategoryName": "C", "TotalTime(s)": 2499, "Power": 1780762.3399999999, "Vapor": 10132.380000000001, },
    { "Category": "4", "CategoryName": "D", "TotalTime(s)": 1250, "Power": 1777972.23, "Vapor": 10701.66, },
    { "Category": "1", "CategoryName": "A", "TotalTime(s)": 1251, "Power": 1193226.9100000001, "Vapor": 7447.91, },
]

# if csv file is needed
# Table.from_rows(converter(prepared_rows_iter)).into_csv("out.csv")

Here is a solution with datar, which re-imagines pandas' APIs:

Construct the data
>>> from datar.all import f, tribble, group_by, summarise, first, last, sum, relocate
[2022-03-23 10:11:46][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>> 
>>> df = tribble(
...     f.Category,  f.CategoryName, f["time(s)"], f.Power,   f.Vapor,
...     1,           "A",            1625448301,   593233.36, 353.92,
...     1,           "A",            1625449552,   595156.24, 286.8,
...     1,           "A",            1625450802,   593833.36, 855.42,
...     2,           "B",            1625452051,   595233.37, 353.95,
...     2,           "B",            1625453301,   593535.86, 252.92,
...     2,           "B",            1625454552,   593473.36, 364.15,
...     3,           "C",            1625455802,   593754.32, 233.92,
...     3,           "C",            1625457052,   593153.46, 563.52,
...     3,           "C",            1625458301,   593854.56, 334.94,
...     4,           "D",            1625459552,   593345.75, 353.36,
...     4,           "D",            1625460802,   592313.24, 674.95,
...     4,           "D",            1625460802,   592313.24, 673.35,
... )

Manipulate the data
>>> (
...     df 
...     >> group_by(f.Category) 
...     >> summarise(
...         Power=sum(f.Power),
...         Vapor=sum(f.Vapor),
...         CategoryName=first(f.CategoryName),
...         **{
...             "TotalTime(s)": last(f["time(s)"]) - first(f["time(s)"]),
...         }
...     ) 
...     >> relocate(f.CategoryName, f["TotalTime(s)"], _after=f.Category)
... )
   Category CategoryName  TotalTime(s)       Power     Vapor
    <int64>     <object>       <int64>   <float64> <float64>
0         1            A          2501  1782222.96   1496.14
1         2            B          2501  1782242.59    971.02
2         3            C          2499  1780762.34   1132.38
3         4            D          1250  1777972.23   1701.66

You can do this easily in pandas; you just need a little groupby trickery to create groupings out of the consecutive categories and then apply your operations:

consec_groupings = (
    df['Category'].shift()
    .ne(df['Category'])
    .groupby(df['Category']).cumsum()
    .rename('Consec_Category')
)
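
For reference, here is a sketch of what that grouping series looks like on the 14 sample rows above (assuming they are loaded into df): each row gets the occurrence number of its category, so the second block of category 1 is labelled 2:

print(consec_groupings.tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2]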

intermediate = (
    df.groupby(['Category', 'CategoryName', consec_groupings])
    .agg({'time(s)': ['first', 'last'], 'Power': 'sum', 'Vapor': 'sum'})
)

intermediate[('time(s)', 'delta')] = (
    intermediate[('time(s)', 'last')] - intermediate[('time(s)', 'first')]
)

print(intermediate)
                                          time(s)                   Power     Vapor time(s)
                                            first        last         sum       sum   delta
Category CategoryName Consec_Category                                                      
1        A            1                1625448301  1625450802  1782222.96  10496.14    2501
                      2                1625463301  1625464552  1193226.91   7447.91    1251
2        B            1                1625452051  1625454552  1782242.59   9971.02    2501
3        C            1                1625455802  1625458301  1780762.34  10132.38    2499
4        D            1                1625459552  1625460802  1777972.23  10701.66    1250

Then from that intermediate product, you can easily calculate the final output:

final = (
    intermediate[[('time(s)', 'delta'), ('Power', 'sum'), ('Vapor', 'sum')]]
    .droplevel(level=1, axis=1)
    .groupby(['Category', 'CategoryName']).sum()
)

print(final)
                       time(s)       Power     Vapor
Category CategoryName                               
1        A                3752  2975449.87  17944.05
2        B                2501  1782242.59   9971.02
3        C                2499  1780762.34  10132.38
4        D                1250  1777972.23  10701.66
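
If you want the exact column layout from the question, a final rename and reset_index should be enough (a small sketch on top of the answer above):

final = final.rename(columns={'time(s)': 'TotalTime(s)'}).reset_index()
# columns are now: Category, CategoryName, TotalTime(s), Power, Vapor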