Pandas Dataframe - How to calculate the difference between the first row and last row and sum it per reoccurring group?
My data processing pipeline is as follows:
- I have two lists that contain the data I need.
- I append those lists into a new list. [tableList]
- I convert that list into a dataframe and export it to a csv file. [tableDf]
Here is the simplified content of tableList:
Category CategoryName time(s) Power Vapor
1 A 1625448301 593233.36 3353.92
1 A 1625449552 595156.24 3286.8
1 A 1625450802 593833.36 3855.42
2 B 1625452051 595233.37 3353.95
2 B 1625453301 593535.86 3252.92
2 B 1625454552 593473.36 3364.15
3 C 1625455802 593754.32 3233.92
3 C 1625457052 593153.46 3563.52
3 C 1625458301 593854.56 3334.94
4 D 1625459552 593345.75 3353.36
4 D 1625460802 592313.24 3674.95
4 D 1625460802 592313.24 3673.35
1 A 1625463301 597313.23 3658.46
1 A 1625464552 595913.68 3789.45
....
The data is divided by category, and the occurrences of a category are not always contiguous.
Note: the time column holds datetimes in unix-timestamp format.
This is the planned result I want to achieve:
Category CategoryName TotalTime(s) Power Vapor
1 A (Total time diff 1) (Power SUM 1) (Vapor SUM 1)
2 B (Total time diff 2) (Power SUM 2) (Vapor SUM 2)
3 C (Total time diff 3) (Power SUM 3) (Vapor SUM 3)
4 D (Total time diff 4) (Power SUM 4) (Vapor SUM 4)
The data is grouped by category, and the totals of Power and Vapor are easily obtained with a SUM over the grouped categories.
What I am stuck on is calculating the total time.
For example, in the first occurrence of category 1, the difference between the last and first rows is 2501 (1625450802 - 1625448301).
In the next occurrence of category 1, the difference between the last and first rows is 2600. All the difference values combined make up Total time diff 1.
I have tried pd.diff(),
as well as, from another question,
tableDf['TotalTime(s)'] = tableDf.groupby('Category')['time(s)'].transform(lambda x: x.iat[-1] - x.iat[0])
but all of these approaches only compute the difference between the overall last and first rows of category 1, which yields a wrong total time.
Any solution or suggestion for calculating the difference between the last and first rows of each occurrence of a category?
Just to provide an alternative option, based on convtools:
from convtools import conversion as c
from convtools.contrib.tables import Table

# this is an ad hoc converter function; consider generating it once and
# reusing it further
converter = (
    c.chunk_by(c.item("Category"))
    .aggregate(
        {
            "Category": c.ReduceFuncs.First(c.this).item("Category"),
            "CategoryName": c.ReduceFuncs.First(c.this).item("CategoryName"),
            "TotalTime(s)": (
                c.ReduceFuncs.Last(c.this).item("time(s)")
                - c.ReduceFuncs.First(c.this).item("time(s)")
            ),
            "Power": c.ReduceFuncs.Sum(c.item("Power")),
            "Vapor": c.ReduceFuncs.Sum(c.item("Vapor")),
        }
    )
    .gen_converter()
)
column_types = {
    "time(s)": int,
    "Power": float,
    "Vapor": float,
}

# this is an iterable, so it can be consumed only once
prepared_rows_iter = (
    Table.from_csv("tmp4.csv", header=True)
    # casting column types
    .update(
        **{
            column_name: c.col(column_name).as_type(column_type)
            for column_name, column_type in column_types.items()
        }
    ).into_iter_rows(dict)
)

# if a list of dicts is needed
result = list(converter(prepared_rows_iter))
assert result == [
    {"Category": "1", "CategoryName": "A", "TotalTime(s)": 2501, "Power": 1782222.96, "Vapor": 10496.14},
    {"Category": "2", "CategoryName": "B", "TotalTime(s)": 2501, "Power": 1782242.5899999999, "Vapor": 9971.02},
    {"Category": "3", "CategoryName": "C", "TotalTime(s)": 2499, "Power": 1780762.3399999999, "Vapor": 10132.380000000001},
    {"Category": "4", "CategoryName": "D", "TotalTime(s)": 1250, "Power": 1777972.23, "Vapor": 10701.66},
    {"Category": "1", "CategoryName": "A", "TotalTime(s)": 1251, "Power": 1193226.9100000001, "Vapor": 7447.91},
]

# if a csv file is needed
# Table.from_rows(converter(prepared_rows_iter)).into_csv("out.csv")
Here is a solution with datar, which re-imagines pandas' APIs:
Construct the data:
>>> from datar.all import f, tribble, group_by, summarise, first, last, sum, relocate
[2022-03-23 10:11:46][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>>
>>> df = tribble(
... f.Category, f.CategoryName, f["time(s)"], f.Power, f.Vapor,
... 1, "A", 1625448301, 593233.36, 353.92,
... 1, "A", 1625449552, 595156.24, 286.8,
... 1, "A", 1625450802, 593833.36, 855.42,
... 2, "B", 1625452051, 595233.37, 353.95,
... 2, "B", 1625453301, 593535.86, 252.92,
... 2, "B", 1625454552, 593473.36, 364.15,
... 3, "C", 1625455802, 593754.32, 233.92,
... 3, "C", 1625457052, 593153.46, 563.52,
... 3, "C", 1625458301, 593854.56, 334.94,
... 4, "D", 1625459552, 593345.75, 353.36,
... 4, "D", 1625460802, 592313.24, 674.95,
... 4, "D", 1625460802, 592313.24, 673.35,
... )
Manipulate the data:
>>> (
... df
... >> group_by(f.Category)
... >> summarise(
... Power=sum(f.Power),
... Vapor=sum(f.Vapor),
... CategoryName=first(f.CategoryName),
... **{
... "TotalTime(s)": last(f["time(s)"]) - first(f["time(s)"]),
... }
... )
... >> relocate(f.CategoryName, f["TotalTime(s)"], _after=f.Category)
... )
Category CategoryName TotalTime(s) Power Vapor
<int64> <object> <int64> <float64> <float64>
0 1 A 2501 1782222.96 1496.14
1 2 B 2501 1782242.59 971.02
2 3 C 2499 1780762.34 1132.38
3 4 D 1250 1777972.23 1701.66
You can do this easily in pandas; you just need a little groupby trickery to build groupings out of the consecutive categories, then apply your operations:
consec_groupings = (
    df['Category'].shift()
    .ne(df['Category'])
    .groupby(df['Category']).cumsum()
    .rename('Consec_Category')
)

intermediate = (
    df.groupby(['Category', 'CategoryName', consec_groupings])
    .agg({'time(s)': ['first', 'last'], 'Power': 'sum', 'Vapor': 'sum'})
)

intermediate[('time(s)', 'delta')] = (
    intermediate[('time(s)', 'last')] - intermediate[('time(s)', 'first')]
)
print(intermediate)
time(s) Power Vapor time(s)
first last sum sum delta
Category CategoryName Consec_Category
1 A 1 1625448301 1625450802 1782222.96 10496.14 2501
2 1625463301 1625464552 1193226.91 7447.91 1251
2 B 1 1625452051 1625454552 1782242.59 9971.02 2501
3 C 1 1625455802 1625458301 1780762.34 10132.38 2499
4 D 1 1625459552 1625460802 1777972.23 10701.66 1250
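As a side note, the start-of-run trick behind consec_groupings can be seen on a toy Series (a minimal sketch with made-up values): shift().ne() flags the first row of every consecutive run, and a per-category cumsum() turns those flags into occurrence numbers.

```python
# Minimal sketch (made-up values) of the consecutive-run trick used above:
# shift().ne() is True at the first row of every run, and a per-category
# cumsum() numbers the occurrences of each category.
import pandas as pd

cat = pd.Series([1, 1, 2, 2, 1, 1], name='Category')

run_starts = cat.shift().ne(cat)   # [True, False, True, False, True, False]
occurrence = run_starts.groupby(cat).cumsum()

print(occurrence.tolist())  # [1, 1, 1, 1, 2, 2]
```

Rows 0-1 and 4-5 are two separate occurrences of category 1, so they get occurrence numbers 1 and 2 and end up in different groups, which is exactly what the plain groupby('Category') fails to do.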
Then, from that intermediate product, you can easily calculate the final output:
final = (
    intermediate[[('time(s)', 'delta'), ('Power', 'sum'), ('Vapor', 'sum')]]
    .droplevel(level=1, axis=1)
    .groupby(['Category', 'CategoryName']).sum()
)
print(final)
time(s) Power Vapor
Category CategoryName
1 A 3752 2975449.87 17944.05
2 B 2501 1782242.59 9971.02
3 C 2499 1780762.34 10132.38
4 D 1250 1777972.23 10701.66
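For completeness, the same two-pass idea (aggregate each consecutive run, then sum the runs per category) can also be written with named aggregation, which avoids the two-level columns. This is only a sketch on made-up toy numbers, not the data above:

```python
# Sketch: per-run aggregation via named aggregation, then a second groupby
# summing the runs of each category. All values are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    'Category':     [1, 1, 2, 2, 1, 1],
    'CategoryName': ['A', 'A', 'B', 'B', 'A', 'A'],
    'time(s)':      [100, 150, 200, 260, 300, 340],
    'Power':        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Vapor':        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# id of each consecutive run (a plain global counter is enough here)
runs = df['Category'].shift().ne(df['Category']).cumsum().rename('run')

per_run = df.groupby(['Category', 'CategoryName', runs]).agg(
    start=('time(s)', 'first'),
    end=('time(s)', 'last'),
    Power=('Power', 'sum'),
    Vapor=('Vapor', 'sum'),
)
per_run['TotalTime(s)'] = per_run['end'] - per_run['start']

final = (per_run[['TotalTime(s)', 'Power', 'Vapor']]
         .groupby(['Category', 'CategoryName']).sum())
print(final)
# category 1/A: TotalTime(s) = (150 - 100) + (340 - 300) = 90
# category 2/B: TotalTime(s) = 260 - 200 = 60
```

The flat single-level columns make the final groupby-sum a one-liner, at the cost of spelling out each aggregation by name.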