Pandas: Combine rows with the same date but different times into a single row for that date (consolidate partial data recorded at different times for the same identity)
I have a sample dataframe as shown below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
                 '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58'],
        'Name': ['xx', NaN, NaN, 'yy', NaN, NaN],
        'Height': [174, NaN, NaN, 160, NaN, NaN],
        'Weight': [74, NaN, NaN, 58, NaN, NaN],
        'Gender': [NaN, 'Male', NaN, NaN, 'Female', NaN],
        'Interests': [NaN, NaN, 'Hiking,Sports', NaN, NaN, 'Singing']}
df1 = pd.DataFrame(data)
df1
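I want to combine the data for the same date into a single row. The 'Date' column is in timestamp format.
The final output should have one row per ID and date.
Any help is highly appreciated. Thanks.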
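If your data is sorted correctly, as in the sample, you can consolidate it as follows (pd.Grouper with freq needs 'Date' to already be a datetime column, e.g. via pd.to_datetime):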
>>> df1.groupby(['ID', pd.Grouper(key='Date', freq='D')]) \
.sum().reset_index()
   ID        Date  Name  Height  Weight  Gender      Interests
0   A  2021-09-20    xx   174cm    74kg    Male  Hiking,Sports
1   B  2021-09-01    yy   160cm    58kg  Female
2   B  2021-09-02                                      Singing
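New solution
The old solution was based on the initial version of the question, where empty strings rather than NaN were used for undefined values and all columns were of string type. For the updated question, which uses NaN for undefined values (and now also mixes numeric and string column dtypes), the solution can be simplified as follows:
You can use .groupby() + GroupBy.last() to group by ID and date (without the time), and then collapse the NaN and non-NaN elements of each group by taking the latest non-NaN value in every column (assuming the Date column is in chronological order), as follows: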
# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])
# Sort `df1` with ['ID', 'Date'] order if not already in this order
#df1 = df1.sort_values(['ID', 'Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.last()
.reset_index()
).replace([None], [np.nan])
Result:
print(df_out)
   ID        Date  Name  Height  Weight  Gender      Interests
0   A  2021-09-20    xx   174.0    74.0    Male  Hiking,Sports
1   B  2021-09-01    yy   160.0    58.0  Female            NaN
2   B  2021-09-02   NaN     NaN     NaN     NaN        Singing
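The reason this works is that GroupBy.last() returns, for each column separately, the last non-null value within the group rather than the values of the last row. A minimal sketch on a toy frame (the names toy, key, x and y are made up for illustration):

```python
import numpy as np
import pandas as pd

# One group whose values are spread over three partial rows.
toy = pd.DataFrame({
    "key": ["a", "a", "a"],
    "x": [1.0, np.nan, np.nan],   # only the first row has a value
    "y": [np.nan, "p", "q"],      # two rows have a value
})

# last() looks at each column on its own and skips nulls.
collapsed = toy.groupby("key").last()
print(collapsed.loc["a", "x"])  # 1.0
print(collapsed.loc["a", "y"])  # q
```

When two rows of the same group both carry a value for a column (as with y here), the later row wins, which is why the rows should be in chronological order first.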
Old solution
You can use .groupby() + .agg() to group by ID and date, and then aggregate the NaN and non-NaN elements, as follows:
# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg(lambda x: ''.join(x.dropna().astype(str)))
.reset_index()
).replace('', np.nan)
Result:
print(df_out)
   ID        Date  Name  Height  Weight  Gender      Interests
0   A  2021-09-20    xx   174.0    74.0    Male  Hiking,Sports
1   B  2021-09-01    yy   160.0    58.0  Female            NaN
2   B  2021-09-02   NaN     NaN     NaN     NaN        Singing
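One side effect of this version is easier to see in isolation: astype(str) followed by ''.join turns numeric values into text, so numeric columns come back as strings even though they print the same. A small sketch (the heights Series is a made-up stand-in for one group's Height values):

```python
import numpy as np
import pandas as pd

# Stand-in for the Height values of a single (ID, date) group.
heights = pd.Series([174.0, np.nan, np.nan])

# Same expression as in the lambda above, applied to one group.
joined = ''.join(heights.dropna().astype(str))
print(repr(joined))           # '174.0'
print(type(joined).__name__)  # str
```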
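Since all columns in your original question were of string type, the code above works fine and returns every column as strings. However, your edited question contains both numeric and string data. To retain the original data types, we can modify the code as follows: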
# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg(lambda x: np.nan if len(w:=x.dropna().reset_index(drop=True)) == 0 else w)
.reset_index()
)
Result:
print(df_out)
   ID        Date  Name  Height  Weight  Gender      Interests
0   A  2021-09-20    xx   174.0    74.0    Male  Hiking,Sports
1   B  2021-09-01    yy   160.0    58.0  Female            NaN
2   B  2021-09-02   NaN     NaN     NaN     NaN        Singing
print(df_out.dtypes)
ID object
Date datetime64[ns]
Name object
Height float64 <==== retained as numeric dtype
Weight float64 <==== retained as numeric dtype
Gender object
Interests object
dtype: object
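The := operator in the lambda needs Python 3.8 or later. As a rough alternative, a named helper that returns one scalar per group can be passed to .agg() instead. This is only a sketch: first_valid is a hypothetical helper name (not a pandas function), it keeps just the first non-null value rather than all of them, and the frame is a cut-down version of the question's data.

```python
import numpy as np
import pandas as pd

def first_valid(col):
    """Return the first non-null value of one column within one group, else NaN."""
    valid = col.dropna()
    return valid.iloc[0] if len(valid) else np.nan

# Cut-down version of the question's data (fewer rows and columns).
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "Date": pd.to_datetime(["2021-09-20 04:34:57", "2021-09-20 04:37:25",
                            "2021-09-01 00:12:29", "2021-09-02 09:20:58"]),
    "Height": [174, np.nan, 160, np.nan],
    "Gender": [np.nan, "Male", np.nan, np.nan],
})

# Group by ID and calendar day, then reduce every column with the helper.
out = (df.groupby(["ID", pd.Grouper(key="Date", freq="D")])
         .agg(first_valid)
         .reset_index())
print(out)
print(out.dtypes)
```

Because the helper returns plain scalars, the numeric column should stay numeric in the result.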
First convert to datetime and floor to the day:
In [3]: df["Date"] = pd.to_datetime(df["Date"]).dt.floor('D')
In [4]: df
Out[4]:
   ID        Date  Name  Height  Weight  Gender      Interests
0   A  2021-09-20    xx   174cm    74kg
1   A  2021-09-20                          Male
2   A  2021-09-20                                Hiking,Sports
3   B  2021-09-01    yy   160cm    58kg
4   B  2021-09-01                        Female
5   B  2021-09-02                                      Singing
Now use groupby and sum:
In [5]: df.groupby(["ID", "Date"]).sum().reset_index()
Out[5]:
   ID        Date  Name  Height  Weight  Gender      Interests
0   A  2021-09-20    xx   174cm    74kg    Male  Hiking,Sports
1   B  2021-09-01    yy   160cm    58kg  Female
2   B  2021-09-02                                      Singing
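Note that sum relies on the empty-string placeholders from the first version of the question: concatenating '' with a string leaves it unchanged. For the NaN-based data, one possible adaptation (my assumption, not something proposed in this answer) is to keep the dt.floor step and swap sum for first, which picks the first non-null value per column:

```python
import numpy as np
import pandas as pd

# The question's data with NaN placeholders (Weight and Interests omitted for brevity).
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B"],
    "Date": ["2021-09-20 04:34:57", "2021-09-20 04:37:25", "2021-09-20 04:38:26",
             "2021-09-01 00:12:29", "2021-09-01 11:20:58", "2021-09-02 09:20:58"],
    "Name": ["xx", np.nan, np.nan, "yy", np.nan, np.nan],
    "Height": [174, np.nan, np.nan, 160, np.nan, np.nan],
    "Gender": [np.nan, "Male", np.nan, np.nan, "Female", np.nan],
})

# Drop the time of day so rows from the same day share one key.
df["Date"] = pd.to_datetime(df["Date"]).dt.floor("D")

# first() skips nulls column by column within each (ID, Date) group.
out = df.groupby(["ID", "Date"], as_index=False).first()
print(out)
```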