Pandas: Combine rows having the same date but different times into a single row for that date (consolidate partial data from different times for the same identity)

I have a sample dataframe as shown below.

import pandas as pd
import numpy as np

NaN = np.nan
data = {'ID':['A', 'A', 'A', 'B','B','B'],
    'Date':['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
            '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58'],
    'Name':['xx',NaN,NaN,'yy',NaN,NaN],
    'Height':[174,NaN,NaN,160,NaN,NaN],
    'Weight':[74,NaN,NaN,58,NaN,NaN],
    'Gender':[NaN,'Male',NaN,NaN,'Female',NaN],
    'Interests':[NaN,NaN,'Hiking,Sports',NaN,NaN,'Singing']}

df1 = pd.DataFrame(data)
df1

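I want to merge the data for the same date into a single row. The 'Date' column is in timestamp format. The final output should look like the picture below.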

Any help is greatly appreciated. Thanks.

If your data is correctly sorted, as in your sample, you can merge it as follows:

>>> df1.groupby(['ID', pd.Grouper(key='Date', freq='D')]) \
       .sum().reset_index()

  ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174cm   74kg    Male  Hiking,Sports
1  B 2021-09-01   yy  160cm   58kg  Female               
2  B 2021-09-02                                   Singing
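
As a side note, `pd.Grouper(key='Date', freq='D')` needs `Date` to be a datetime column, and it bins the timestamps by calendar day. A minimal sketch, assuming the timestamps from the question (the names `df`, `by_grouper` and `by_floor` are only for illustration), comparing it with flooring the timestamps by hand:

```python
import pandas as pd

# IDs and timestamps taken from the question's sample; other columns omitted
df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime([
        '2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
        '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58',
    ]),
})

# Count rows per (ID, calendar day) using pd.Grouper ...
by_grouper = df.groupby(['ID', pd.Grouper(key='Date', freq='D')]).size()

# ... and using an explicitly floored copy of the Date column
by_floor = df.groupby(['ID', df['Date'].dt.floor('D')]).size()

print(by_grouper)
print(by_floor)
```

Both groupings should put the three rows of `A` on 2021-09-20 together and split `B` into 2021-09-01 and 2021-09-02.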

New solution

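The old solution was based on the initial version of the question, where empty strings rather than NaN were used for undefined values and all columns were of string type. For the updated question, which uses NaN for undefined values (and which now also has columns of mixed numeric and string dtypes), the solution can be simplified as follows: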

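You can use .groupby() + GroupBy.last() to group by ID and date (without time) and then aggregate the NaN and non-NaN elements, taking the latest non-NaN value in each group (assuming column Date is in chronological order), as follows: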

# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])

# Sort `df1` with ['ID', 'Date'] order if not already in this order
#df1 = df1.sort_values(['ID', 'Date'])

df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .last()
             .reset_index()
         ).replace([None], [np.nan])

Result:

print(df_out)


   ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174.0   74.0    Male  Hiking,Sports
1  B 2021-09-01   yy  160.0   58.0  Female            NaN
2  B 2021-09-02  NaN    NaN    NaN     NaN        Singing
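
To illustrate why the chronological-order assumption matters: `GroupBy.last()` returns, per column, the last non-null value within each group, while `GroupBy.first()` returns the first one. A minimal sketch with a reduced, hypothetical variant of the sample in which `Interests` is filled twice on the same day (the data and the names `df`, `grouped`, `latest` and `earliest` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the sample: `Interests` has two values on one day
df = pd.DataFrame({
    'ID': ['A', 'A', 'A'],
    'Date': pd.to_datetime(['2021-09-20 04:34:57',
                            '2021-09-20 04:37:25',
                            '2021-09-20 04:38:26']),
    'Height': [174, np.nan, np.nan],
    'Gender': [np.nan, 'Male', np.nan],
    'Interests': [np.nan, 'Hiking', 'Hiking,Sports'],
})

# Sort first so that "last" really means "latest in time"
grouped = (df.sort_values(['ID', 'Date'])
             .groupby(['ID', pd.Grouper(key='Date', freq='D')]))

latest = grouped.last().reset_index()     # last non-null value per column
earliest = grouped.first().reset_index()  # first non-null value per column

print(latest)
print(earliest)
```

Columns with a single non-null value per day (`Height`, `Gender`) come out the same either way; only `Interests` differs between `latest` and `earliest`.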

Old solution

You can use .groupby() + .agg() to group by ID and date and then aggregate the NaN and non-NaN elements, as follows:

# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])

df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ''.join(x.dropna().astype(str)))
             .reset_index()
         ).replace('', np.nan)

Result:

print(df_out)


   ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174.0   74.0    Male  Hiking,Sports
1  B 2021-09-01   yy  160.0   58.0  Female            NaN
2  B 2021-09-02  NaN    NaN    NaN     NaN        Singing

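Since all columns in your original question were of string type, the code above works fine and returns every column as strings. However, your edited question contains both numeric and string data. To retain the original data types, we can modify the code as follows: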

# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])

df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: np.nan if len(w:=x.dropna().reset_index(drop=True)) == 0 else w)
             .reset_index()
         )

Result:

print(df_out)


   ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174.0   74.0    Male  Hiking,Sports
1  B 2021-09-01   yy  160.0   58.0  Female            NaN
2  B 2021-09-02  NaN    NaN    NaN     NaN        Singing


print(df_out.dtypes)

ID                   object
Date         datetime64[ns]
Name                 object
Height              float64            <==== retained as numeric dtype
Weight              float64            <==== retained as numeric dtype
Gender               object
Interests            object
dtype: object

First convert to datetime and floor to the day:

In [3]: df["Date"] = pd.to_datetime(df["Date"]).dt.floor('D')

In [4]: df
Out[4]:
  ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174cm   74kg
1  A 2021-09-20                       Male
2  A 2021-09-20                             Hiking,Sports
3  B 2021-09-01   yy  160cm   58kg
4  B 2021-09-01                     Female
5  B 2021-09-02                                   Singing

Now use groupby and sum:

In [5]: df.groupby(["ID", "Date"]).sum().reset_index()
Out[5]:
  ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174cm   74kg    Male  Hiking,Sports
1  B 2021-09-01   yy  160cm   58kg  Female
2  B 2021-09-02                                   Singing
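
The `sum` approach relies on every column holding strings, with empty strings for missing values, so that summing a group concatenates its pieces. A hedged sketch of the same idea with an explicit `''.join`, assuming string-only columns as in the original version of the question (the data is reduced for brevity, and `df` and `out` are illustrative names):

```python
import pandas as pd

# String-only data with '' for missing values, as in the question's first version
df = pd.DataFrame({
    'ID': ['A', 'A', 'B'],
    'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-01 00:12:29'],
    'Name': ['xx', '', 'yy'],
    'Gender': ['', 'Male', ''],
})

# Drop the time part so rows from the same day share one key
df['Date'] = pd.to_datetime(df['Date']).dt.floor('D')

# Concatenate the strings of each column within every (ID, Date) group
out = df.groupby(['ID', 'Date']).agg(''.join).reset_index()

print(out)
```

This only gives sensible results when each column has at most one non-empty value per group; otherwise the values are glued together without a separator.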