Flattening time series data from pandas df

Question

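I have a df that looks like this:

I am trying to turn it into this:

The following code gives me a list of lists that I can convert to a df, and it covers the first 3 columns of the expected output, but I am not sure how to get the number columns I need (note: I have many more than 3 number columns, but I am using 3 as a simple illustration).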

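x = [['ID', 'Start', 'End', 'Number1', 'Number2', 'Number3']]
for i in range(len(df)):
    # A new spell starts when the previous row closed one (indicator False).
    # Note: at i == 0, iloc[-1] wraps around to the last row of df.
    if not df.iloc[i - 1]['DateSpellIndicator']:
        ID = df.iloc[i]['ID']
        start = df.iloc[i]['Date']
    # A spell ends on the row whose indicator is False.
    if not df.iloc[i]['DateSpellIndicator']:
        newrow = [ID, start, df.iloc[i]['Date'], ...]
        x.append(newrow)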

Answer 1

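You can get the same output in fewer steps. I use cumsum to build a key for each spell, which makes it possible to take the first and last date of every spell. I use list to collect the number values, which are then expanded into columns. Note that the output column names differ from your example; I assume you can rename them as needed.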

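import pandas as pd

# True on the row that closes a spell (datespell is False there)
df['new1'] = ~df['datespell']
# Spell key: number of spells closed before the current row
df['new2'] = df['new1'].cumsum() - df['new1']
# Named aggregation; the nested-dict renaming form of agg was removed in pandas 1.0
check = df.groupby(['id', 'new2']).agg(start=('date', 'first'), end=('date', 'last'),
                                       cols=('number', lambda x: list(x)))
check.reset_index(inplace=True)
# Expand each list into its own columns (0, 1, 2, ...)
pd.concat([check, check['cols'].apply(pd.Series)], axis=1).drop(['cols'], axis=1)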


   id  new2       start         end     0     1     2
0   1     0  2020-01-01  2020-03-01  40.0  50.0  60.0
1   1     1  2020-06-01  2020-06-01  70.0   NaN   NaN
2   2     1  2020-07-01  2020-08-01  20.0  30.0   NaN

This is the dataframe I used.

   id        date  number  datespell   new1  new2
0   1  2020-01-01      40       True  False     0
1   1  2020-02-01      50       True  False     0
2   1  2020-03-01      60      False   True     0
3   1  2020-06-01      70       True  False     1
4   2  2020-07-01      20       True  False     1
5   2  2020-08-01      30      False   True     1
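
As a self-contained illustration of the cumsum spell key, here is a minimal sketch that rebuilds the sample frame above and names the number columns Number1, Number2, and so on. It swaps in groupby().cumcount() plus pivot for the list-expansion step, so it is a variation rather than the exact code of this answer; the names spell, pos, and flat are illustrative assumptions.

```python
import pandas as pd

# Sample frame copied from the table above (dates kept as strings)
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'date': ['2020-01-01', '2020-02-01', '2020-03-01',
             '2020-06-01', '2020-07-01', '2020-08-01'],
    'number': [40, 50, 60, 70, 20, 30],
    'datespell': [True, True, False, True, True, False],
})

# 1 on every row that closes a spell (datespell is False), else 0
closes = (~df['datespell']).astype(int)
# Number of spells closed strictly before each row: one key per spell
df['spell'] = closes.cumsum() - closes
# Position of each row inside its (id, spell) group: 1, 2, 3, ...
df['pos'] = df.groupby(['id', 'spell']).cumcount() + 1

# First and last date of every spell
dates = df.groupby(['id', 'spell'])['date'].agg(start='first', end='last')
# One column per position, renamed to Number1, Number2, ...
numbers = df.pivot(index=['id', 'spell'], columns='pos', values='number')
numbers.columns = [f'Number{c}' for c in numbers.columns]

flat = dates.join(numbers).reset_index()
print(flat)
```

Grouping by both id and the spell key matters here: the cumsum runs over the whole frame, so the unclosed spell at the end of id 1 shares key 1 with the spell of id 2, and only the id separates them.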

Answer 2

Here is an approach using pandas groupby.

Input dataframe:

    ID  DATE        NUM TORF
0   1   2020-01-01  40  True
1   1   2020-02-01  50  True
2   1   2020-03-01  60  False
3   1   2020-06-01  70  True
4   2   2020-07-01  20  True
5   2   2020-08-01  30  False

Output dataframe:

    END         ID  Number1 Number2 Number3 START
0   2020-08-01  2   20      30.0    NaN     2020-07-01
1   2020-06-01  1   70      NaN     NaN     2020-06-01
2   2020-03-01  1   40      50.0    60.0    2020-01-01

Code:

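import numpy as np
import pandas as pd

new_df = pd.DataFrame()
# Create groups based on ID
for group_id, group in df.groupby('ID'):
    # Within each group, split right after every row where TORF is False.
    # Row positions are split instead of the DataFrame itself, because
    # np.split on a DataFrame warns or fails on recent pandas versions.
    split_points = np.where(group['TORF'] == False)[0] + 1
    for positions in np.split(np.arange(len(group)), split_points):
        sub_df = group.iloc[positions]
        # Skip the empty piece left over when the group ends with False
        if not sub_df.empty:
            dfmod = pd.DataFrame({'ID': sub_df['ID'].iloc[0],
                                  'START': sub_df['DATE'].iloc[0],
                                  'END': sub_df['DATE'].iloc[-1]}, index=[0])
            # Add one NumberN column per row of the spell
            for j, num in enumerate(sub_df['NUM'], start=1):
                dfmod['Number{}'.format(j)] = num
            # Prepend the new row to the rows collected so far
            new_df = pd.concat([dfmod, new_df], axis=0)

# Column order may differ from the output shown above depending on the pandas version
new_df = new_df.reset_index(drop=True)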
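
The splitting step is the least obvious part of this answer, so here is a minimal sketch of it in isolation, using only NumPy. The helper name spell_positions is hypothetical and not part of the answer; it returns the row positions belonging to each spell for one ID.

```python
import numpy as np


def spell_positions(flags):
    """Return row positions grouped by spell (hypothetical helper).

    A spell ends on the row where the flag is False, so the next
    spell starts one position after each False.
    """
    flags = np.asarray(flags, dtype=bool)
    # Positions just after each False mark the start of a new spell
    split_points = np.where(~flags)[0] + 1
    chunks = np.split(np.arange(len(flags)), split_points)
    # A trailing False leaves an empty last chunk, which is dropped here
    return [chunk.tolist() for chunk in chunks if chunk.size]


# TORF values for ID 1 and ID 2 in the input dataframe above
print(spell_positions([True, True, False, True]))  # [[0, 1, 2], [3]]
print(spell_positions([True, False]))              # [[0, 1]]
```

The second call shows why the code in this answer checks for empty pieces: when a group ends with False, the last split point equals the group length and np.split returns an empty final chunk.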
