Append rows to a dataframe efficiently

I have a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame({'Timestamp': ['1642847484', '1642847484', '1642847484', '1642847484',
                                 '1642847487', '1642847487', '1642847487', '1642847487',
                                 '1642847487', '1642847487', '1642847487', '1642847487',
                                 '1642847489', '1642847489', '1642847489'],
                   'value': [11, 10, 14, 20, 3, 2, 9, 48, 5, 20, 12, 20, 56, 12, 8]})

I need to perform some operations on each group of values that share the same timestamp, so I use groupby as follows:

df_grouped = df.groupby('Timestamp')

Then I iterate over the rows of each group and append the results, row by row, to a new dataframe:

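df_out = pd.DataFrame(columns=('Timestamp', 'value'))
for group_name, df_group in df_grouped:
    i = 0
    for row_index, row in df_group.iterrows():
        # Timestamp is stored as a string, so convert it before doing arithmetic
        row['Timestamp'] = int(row['Timestamp']) * 1000 + i * 30
        # DataFrame.append copies the whole frame on every call (removed in pandas 2.0)
        df_out = df_out.append(row)
        i = i + 1
    print(df_out.tail())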

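However, my approach takes a lot of time (more than 7 million rows), and I would like to know whether there is a more efficient way to do this. Thanks.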

I don't think iterrows is necessary here; you can use:

def f(x):

    x['Timestamp'] = ...
    ....
    return x
    
df1 = df.groupby('Timestamp').apply(f)
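
For illustration only, here is one way such a function could look, assuming the goal is the rule from the question's loop (multiply by 1000, then add 30 per row within the group). The body of f below is my assumption, not a required form. The sketch groups by an array of key values rather than by the column name, so that Timestamp is still among the columns handed to f regardless of how the installed pandas version treats grouping columns in apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['1642847484', '1642847484', '1642847487'],
    'value': [11, 10, 3],
})
df['Timestamp'] = df['Timestamp'].astype(np.int64)

def f(x):
    # x holds the rows of one group; number them 0, 1, 2, ...
    # and shift each row by 30 per position (assumed rule from the question)
    offsets = np.arange(len(x)) * 30
    return x.assign(Timestamp=x['Timestamp'] * 1000 + offsets)

# Group by an external array of keys so 'Timestamp' stays a regular column in f
keys = df['Timestamp'].to_numpy()
df1 = df.groupby(keys, group_keys=False).apply(f)
print(df1)
```

This still calls a Python function once per group, so with many distinct timestamps it is likely to be noticeably slower than a fully vectorized expression.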

EDIT: Create a counter Series with GroupBy.cumcount, multiply it by 30, and add it to Timestamp multiplied by 1000:

import numpy as np

# if necessary, convert the string timestamps to integers first
df['Timestamp'] = df['Timestamp'].astype(np.int64)

df['Timestamp'] = df['Timestamp'] * 1000 + df.groupby('Timestamp').cumcount() * 30
print(df)
        Timestamp  value
0   1642847484000     11
1   1642847484030     10
2   1642847484060     14
3   1642847484090     20
4   1642847487000      3
5   1642847487030      2
6   1642847487060      9
7   1642847487090     48
8   1642847487120      5
9   1642847487150     20
10  1642847487180     12
11  1642847487210     20
12  1642847489000     56
13  1642847489030     12
14  1642847489060      8
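
If you want to convince yourself that the cumcount expression gives the same result as the row-by-row loop, a small check along these lines may help. The helper names add_offsets_vectorized and add_offsets_loop and the step_ms parameter are hypothetical, introduced only for this sketch. The loop version collects rows in a list and builds the frame once, since DataFrame.append was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

def add_offsets_vectorized(df, step_ms=30):
    """Vectorized: per-group counter from cumcount, no Python-level row loop."""
    out = df.copy()
    ts = out['Timestamp'].astype(np.int64)
    out['Timestamp'] = ts * 1000 + out.groupby('Timestamp').cumcount() * step_ms
    return out

def add_offsets_loop(df, step_ms=30):
    """Reference loop: same rule as the question, rows gathered in a list."""
    rows = []
    for _, group in df.groupby('Timestamp', sort=False):
        for i, row in enumerate(group.itertuples(index=False)):
            rows.append({'Timestamp': int(row.Timestamp) * 1000 + i * step_ms,
                         'value': row.value})
    # Build the dataframe once instead of appending row by row
    return pd.DataFrame(rows)

df = pd.DataFrame({
    'Timestamp': ['1642847484'] * 4 + ['1642847487'] * 8 + ['1642847489'] * 3,
    'value': [11, 10, 14, 20, 3, 2, 9, 48, 5, 20, 12, 20, 56, 12, 8],
})

fast = add_offsets_vectorized(df)
slow = add_offsets_loop(df)

# Both approaches should agree on this sample
print(fast['Timestamp'].tolist() == slow['Timestamp'].tolist())
```

The row order of the two results matches here because rows with the same timestamp are contiguous in the sample. If they are interleaved in your data, sort or compare by index before checking equality.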