Append rows to a dataframe efficiently
I have a dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({'Timestamp': ['1642847484', '1642847484', '1642847484', '1642847484', '1642847487', '1642847487','1642847487','1642847487','1642847487','1642847487','1642847487','1642847487', '1642847489', '1642847489', '1642847489'],
'value': [11, 10, 14, 20, 3, 2, 9, 48, 5, 20, 12, 20, 56, 12, 8]})
I need to perform some operations on each group of values that share the same timestamp, so I use groupby as follows:
df_grouped = df.groupby('Timestamp')
Then I iterate over the rows of each group and append the results row by row to a new dataframe:
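df_out = pd.DataFrame(columns=('Timestamp', 'value'))
for group_name, df_group in df_grouped:
    i = 0
    for row_index, row in df_group.iterrows():
        # Timestamp is stored as a string, so convert it before doing arithmetic
        row['Timestamp'] = int(row['Timestamp']) * 1000 + i * 30
        # note: DataFrame.append was removed in pandas 2.0
        df_out = df_out.append(row)
        i = i + 1
print(df_out.tail())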
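However, my approach takes a very long time (more than 7 million rows), and I would like to know whether there is a more efficient way to do this. Thanks.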
I don't think iterrows is necessary here; you can use:
def f(x):
    x['Timestamp'] = ...
    ....
    return x
df1 = df.groupby('Timestamp').apply(f)
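If some per-group logic really is needed, one possible alternative to appending row by row is to build each group's result in a single step, collect the pieces in a list, and concatenate once at the end. The sketch below is only an illustration on a shortened copy of the sample data; the names `pieces` and `df_out` are chosen here, and it assumes the offset is the position within the group times 30, as in the question's loop.

```python
import numpy as np
import pandas as pd

# Shortened version of the sample data from the question
df = pd.DataFrame({
    'Timestamp': ['1642847484', '1642847484', '1642847487', '1642847487', '1642847487'],
    'value': [11, 10, 3, 2, 9],
})

pieces = []  # hypothetical name: one processed DataFrame per group
for ts, group in df.groupby('Timestamp', sort=False):
    group = group.copy()  # avoid modifying a view of the original frame
    # position within the group (0, 1, 2, ...) times 30, added to the timestamp in ms
    offsets = np.arange(len(group), dtype=np.int64) * 30
    group['Timestamp'] = int(ts) * 1000 + offsets
    pieces.append(group)

# A single concatenation instead of one append per row
df_out = pd.concat(pieces)
print(df_out)
```

This still loops over groups in Python, so it is likely to be slower than a fully vectorized solution, but it avoids copying the growing output frame on every row.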
Edit: create a counter Series with GroupBy.cumcount, multiply it by 30, and add it to Timestamp multiplied by 1000:
import numpy as np

# if necessary: convert the string timestamps to integers
df['Timestamp'] = df['Timestamp'].astype(np.int64)
df['Timestamp'] = df['Timestamp'] * 1000 + df.groupby('Timestamp').cumcount() * 30
print(df)
        Timestamp  value
0   1642847484000     11
1   1642847484030     10
2   1642847484060     14
3   1642847484090     20
4   1642847487000      3
5   1642847487030      2
6   1642847487060      9
7   1642847487090     48
8   1642847487120      5
9   1642847487150     20
10  1642847487180     12
11  1642847487210     20
12  1642847489000     56
13  1642847489030     12
14  1642847489060      8
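As a rough sanity check, the cumcount approach can be wrapped in a small function and compared with a plain Python loop. The helper name `add_offsets` and its `step_ms` parameter are made up for this sketch, and the sample rows are deliberately out of order to show that the counter follows each row's position within its own group rather than requiring sorted input.

```python
import numpy as np
import pandas as pd

def add_offsets(df, step_ms=30):
    """Hypothetical helper: return a copy with millisecond timestamps plus a per-group offset."""
    seconds = df['Timestamp'].astype(np.int64)      # string seconds -> integers
    counter = df.groupby('Timestamp').cumcount()    # 0, 1, 2, ... within each timestamp
    return df.assign(Timestamp=seconds * 1000 + counter * step_ms)

# Rows for the two timestamps are interleaved on purpose
df = pd.DataFrame({
    'Timestamp': ['1642847487', '1642847484', '1642847487', '1642847484'],
    'value': [3, 11, 2, 10],
})
result = add_offsets(df)

# Reference computed with an explicit loop and a per-timestamp counter
seen = {}
expected = []
for ts in df['Timestamp']:
    n = seen.get(ts, 0)
    expected.append(int(ts) * 1000 + n * 30)
    seen[ts] = n + 1

print(result)
print(result['Timestamp'].tolist() == expected)
```

Because `assign` returns a new DataFrame, the original `df` keeps its string timestamps, which makes the function safe to call more than once.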