Add rows to pandas dataframe using interpolate
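I am trying to interpolate a pandas DataFrame that contains time-series data. I have hourly data for temp and want to interpolate temp values at the half-hour points. That gives me an estimate of temp for every trading period of each day: 24 hours per day means 48 trading periods per day.
My MWE is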
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta
import pyarrow as pa
import pyarrow.parquet as pq
# my dataset
df = pd.DataFrame()
d1 = '2020-10-21'
d2 = '2020-10-22'
df['date'] = pd.to_datetime([d1]*24+[d2]*24, format='%Y-%m-%d')
df['time'] = pd.date_range(d1, periods=len(df), freq='H').time
df['temp'] = pd.DataFrame((50+20*np.sin(np.linspace(0,0.91*np.pi,len(df))))).values
# combine time and date
df.loc[:,'datetime'] = pd.to_datetime(df.date.astype(str)+' '+df.time.astype(str))
df = df.drop(['date','time'], axis=1)
df = df.set_index('datetime')
# trading period
df['tp'] = pd.DataFrame(df.index.hour.values*2+1).values
# interpolate to find temp and datetime for trading periods 2,4,6,...
for n in df.tp.values:
    df.loc[-1,'tp'] = n+1
    df = df.sort_values('tp').reset_index(drop=True)
#df = df.interpolate(method='linear')
print(df.head(10))
I am adapting the answer in this post, but I get the error TypeError: value should be a 'Timestamp' or 'NaT'. Got 'int' instead.
I suspect this is caused by the line df.loc[-1,'tp'] = n+1, but I am not sure how to fix it.
Try:
df = df.resample('30min').mean().interpolate()
df['tp'] = ((df.index.hour * 60 + df.index.minute) / 30 + 1).astype(int)
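For reference, here is a self-contained sketch of the same idea, so it can be run without the question's MWE. It assumes the hourly data can be built directly on a DatetimeIndex; the names idx, hourly, half_hourly and minutes are made up for this illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hourly temperatures over two days, mirroring the question's MWE
# (built directly on a DatetimeIndex; this shortcut is an assumption).
idx = pd.date_range("2020-10-21", periods=48, freq="h", name="datetime")
temp = 50 + 20 * np.sin(np.linspace(0, 0.91 * np.pi, len(idx)))
hourly = pd.DataFrame({"temp": temp}, index=idx)

# Upsample to a 30-minute grid; the new half-hour rows start out as NaN.
half_hourly = hourly.resample("30min").mean()

# Fill the gaps linearly between neighbouring hourly readings.
half_hourly["temp"] = half_hourly["temp"].interpolate(method="linear")

# Trading period 1..48, derived from the time of day of each row.
minutes = half_hourly.index.hour * 60 + half_hourly.index.minute
half_hourly["tp"] = (minutes // 30 + 1).astype(int)

print(half_hourly.head())
```

Because tp is computed from the index after resampling, it restarts at 1 every midnight instead of being interpolated along with temp.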
Try asfreq and then interpolate:
In [36]: df.asfreq('30T').interpolate()
Out[36]:
                          temp    tp
datetime
2020-10-21 00:00:00  50.000000   1.0
2020-10-21 00:30:00  50.607891   2.0
2020-10-21 01:00:00  51.215782   3.0
2020-10-21 01:30:00  51.821424   4.0
2020-10-21 02:00:00  52.427066   5.0
...                        ...   ...
2020-10-22 21:00:00  57.869280  43.0
2020-10-22 21:30:00  57.303145  44.0
2020-10-22 22:00:00  56.737010  45.0
2020-10-22 22:30:00  56.158416  46.0
2020-10-22 23:00:00  55.579822  47.0
[95 rows x 2 columns]
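One caveat with interpolating every column: tp is interpolated too, so the 23:30 row of the first day lands halfway between 47 and the next day's 1 rather than at 48. The sketch below rebuilds the question's data (directly on a DatetimeIndex, which is an assumption) and recomputes tp from the index afterwards; the names out, last_slot and naive_tp are hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the question's hourly frame, including the odd-numbered tp column.
idx = pd.date_range("2020-10-21", periods=48, freq="h", name="datetime")
df = pd.DataFrame(
    {"temp": 50 + 20 * np.sin(np.linspace(0, 0.91 * np.pi, len(idx)))},
    index=idx,
)
df["tp"] = df.index.hour * 2 + 1

# Insert the half-hour rows as NaN, then interpolate all columns linearly.
out = df.asfreq("30min").interpolate()

# The last half-hour slot of the first day sits between tp=47 and tp=1.
last_slot = pd.Timestamp("2020-10-21 23:30")
naive_tp = out.loc[last_slot, "tp"]
print(naive_tp)  # 24.0, not the intended 48

# Recompute tp from the time of day so it stays within 1..48.
out["tp"] = out.index.hour * 2 + out.index.minute // 30 + 1
print(out.loc[last_slot, "tp"])  # 48
```

Note that the result stops at 23:00 on the last day, since there is no later reading to interpolate towards.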