Downsampling in Pandas DataFrame by dividing observations into ratios
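Given a DataFrame with a timestamp column (ts), I want to downsample the data to hourly buckets. Values that were previously indexed by ts should now be divided into ratios based on the number of minutes that fall in each hour. [Note: the data is divided in proportion across the NaN rows created while resampling.]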
                    ts event  duration
0  2020-09-09 21:01:00     a        12
1  2020-09-10 00:10:00     a        22
2  2020-09-10 01:31:00     a       130
3  2020-09-10 01:50:00     b        60
4  2020-09-10 01:51:00     b        50
5  2020-09-10 01:59:00     b        26
6  2020-09-10 02:01:00     c        72
7  2020-09-10 02:51:00     b        51
8  2020-09-10 03:01:00     b        63
9  2020-09-10 04:01:00     c        79
import pandas as pd

def create_dataframe():
    df = pd.DataFrame([{'duration':12, 'event':'a', 'ts':'2020-09-09 21:01:00'},
                       {'duration':22, 'event':'a', 'ts':'2020-09-10 00:10:00'},
                       {'duration':130, 'event':'a', 'ts':'2020-09-10 01:31:00'},
                       {'duration':60, 'event':'b', 'ts':'2020-09-10 01:50:00'},
                       {'duration':50, 'event':'b', 'ts':'2020-09-10 01:51:00'},
                       {'duration':26, 'event':'b', 'ts':'2020-09-10 01:59:00'},
                       {'duration':72, 'event':'c', 'ts':'2020-09-10 02:01:00'},
                       {'duration':51, 'event':'b', 'ts':'2020-09-10 02:51:00'},
                       {'duration':63, 'event':'b', 'ts':'2020-09-10 03:01:00'},
                       {'duration':79, 'event':'c', 'ts':'2020-09-10 04:01:00'},
                       {'duration':179, 'event':'c', 'ts':'2020-09-10 06:05:00'},
                       ])
    # Parse the timestamp strings into datetime64 values
    df.ts = pd.to_datetime(df.ts)
    return df
I want to estimate production based on the ratio of time spent to output produced. This can be compared to asking how many lines of code have been completed, or how many actual lines were written per hour.
For example, at "2020-09-10 00:10:00" we have 22. For the period 21:01 to 00:10 (189 minutes in total), the value is distributed as follows:
59 min of 21:00 hours -> 7 => =ROUND(22/189*59,0)
60 min of 22:00 hours -> 7 => =ROUND(22/189*60,0)
60 min of 23:00 hours -> 7 => =ROUND(22/189*60,0)
10 min of 00:00 hours -> 1 => =ROUND(22/189*10,0)
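The spreadsheet-style formulas above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; the helper name `split_by_minutes` is made up for illustration and is not part of the original code.

```python
def split_by_minutes(total, minutes):
    """Split `total` across buckets in proportion to the minutes in each bucket."""
    span = sum(minutes)  # 59 + 60 + 60 + 10 = 189 minutes for 21:01 -> 00:10
    return [round(total / span * m) for m in minutes]

shares = split_by_minutes(22, [59, 60, 60, 10])
print(shares)  # [7, 7, 7, 1]
```

Each share is rounded on its own here, exactly as in the `ROUND(...)` formulas, so the shares are not guaranteed to add up to the original value in every case.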
The result should look like this.
                    ts event duration
0  2020-09-09 20:00:00     a      NaN
1  2020-09-09 21:00:00     a        7
2  2020-09-09 22:00:00     a        7
3  2020-09-09 23:00:00     a        7
4  2020-09-10 00:00:00     a        1
5  2020-09-10 01:00:00     b       ..
6  2020-09-10 02:00:00     c       ..
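Problem with this approach:
In my opinion, this approach has a serious problem. If you look at row [1] -> 2020-09-10 07:00:00, we have 4, and we need to divide it over 3 hours. Taking the base duration value as 1 (the basic unit), we get: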
def create_dataframe2():
    df = pd.DataFrame([{'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 07:00:00'},
                       {'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 10:00:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df
Source
   duration event         c                   ts
0         4     c  event3.5  2020-09-10 07:00:00
1         4     c  event3.5  2020-09-10 10:00:00
Expected Output
             ts_hourly  mins  duration
0  2020-09-10 07:00:00  60.0         2
1  2020-09-10 08:00:00  60.0         1
2  2020-09-10 09:00:00  60.0         1
3  2020-09-10 10:00:00   0.0         0
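To make the rounding issue concrete, here is a small sketch comparing two ways of splitting 4 units over three 60-minute hours. Both helper names are hypothetical. The first rounds each share independently; the second rounds the running total at each hour boundary and takes differences, which keeps the sum equal to the original value. Note that neither reproduces the exact 2, 1, 1 layout of the expected output above; the cumulative variant simply places the extra unit in a different hour.

```python
from itertools import accumulate

def split_independent(total, minutes):
    """Round each proportional share on its own (the total may drift)."""
    span = sum(minutes)
    return [round(total * m / span) for m in minutes]

def split_cumulative(total, minutes):
    """Round the running total at each boundary, then take differences."""
    span = sum(minutes)
    bounds = [0] + [round(total * c / span) for c in accumulate(minutes)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

print(split_independent(4, [60, 60, 60]))  # [1, 1, 1], sums to 3
print(split_cumulative(4, [60, 60, 60]))   # [1, 2, 1], sums to 4
```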
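I could not find a solution in pandas, so I wrote a simple one in plain Python.
Basically, after sorting, I iterate over all the values and pass two datetimes, start_time and end_time, to a function that does the processing.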
from datetime import timedelta

def get_ratio_per_hour(start_time: pd.Timestamp, end_time: pd.Timestamp, data_: int):
    # get total hours between the start and end, use this for looping
    totalhrs = lambda x: [1 for _ in range(int(x // 3600))] + [
        (x % 3600 / 3600
         or 0.1  # added for loop fix afterwards
         )]
    # check if Start and End are not in same hour
    if start_time.hour != end_time.hour:
        seconds = (end_time - start_time).total_seconds()
        if seconds < 3600:
            parts_ = [1] + totalhrs(seconds)
        else:
            parts_ = totalhrs(seconds)
    else:
        # parts_ define the loop iterations
        parts_ = totalhrs((end_time - start_time).total_seconds())
    sum_of_hrs = sum(parts_)
    # for Constructing DF
    new_hours = []
    mins = []
    # Clone data
    start_time_ = start_time
    end_time_ = end_time
    for e in range(len(parts_)):
        # print(parts_[e])
        if sum_of_hrs != 0:
            if sum_of_hrs > 1:
                if end_time_.hour != start_time_.hour:
                    # Floor > based on the startTime +1 hour
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # Hour is same.
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
            else:
                if end_time_.hour != start_time_.hour:
                    # Get round off hour
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append(60 - ((floor_time - end_time_).total_seconds() // 60))
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # Hour is same.
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append((end_time_ - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
    # Get DataFrame Build
    df_out = pd.DataFrame()
    df_out['hours'] = pd.Series(new_hours)
    df_out['mins'] = pd.Series(mins)
    # Split data_ in proportion to the minutes spent in each hour
    df_out['ratios'] = round(data_ / sum(mins) * df_out['mins'])
    return df_out
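For comparison, the core of this function (how many minutes of an interval fall into each clock hour) can be sketched much more compactly with only the standard library. This is an illustrative alternative, not the code above: the name `minutes_per_hour` is hypothetical, and unlike the original it emits no zero-minute row when the interval ends exactly on the hour.

```python
from datetime import datetime, timedelta

def minutes_per_hour(start, end):
    """Return [(hour_start, minutes)] for each clock hour touched by [start, end)."""
    out = []
    cur = start
    while cur < end:
        hour_start = cur.replace(minute=0, second=0, microsecond=0)
        # Stop at the next full hour or at the end of the interval
        nxt = min(hour_start + timedelta(hours=1), end)
        out.append((hour_start, int((nxt - cur).total_seconds() // 60)))
        cur = nxt
    return out

parts = minutes_per_hour(datetime(2020, 9, 9, 21, 1), datetime(2020, 9, 10, 0, 10))
print([m for _, m in parts])  # [59, 60, 60, 10]
```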
Now, let's run the code for each iteration.
import numpy as np

df = create_dataframe()
time_val = []
split_f_val = []
split_field = 'duration'
time_field = 'ts'
# creating DataFrames for intermediate results!
df_final = pd.DataFrame()
df2 = pd.DataFrame()
for ix, row in df.iterrows():
    time_val.append(row[str(time_field)])
    split_f_val.append(int(row[str(split_field)]))
    # Skipping First Element for Processing. Therefore, having minimum two data values
    if ix != 0:
        # getting Last Two Values
        new_time_list = time_val[-2:]
        new_data_list = split_f_val[-2:]
        # get times to compare
        start_time = new_time_list[:-1][0]
        end_time = new_time_list[1:][0]
        # get latest Data to divide
        data_ = new_data_list[1:][0]
        # print(start_time)
        # print(end_time)
        df2 = get_ratio_per_hour(start_time, end_time, data_)
        df_final = pd.concat([df_final, df2], ignore_index=True)
    else:
        # Create Empty DataFrame for First Value.
        df_final = pd.DataFrame([[np.nan, np.nan, np.nan]],
                                columns=['hours', 'mins', 'ratios'])
        df_final = pd.concat([df_final, df2], ignore_index=True)
# Sum the per-interval shares that land in the same hour
result = df_final.groupby(['hours'])['ratios'].sum()
Intermediate DataFrame:
                 hours  mins  ratios
0
0  2020-09-09 21:00:00  59.0     7.0
1  2020-09-09 22:00:00  60.0     7.0
2  2020-09-09 23:00:00  60.0     7.0
3  2020-09-10 00:00:00  10.0     1.0
0  2020-09-10 00:00:00  50.0    80.0
1  2020-09-10 01:00:00  31.0    50.0
0  2020-09-10 01:00:00  19.0    60.0
0  2020-09-10 01:00:00   1.0    50.0
0  2020-09-10 01:00:00   8.0    26.0
0  2020-09-10 01:00:00   1.0    36.0
1  2020-09-10 02:00:00   1.0    36.0
0  2020-09-10 02:00:00  50.0    51.0
0  2020-09-10 02:00:00   9.0    57.0
1  2020-09-10 03:00:00   1.0     6.0
0  2020-09-10 03:00:00  59.0    78.0
1  2020-09-10 04:00:00   1.0     1.0
0  2020-09-10 04:00:00  59.0    85.0
1  2020-09-10 05:00:00  60.0    87.0
2  2020-09-10 06:00:00   5.0     7.0
Final Output:
hours                ratios
2020-09-09 21:00:00     7.0
2020-09-09 22:00:00     7.0
2020-09-09 23:00:00     7.0
2020-09-10 00:00:00    81.0
2020-09-10 01:00:00   222.0
2020-09-10 02:00:00   144.0
2020-09-10 03:00:00    84.0
2020-09-10 04:00:00    86.0
2020-09-10 05:00:00    87.0
2020-09-10 06:00:00     7.0
The first step is to add a "previous ts" column to the source DataFrame:
df['tsPrev'] = df.ts.shift()
Then set the ts column as the index:
df.set_index('ts', inplace=True)
The third step is to create an auxiliary index, composed of the original index and the "full hours":
ind = df.event.resample('H').asfreq().index.union(df.index)
Then create an auxiliary DataFrame, reindexed with the index just created, and back-fill the event column:
df2 = df.reindex(ind)
df2.event = df2.event.bfill()
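The preparation steps so far can be tried on a two-row sample. The sketch below is an assumption-laden illustration rather than the answer's exact code: it builds the hour marks with `pd.date_range` instead of `resample(...).asfreq().index`, and it uses the lowercase `"h"` alias where the answer uses `'H'`.

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2020-09-09 21:01:00", "2020-09-10 00:10:00"]),
    "event": ["a", "a"],
    "duration": [12, 22],
})
df["tsPrev"] = df.ts.shift()   # previous timestamp, NaT for the first row
df = df.set_index("ts")

# Full-hour marks spanning the data, merged with the original timestamps
hours = pd.date_range(df.index.min().floor("h"), df.index.max().floor("h"), freq="h")
ind = hours.union(df.index)

df2 = df.reindex(ind)                # the added hour rows hold NaN / NaT
df2["event"] = df2["event"].bfill()  # each hour row takes the event of the next source row
print(df2)
```

After this, `df2` has six rows (four hour marks plus the two source rows), and only the two source rows carry a non-null duration.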
Define a function to be applied to each group of rows from df2:
def parts(grp):
    lstRow = grp.iloc[-1]       # Last row from group
    if pd.isna(lstRow.tsPrev):  # First group
        return pd.Series([lstRow.duration], index=[grp.index[0]], dtype=int)
    # Other groups
    # (pd.concat is used here because Series.append was removed in pandas 2.0)
    return -pd.concat([pd.Series([0], index=[lstRow.tsPrev]), grp.duration])\
        .interpolate(method='index').round().diff(-1)[:-1].astype(int)
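The interpolate / round / diff chain in parts is dense, so here is a minimal sketch of what it does for a single group, using the interval from 21:01 to 00:10 with the value 22. The series is built by hand for illustration; in the real function it comes from tsPrev and the group's duration column.

```python
import pandas as pd

idx = pd.to_datetime([
    "2020-09-09 21:01:00",  # previous source timestamp: nothing produced yet
    "2020-09-09 22:00:00",
    "2020-09-09 23:00:00",
    "2020-09-10 00:00:00",
    "2020-09-10 00:10:00",  # source row carrying the value 22
])
nan = float("nan")
s = pd.Series([0, nan, nan, nan, 22], index=idx, dtype=float)

# Running total at each boundary, interpolated linearly in time and rounded
cum = s.interpolate(method="index").round()
# Difference to the next boundary gives the amount produced in each segment
shares = -(cum.diff(-1)[:-1].astype(int))
print(shares.tolist())  # [7, 7, 7, 1]
```

Because the running total is rounded before differencing, the shares always add up to the original value (22 here).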
Then generate the source data for the "produced" column in two steps:
Generate the detailed data:
prodDet = df2.groupby(np.isfinite(df2.duration.values[::-1]).cumsum()[::-1],
sort=False).apply(parts).reset_index(level=0, drop=True)
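The source is df2, grouped in such a way that each group is terminated by a row with a non-null value in the duration column. Each group is then processed with the parts function.
The result is: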
2020-09-09 21:00:00 12
2020-09-09 21:01:00 7
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 1
2020-09-10 00:10:00 80
2020-09-10 01:00:00 50
2020-09-10 01:31:00 60
2020-09-10 01:50:00 50
2020-09-10 01:51:00 26
2020-09-10 01:59:00 36
2020-09-10 02:00:00 36
2020-09-10 02:01:00 51
2020-09-10 02:51:00 57
2020-09-10 03:00:00 6
2020-09-10 03:01:00 78
2020-09-10 04:00:00 1
2020-09-10 04:01:00 85
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
dtype: int32
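The grouping key passed to groupby when building prodDet is easier to follow on a toy array. A small sketch, assuming a duration column with hour-mark rows as NaN:

```python
import numpy as np

# Hour-mark rows are NaN, source rows carry a value
duration = np.array([np.nan, 12, np.nan, np.nan, np.nan, 22])

# Count non-null values from the end, so every non-null row closes its own group
key = np.isfinite(duration[::-1]).cumsum()[::-1]
print(key.tolist())  # [2, 2, 1, 1, 1, 1]
```

Rows sharing a key form one group: the first two rows end at the value 12, the remaining four end at the value 22. The keys decrease over time, which is presumably why the groupby call passes `sort=False` to keep the groups in chronological order.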
Generate the aggregated data, for now also as a Series:
prod = prodDet.resample('H').sum().rename('produced')
This time prodDet is resampled (broken down by hour) and the result is the sum of the values in each hour.
The result is:
2020-09-09 21:00:00 19
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 81
2020-09-10 01:00:00 222
2020-09-10 02:00:00 144
2020-09-10 03:00:00 84
2020-09-10 04:00:00 86
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
Freq: H, Name: produced, dtype: int32
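The hourly aggregation can be reproduced in isolation on the first few detail rows. A minimal sketch, using the lowercase `"h"` alias in place of `'H'`:

```python
import pandas as pd

detail = pd.Series(
    [12, 7, 7, 7, 1, 80],
    index=pd.to_datetime([
        "2020-09-09 21:00:00", "2020-09-09 21:01:00", "2020-09-09 22:00:00",
        "2020-09-09 23:00:00", "2020-09-10 00:00:00", "2020-09-10 00:10:00",
    ]),
)
# Sum all detail rows that fall into the same clock hour
hourly = detail.resample("h").sum().rename("produced")
print(hourly.tolist())  # [19, 7, 7, 81]
```

The 21:00 bucket collects 12 + 7 and the 00:00 bucket collects 1 + 80, matching the first and fourth rows of the aggregated result.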
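Let's describe the content of prodDet:
There is no row for 2020-09-09 20:00:00, because no source row falls within this hour (your data starts at 21:01:00).
The row 21:00:00 12 comes from the first source row (you forgot about it when writing the expected result).
The rows for 21:01:00, 22:00:00, 23:00:00 and 00:00:00 come from the "partitioning" of the row 00:10:00 a 22, just as in part of your expected result.
The rows with 80 and 50 come from the row containing 130, divided between the rows for 00:10:00 and 01:00:00.
And so on.
Now we start to assemble the final result.
Join prod (converted to a DataFrame) with the event column: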
result = prod.to_frame().join(df2.event)
Add the tsMin column, the minimal ts in each hour (as you asked in one of the comments):
result['tsMin'] = df.duration.resample('H').apply(lambda grp: grp.index.min())
Change the index into a regular column and set its name to ts (as in the source DataFrame):
result = result.reset_index().rename(columns={'index': 'ts'})
The final result is:
                    ts  produced event                tsMin
0  2020-09-09 21:00:00        19     a  2020-09-09 21:01:00
1  2020-09-09 22:00:00         7     a                  NaT
2  2020-09-09 23:00:00         7     a                  NaT
3  2020-09-10 00:00:00        81     a  2020-09-10 00:10:00
4  2020-09-10 01:00:00       222     a  2020-09-10 01:31:00
5  2020-09-10 02:00:00       144     c  2020-09-10 02:01:00
6  2020-09-10 03:00:00        84     b  2020-09-10 03:01:00
7  2020-09-10 04:00:00        86     c  2020-09-10 04:01:00
8  2020-09-10 05:00:00        87     c                  NaT
9  2020-09-10 06:00:00         7     c  2020-09-10 06:05:00
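For example, the value of 81 for 00:00:00 is the sum of 1 and 80 (the first part resulting from the row with 130); see prodDet above.
Some values in the tsMin column are empty (NaT), for hours in which there is no source row.
If you want to drop the result from the first row entirely (the one with duration == 12), change return pd.Series([lstRow.duration]... to return pd.Series([0]... (the 4th line of the parts function).
To sum up, my solution is more pandasonic and significantly shorter than yours (17 lines for mine versus about 70 for yours, excluding comments).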