Python pandas: filter a DataFrame by another Series, on multiple columns
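After getting a Series of the days with the highest number of deliveries, how do I filter the original DataFrame to just those days? Given these two: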
most_liquid_contracts.head(20)
Out[32]:
2007-04-26 706
2007-04-27 706
2007-04-29 706
2007-04-30 706
2007-05-01 706
2007-05-02 706
2007-05-03 706
2007-05-04 706
2007-05-06 706
2007-05-07 706
2007-05-08 706
2007-05-09 706
2007-05-10 706
2007-05-11 706
2007-05-13 706
2007-05-14 706
2007-05-15 706
2007-05-16 706
2007-05-17 706
2007-05-18 706
dtype: int64
df.head(20).to_string
Out[40]:
<bound method DataFrame.to_string of
delivery volume
2007-04-27 11:55:00+01:00 705 1
2007-04-27 13:46:00+01:00 705 1
2007-04-27 14:15:00+01:00 705 1
2007-04-27 14:33:00+01:00 705 1
2007-04-27 14:35:00+01:00 705 1
2007-04-27 17:05:00+01:00 705 16
2007-04-27 17:07:00+01:00 705 1
2007-04-27 17:12:00+01:00 705 1
2007-04-27 17:46:00+01:00 705 1
2007-04-27 18:25:00+01:00 705 2
2007-04-26 23:00:00+01:00 706 10
2007-04-26 23:01:00+01:00 706 12
2007-04-26 23:02:00+01:00 706 1
2007-04-26 23:05:00+01:00 706 21
2007-04-26 23:06:00+01:00 706 10
2007-04-26 23:07:00+01:00 706 19
2007-04-26 23:08:00+01:00 706 1
2007-04-26 23:13:00+01:00 706 10
2007-04-26 23:14:00+01:00 706 62
2007-04-26 23:15:00+01:00 706 3>
I tried:
liquid = df[df.index.date==most_liquid_contracts.index & df['delivery']==most_liquid_contracts]
Or do I need to merge? That seems inelegant, and I am not sure it is right either. I tried:
# ATTEMPT 1
most_liquid_contracts.index = pd.to_datetime(most_liquid_contracts.index, unit='d')
df['days'] = pd.to_datetime(df.index.date, unit='d')
mlc = most_liquid_contracts.to_frame(name='delivery')
mlc['days'] = mlc.index.date
data = pd.merge(mlc, df, on=['delivery', 'days'], left_index=True)
# ATTEMPT 2
liquid = pd.merge(mlc, df, on='delivery', how='inner', left_index=True)
# this gets me closer (ie. retains granularity), but somehow seems to be an outer join? it includes the union but not the intersection. this should be a subset of df, but instead has about x50 the rows, at around 195B. df originally has 4B
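But I can't seem to retain the minute-level granularity I need from the original "df". Essentially, I only need the rows of "df" for the most liquid contract (taken from the most_liquid_contracts Series; e.g., April 27 would include only contracts labeled "706", and April 29 would include only contracts labeled "706"). Then a second DataFrame that is the exact opposite: all the other contracts (i.e., the ones that are not the most liquid).
Update: more info--
The tricky part is combining two Series/DataFrames whose indexes have different datetime resolutions. Once you combine them intelligently, you can filter as usual.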
# Make sure your series has a name
# Make sure the index is pure dates, not date 00:00:00
most_liquid_contracts.name = 'most'
most_liquid_contracts.index = most_liquid_contracts.index.date
data = df.copy()  # copy so the 'day' helper column is not added to df itself
data['day'] = data.index.date
combined = data.join(most_liquid_contracts, on='day', how='left')
Now you can do something like
combined[combined.delivery == combined.most]
This yields the rows of data (df) where data.delivery equals the value in most_liquid_contracts for that day.
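For anyone who wants to try this end to end, here is a minimal sketch of the same join idea on toy data. The sample values, the "Europe/London" timezone, and the names most, is_liquid, liquid and others are my own assumptions for illustration, not from the question. It also builds the "exact opposite" frame the question asks for by negating the mask.

```python
import pandas as pd

# Hypothetical toy data shaped like the question's df: minute-level,
# tz-aware index with 'delivery' and 'volume' columns.
idx = pd.to_datetime([
    "2007-04-26 23:00", "2007-04-26 23:01",
    "2007-04-27 11:55", "2007-04-27 13:46", "2007-04-27 17:05",
]).tz_localize("Europe/London")
df = pd.DataFrame(
    {"delivery": [706, 706, 705, 706, 705], "volume": [10, 12, 1, 1, 16]},
    index=idx,
)

# Hypothetical daily series: the most liquid contract per calendar day.
most_liquid_contracts = pd.Series(
    [706, 706],
    index=pd.to_datetime(["2007-04-26", "2007-04-27"]),
)

# Key the series by plain datetime.date objects and name it for the join.
most = most_liquid_contracts.copy()
most.index = most.index.date
most.name = "most"

# Work on a copy so the original df does not gain a helper column.
data = df.copy()
data["day"] = data.index.date
combined = data.join(most, on="day", how="left")

# Rows whose contract is that day's most liquid one, and the complement.
# Days missing from the series get NaN in 'most' and fall into 'others'.
is_liquid = combined["delivery"] == combined["most"]
liquid = combined.loc[is_liquid, df.columns]
others = combined.loc[~is_liquid, df.columns]
```

Both liquid and others keep the original minute-level index, and together they account for every row of df.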
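Assuming I understand you correctly, the most_liquid_contracts Series holds the N largest delivery counts for some integer N, and you want to filter df to include only the days whose delivery count is high enough to make that list. If so, you can simply drop everything in df that is below the minimum of most_liquid_contracts: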
threshold = min(most_liquid_contracts)
filtered = df[df['delivery'] >= threshold]