根据第二个过滤数据框的结果过滤条件数据框:寻求效率提升
Filter dataframe for conditions based on results from second filtered dataframe: Seeking efficiency boost
正在为我笨重的破损寻求更有效的解决方案
我有一个非常大的数据框,其中包含许多符号的所有详细股票数据。看起来像这样 df
:
symbol time open high low close volume
0 AEHR 1628656380 652 652 652 652 200
1 AEHR 1628660580 646 646 646 646 100
2 AEHR 1628668380 634 634 634 634 300
3 AEHR 1628668920 606 608 606 608 402
4 AEHR 1628669100 615 615 615 615 100
... ... ... ... ... ... ... ...
4266849 UPST 1631303160 26753 26753 26753 26753 163
4266850 UPST 1631303340 26805 26805 26805 26805 231
4266851 UPST 1631303520 26768 26768 26768 26768 226
4266852 UPST 1631303760 26819 26819 26819 26819 1964
4266853 UPST 1631303940 26899 26899 26899 26899 157
我想切片以进行高效的进一步处理。我只想保留与此“交易”数据库列表相关的数据 db_buy
:
Symbol Time
0 AMD 2019-12-12 09:36:00
1 AMD 2020-01-16 09:33:00
2 BITF 2021-08-03 09:47:00
3 DOCN 2021-06-14 09:32:00
4 NVDA 2020-07-29 09:38:00
5 NVDA 2020-09-25 10:34:00
6 UPST 2021-02-09 09:32:00
7 UPST 2021-03-18 09:32:00
我只想将数据保留在
df['symbol']
= df_buy['Symbol']
;并且,
df['time']
> df_buy['Time']
- BDay(25)
我的解决方案感觉可以在几行内完成:
db_buy.sort_index(inplace=True)
no_match = np.nan
# Find start date required for slice
tickers = pd.DataFrame()
tickers['Symbol'] = db_buy['Symbol'].unique()
tickers.sort_index(inplace=True)
print(tickers)
for i, row in tickers.iterrows():
cond = (db_buy['Symbol'] == tickers['Symbol'])
same_symbol = db_buy[cond] # gets df of same tickers ['Symbol', 'Time']
min_date = same_symbol['Time'].min() # returns min value
match = no_match if not cond.any() else min_date
# ^ Returns 'NAN' if cond=false, else match=first row of df which is true
tickers.loc[i, 'Time'] = match
# Get data for stocks in buy list only. Time data starting 25 days prior to buy date to now
cond = (
(df['Symbol'] == tickers['Symbol']) &
(df['Time'] > tickers['Time'] + BDay(-25))
)
df = df[cond]
cond = (db_buy['Symbol'] == tickers['Symbol'])
ValueError:只能比较相同标记的系列对象
您可以对两个条件组合使用简单切片:
# ensure datetime type
df['time'] = pd.to_datetime(df['time'], unit='s')
d_buy['Time'] = pd.to_datetime(d_buy['Time'])
# slice data
df[ df['symbol'].isin(d_buy['Symbol'].unique()) # symbol is in Symbol
& df['time'].gt(d_buy['Time'].min()-pd.Timedelta('25d')) # time > min(Time)-25days
]
输出:
symbol time open high low close volume
4266849 UPST 2021-09-10 19:46:00 26753 26753 26753 26753 163
4266850 UPST 2021-09-10 19:49:00 26805 26805 26805 26805 231
4266851 UPST 2021-09-10 19:52:00 26768 26768 26768 26768 226
4266852 UPST 2021-09-10 19:56:00 26819 26819 26819 26819 1964
4266853 UPST 2021-09-10 19:59:00 26899 26899 26899 26899 157
正在为我笨重的破损寻求更有效的解决方案
我有一个非常大的数据框,其中包含许多符号的所有详细股票数据。看起来像这样 df
:
symbol time open high low close volume
0 AEHR 1628656380 652 652 652 652 200
1 AEHR 1628660580 646 646 646 646 100
2 AEHR 1628668380 634 634 634 634 300
3 AEHR 1628668920 606 608 606 608 402
4 AEHR 1628669100 615 615 615 615 100
... ... ... ... ... ... ... ...
4266849 UPST 1631303160 26753 26753 26753 26753 163
4266850 UPST 1631303340 26805 26805 26805 26805 231
4266851 UPST 1631303520 26768 26768 26768 26768 226
4266852 UPST 1631303760 26819 26819 26819 26819 1964
4266853 UPST 1631303940 26899 26899 26899 26899 157
我想切片以进行高效的进一步处理。我只想保留与此“交易”数据库列表相关的数据 db_buy
:
Symbol Time
0 AMD 2019-12-12 09:36:00
1 AMD 2020-01-16 09:33:00
2 BITF 2021-08-03 09:47:00
3 DOCN 2021-06-14 09:32:00
4 NVDA 2020-07-29 09:38:00
5 NVDA 2020-09-25 10:34:00
6 UPST 2021-02-09 09:32:00
7 UPST 2021-03-18 09:32:00
我只想将数据保留在
df['symbol']
=df_buy['Symbol']
;并且,df['time']
>df_buy['Time']
-BDay(25)
我的解决方案感觉可以在几行内完成:
db_buy.sort_index(inplace=True)
no_match = np.nan
# Find start date required for slice
tickers = pd.DataFrame()
tickers['Symbol'] = db_buy['Symbol'].unique()
tickers.sort_index(inplace=True)
print(tickers)
for i, row in tickers.iterrows():
cond = (db_buy['Symbol'] == tickers['Symbol'])
same_symbol = db_buy[cond] # gets df of same tickers ['Symbol', 'Time']
min_date = same_symbol['Time'].min() # returns min value
match = no_match if not cond.any() else min_date
# ^ Returns 'NAN' if cond=false, else match=first row of df which is true
tickers.loc[i, 'Time'] = match
# Get data for stocks in buy list only. Time data starting 25 days prior to buy date to now
cond = (
(df['Symbol'] == tickers['Symbol']) &
(df['Time'] > tickers['Time'] + BDay(-25))
)
df = df[cond]
cond = (db_buy['Symbol'] == tickers['Symbol'])
ValueError:只能比较相同标记的系列对象
您可以对两个条件组合使用简单切片:
# ensure datetime type
df['time'] = pd.to_datetime(df['time'], unit='s')
d_buy['Time'] = pd.to_datetime(d_buy['Time'])
# slice data
df[ df['symbol'].isin(d_buy['Symbol'].unique()) # symbol is in Symbol
& df['time'].gt(d_buy['Time'].min()-pd.Timedelta('25d')) # time > min(Time)-25days
]
输出:
symbol time open high low close volume
4266849 UPST 2021-09-10 19:46:00 26753 26753 26753 26753 163
4266850 UPST 2021-09-10 19:49:00 26805 26805 26805 26805 231
4266851 UPST 2021-09-10 19:52:00 26768 26768 26768 26768 226
4266852 UPST 2021-09-10 19:56:00 26819 26819 26819 26819 1964
4266853 UPST 2021-09-10 19:59:00 26899 26899 26899 26899 157