按 1 月份的第一天筛选 pandas 个股票代码数据框
Filter pandas stock-ticker dataframe by first day in month of Jan
抱歉,我是 Python 的新手。
我有当前代码:
# Put data into a dataframe
df = pd.DataFrame(ZACKSP_raw_data)
""" Reformat dataframe data """
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')
# Change date format to DD/MM/YYYY
df['date'] = df['date'].dt.strftime('%d/%m/%Y')
# Round closing share price to 2 digits
df['close'] = df['close'].round(2)
# Filter data for Jan
ZACKSP_data_StartOfJanYearMinus1 = df[df['date'] == '05/01/%s' % CurrentYearMinus1]
# Test
print(ZACKSP_data_StartOfJanYearMinus1.head())
其中returns数据格式:
现在我希望数组只保留 1 月份第一次记录的收盘价和 12 月最后一次记录的收盘价(对于每个代码)。我考虑过当天尝试使用通配符,然后使用 head() 或 tail() 之类的东西来实现这一目标,但我正在努力。有什么想法吗?
所有日期时间都排序的解决方案:
我认为第一行需要 concat
with drop_duplicates
,每个 ticker
需要最后一行。
还需要添加新列 year
s 用于每年的第一个和最后一个值,并带有代码。
df['year'] = pd.to_datetime(df['date']).dt.year
df1 = pd.concat([df.drop_duplicates(['ticker', 'year']),
df.drop_duplicates(['ticker', 'year'], keep='last')])
更通用的未排序 datetime
s 解决方案:
c = ['ticker','exchange','date','close']
df = pd.DataFrame({'date':pd.to_datetime(['2017-01-04','2017-01-12',
'2017-01-05',
'2018-01-02','2018-12-27','2017-12-27',
'2018-01-05','2018-01-12','2017-01-05',
'2017-01-12','2018-12-22','2017-12-22']),
'close':[4.56,5.45,4.32,5.67,5.23,4.78,7.43,8.67,
9.32,4.73,2.42,3.45],
'ticker':['BA','BA','BA','BA','BA','BA',
'AAPL','AAPL','AAPL','AAPL','AAPL','AAPL'],
'exchange':['NYSE'] * 6 + ['NSDQ'] * 6}, columns=c)
print (df)
ticker exchange date close
0 BA NYSE 2017-01-04 4.56
1 BA NYSE 2017-01-12 5.45
2 BA NYSE 2017-01-05 4.32
3 BA NYSE 2018-01-02 5.67
4 BA NYSE 2018-12-27 5.23
5 BA NYSE 2017-12-27 4.78
6 AAPL NSDQ 2018-01-05 7.43
7 AAPL NSDQ 2018-01-12 8.67
8 AAPL NSDQ 2017-01-05 9.32
9 AAPL NSDQ 2017-01-12 4.73
10 AAPL NSDQ 2018-12-22 2.42
11 AAPL NSDQ 2017-12-22 3.45
""" Reformat dataframe data """
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')
# Round closing share price to 2 digits
df['close'] = df['close'].round(2)
#sorting dates for first date per ticker is first day in Jan and last day in Dec
df = df.sort_values('date')
#extract years from dates
df['year'] = pd.to_datetime(df['date']).dt.year
#get first rows per tickers and year
df1 = df.drop_duplicates(['ticker', 'year'])
print (df1)
ticker exchange date close year
0 BA NYSE 2017-01-04 4.56 2017
8 AAPL NASDAQ 2017-01-05 9.32 2017
3 BA NYSE 2018-01-02 5.67 2018
6 AAPL NASDAQ 2018-01-05 7.43 2018
#get last row per year and ticker
df2 = df.drop_duplicates(['ticker', 'year'], keep='last')
print (df2)
ticker exchange date close year
11 AAPL NASDAQ 2017-12-22 3.45 2017
5 BA NYSE 2017-12-27 4.78 2017
10 AAPL NASDAQ 2018-12-22 2.42 2018
4 BA NYSE 2018-12-27 5.23 2018
#join DataFrames together and sorting if necessary
df = pd.concat([df1, df2]).sort_values(['ticker','date'])
print (df)
ticker exchange date close year
8 AAPL NASDAQ 2017-01-05 9.32 2017
11 AAPL NASDAQ 2017-12-22 3.45 2017
6 AAPL NASDAQ 2018-01-05 7.43 2018
10 AAPL NASDAQ 2018-12-22 2.42 2018
0 BA NYSE 2017-01-04 4.56 2017
5 BA NYSE 2017-12-27 4.78 2017
3 BA NYSE 2018-01-02 5.67 2018
4 BA NYSE 2018-12-27 5.23 2018
另一种具有不同数据输出的解决方案,聚合 first
和 last
:
""" Reformat dataframe data """
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')
# Round closing share price to 2 digits
df['close'] = df['close'].round(2)
#sorting dates for first date per ticker is first day in Jan and last day in Dec
df = df.sort_values('date')
#extract years from dates
df['year'] = pd.to_datetime(df['date']).dt.year
df = (df.groupby(['ticker','year'])['close']
.agg(['first','last'])
.reset_index())
print (df)
ticker year first last
0 AAPL 2017 9.32 3.45
1 AAPL 2018 7.43 2.42
2 BA 2017 4.56 4.78
3 BA 2018 5.67 5.23
您想 df.groupby('ticker')
,然后按月分组,过滤月 =='Dec' 并取 tail()
,过滤月 =='Jan' 和取head(),然后ungroup()。
(如果您 post 可重现数据,我将 post 代码执行此操作。)
阅读有关 pandas 的文档
Group By: split-apply-combine paradigm, one of the key paradigms in data science. For examples on SO, see tag split-apply-combine.
抱歉,我是 Python 的新手。
我有当前代码:
# Put data into a dataframe
df = pd.DataFrame(ZACKSP_raw_data)
""" Reformat dataframe data """
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')
# Change date format to DD/MM/YYYY
df['date'] = df['date'].dt.strftime('%d/%m/%Y')
# Round closing share price to 2 digits
df['close'] = df['close'].round(2)
# Filter data for Jan
ZACKSP_data_StartOfJanYearMinus1 = df[df['date'] == '05/01/%s' % CurrentYearMinus1]
# Test
print(ZACKSP_data_StartOfJanYearMinus1.head())
其中returns数据格式:
现在我希望数组只保留 1 月份第一次记录的收盘价和 12 月最后一次记录的收盘价(对于每个代码)。我考虑过当天尝试使用通配符,然后使用 head() 或 tail() 之类的东西来实现这一目标,但我正在努力。有什么想法吗?
所有日期时间都排序的解决方案:
我认为第一行需要 concat
with drop_duplicates
,每个 ticker
需要最后一行。
还需要添加新列 year
s 用于每年的第一个和最后一个值,并带有代码。
df['year'] = pd.to_datetime(df['date']).dt.year
df1 = pd.concat([df.drop_duplicates(['ticker', 'year']),
df.drop_duplicates(['ticker', 'year'], keep='last')])
更通用的未排序 datetime
s 解决方案:
c = ['ticker','exchange','date','close']
df = pd.DataFrame({'date':pd.to_datetime(['2017-01-04','2017-01-12',
'2017-01-05',
'2018-01-02','2018-12-27','2017-12-27',
'2018-01-05','2018-01-12','2017-01-05',
'2017-01-12','2018-12-22','2017-12-22']),
'close':[4.56,5.45,4.32,5.67,5.23,4.78,7.43,8.67,
9.32,4.73,2.42,3.45],
'ticker':['BA','BA','BA','BA','BA','BA',
'AAPL','AAPL','AAPL','AAPL','AAPL','AAPL'],
'exchange':['NYSE'] * 6 + ['NSDQ'] * 6}, columns=c)
print (df)
ticker exchange date close
0 BA NYSE 2017-01-04 4.56
1 BA NYSE 2017-01-12 5.45
2 BA NYSE 2017-01-05 4.32
3 BA NYSE 2018-01-02 5.67
4 BA NYSE 2018-12-27 5.23
5 BA NYSE 2017-12-27 4.78
6 AAPL NSDQ 2018-01-05 7.43
7 AAPL NSDQ 2018-01-12 8.67
8 AAPL NSDQ 2017-01-05 9.32
9 AAPL NSDQ 2017-01-12 4.73
10 AAPL NSDQ 2018-12-22 2.42
11 AAPL NSDQ 2017-12-22 3.45
""" Reformat dataframe data """
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')
# Round closing share price to 2 digits
df['close'] = df['close'].round(2)
#sorting dates for first date per ticker is first day in Jan and last day in Dec
df = df.sort_values('date')
#extract years from dates
df['year'] = pd.to_datetime(df['date']).dt.year
#get first rows per tickers and year
df1 = df.drop_duplicates(['ticker', 'year'])
print (df1)
ticker exchange date close year
0 BA NYSE 2017-01-04 4.56 2017
8 AAPL NASDAQ 2017-01-05 9.32 2017
3 BA NYSE 2018-01-02 5.67 2018
6 AAPL NASDAQ 2018-01-05 7.43 2018
#get last row per year and ticker
df2 = df.drop_duplicates(['ticker', 'year'], keep='last')
print (df2)
ticker exchange date close year
11 AAPL NASDAQ 2017-12-22 3.45 2017
5 BA NYSE 2017-12-27 4.78 2017
10 AAPL NASDAQ 2018-12-22 2.42 2018
4 BA NYSE 2018-12-27 5.23 2018
#join DataFrames together and sorting if necessary
df = pd.concat([df1, df2]).sort_values(['ticker','date'])
print (df)
ticker exchange date close year
8 AAPL NASDAQ 2017-01-05 9.32 2017
11 AAPL NASDAQ 2017-12-22 3.45 2017
6 AAPL NASDAQ 2018-01-05 7.43 2018
10 AAPL NASDAQ 2018-12-22 2.42 2018
0 BA NYSE 2017-01-04 4.56 2017
5 BA NYSE 2017-12-27 4.78 2017
3 BA NYSE 2018-01-02 5.67 2018
4 BA NYSE 2018-12-27 5.23 2018
另一种具有不同数据输出的解决方案,聚合 first
和 last
:
""" Reformat dataframe data """
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')
# Round closing share price to 2 digits
df['close'] = df['close'].round(2)
#sorting dates for first date per ticker is first day in Jan and last day in Dec
df = df.sort_values('date')
#extract years from dates
df['year'] = pd.to_datetime(df['date']).dt.year
df = (df.groupby(['ticker','year'])['close']
.agg(['first','last'])
.reset_index())
print (df)
ticker year first last
0 AAPL 2017 9.32 3.45
1 AAPL 2018 7.43 2.42
2 BA 2017 4.56 4.78
3 BA 2018 5.67 5.23
您想 df.groupby('ticker')
,然后按月分组,过滤月 =='Dec' 并取 tail()
,过滤月 =='Jan' 和取head(),然后ungroup()。
(如果您 post 可重现数据,我将 post 代码执行此操作。)
阅读有关 pandas 的文档 Group By: split-apply-combine paradigm, one of the key paradigms in data science. For examples on SO, see tag split-apply-combine.