Pandas 将字符串元素转换为多索引组件
Pandas converting string elements into multi index components
我有这样一个 DataFrame:
pd.read_csv("https://raw.githubusercontent.com/fja05680/sp500/master/S%26P%20500%20Historical%20Components%20%26%20Changes(03-14-2022).csv")
out:
date tickers
0 1996-01-02 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
1 1996-01-03 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
2 1996-01-04 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
3 1996-01-10 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
4 1996-01-11 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
... ... ...
2643 2022-01-20 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2644 2022-02-02 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2645 2022-02-15 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2646 2022-02-17 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2647 2022-03-02 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2648 rows × 2 columns
我想将此数据框转换为如下所示的多索引数据框:
ticker date random value
A 2016-01-04 x
2016-01-05 x
2016-01-06 x
2016-01-07 x
2016-01-08 x
... ... ... ... ... ... ...
ZYXI 2022-03-17 x
2022-03-18 x
2022-03-21 x
2022-03-22 x
2022-03-23 x
如有任何帮助,我们将不胜感激!
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/fja05680/sp500/master/S%26P%20500%20Historical%20Components%20%26%20Changes(03-14-2022).csv")
# convert string to list of tickers
df.tickers=df.tickers.str.split(',')
# explode list to rows
df = df.explode("tickers")
# make multi index, order levels and sort
df = df.set_index(['tickers', 'date']).sort_index()
# create random col
df['random value'] = 'x'
输出:
random value
tickers date
A 2000-06-05 x
2000-06-06 x
2000-06-07 x
2000-06-08 x
2000-06-09 x
... ...
ZTS 2022-01-20 x
2022-02-02 x
2022-02-15 x
2022-02-17 x
2022-03-02 x
[1315027 rows x 1 columns]
你可以试试:
import pandas as pd
import numpy as np
# df = pd.read_csv('http://...')
out = df.assign(tickers=df['tickers'].str.split(',')).explode('tickers')
out = pd.DataFrame({'random': np.random.normal(50, 20, len(out))},
index=pd.MultiIndex.from_frame(out).swaplevel().sort_values())
输出:
>>> out
random
tickers date
A 2000-06-05 49.576047
2000-06-06 80.663479
2000-06-07 67.021320
2000-06-08 39.380321
2000-06-09 39.732465
... ...
ZTS 2022-01-20 39.031418
2022-02-02 49.697928
2022-02-15 23.545380
2022-02-17 44.048933
2022-03-02 41.444091
[1315027 rows x 1 columns]
更新
一个班轮版本:
out = (df.assign(tickers=df['tickers'].str.split(',')).explode('tickers')
.set_index(['tickers', 'date']).sort_index()
.assign(random=lambda x: np.random.normal(50, 20, len(x))))
我有这样一个 DataFrame:
pd.read_csv("https://raw.githubusercontent.com/fja05680/sp500/master/S%26P%20500%20Historical%20Components%20%26%20Changes(03-14-2022).csv")
out:
date tickers
0 1996-01-02 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
1 1996-01-03 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
2 1996-01-04 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
3 1996-01-10 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
4 1996-01-11 AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
... ... ...
2643 2022-01-20 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2644 2022-02-02 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2645 2022-02-15 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2646 2022-02-17 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2647 2022-03-02 A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2648 rows × 2 columns
我想将此数据框转换为如下所示的多索引数据框:
ticker date random value
A 2016-01-04 x
2016-01-05 x
2016-01-06 x
2016-01-07 x
2016-01-08 x
... ... ... ... ... ... ...
ZYXI 2022-03-17 x
2022-03-18 x
2022-03-21 x
2022-03-22 x
2022-03-23 x
如有任何帮助,我们将不胜感激!
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/fja05680/sp500/master/S%26P%20500%20Historical%20Components%20%26%20Changes(03-14-2022).csv")
# convert string to list of tickers
df.tickers=df.tickers.str.split(',')
# explode list to rows
df = df.explode("tickers")
# make multi index, order levels and sort
df = df.set_index(['tickers', 'date']).sort_index()
# create random col
df['random value'] = 'x'
输出:
random value
tickers date
A 2000-06-05 x
2000-06-06 x
2000-06-07 x
2000-06-08 x
2000-06-09 x
... ...
ZTS 2022-01-20 x
2022-02-02 x
2022-02-15 x
2022-02-17 x
2022-03-02 x
[1315027 rows x 1 columns]
你可以试试:
import pandas as pd
import numpy as np
# df = pd.read_csv('http://...')
out = df.assign(tickers=df['tickers'].str.split(',')).explode('tickers')
out = pd.DataFrame({'random': np.random.normal(50, 20, len(out))},
index=pd.MultiIndex.from_frame(out).swaplevel().sort_values())
输出:
>>> out
random
tickers date
A 2000-06-05 49.576047
2000-06-06 80.663479
2000-06-07 67.021320
2000-06-08 39.380321
2000-06-09 39.732465
... ...
ZTS 2022-01-20 39.031418
2022-02-02 49.697928
2022-02-15 23.545380
2022-02-17 44.048933
2022-03-02 41.444091
[1315027 rows x 1 columns]
更新
一个班轮版本:
out = (df.assign(tickers=df['tickers'].str.split(',')).explode('tickers')
.set_index(['tickers', 'date']).sort_index()
.assign(random=lambda x: np.random.normal(50, 20, len(x))))