Pandas: converting string elements into MultiIndex components

I have a DataFrame like this:

pd.read_csv("https://raw.githubusercontent.com/fja05680/sp500/master/S%26P%20500%20Historical%20Components%20%26%20Changes(03-14-2022).csv")

out:
    date               tickers
0   1996-01-02  AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
1   1996-01-03  AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
2   1996-01-04  AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
3   1996-01-10  AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
4   1996-01-11  AAL,AAMRQ,AAPL,ABI,ABS,ABT,ABX,ACKH,ACV,ADM,AD...
... ... ...
2643    2022-01-20  A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2644    2022-02-02  A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2645    2022-02-15  A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2646    2022-02-17  A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2647    2022-03-02  A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,...
2648 rows × 2 columns

I want to convert this DataFrame into a MultiIndex DataFrame like the following:



ticker  date     random value       
A   2016-01-04        x
    2016-01-05        x
    2016-01-06        x
    2016-01-07        x
    2016-01-08        x
... ... ... ... ... ... ...
ZYXI    2022-03-17    x
        2022-03-18    x
        2022-03-21    x
        2022-03-22    x
        2022-03-23    x

Any help would be appreciated!

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/fja05680/sp500/master/S%26P%20500%20Historical%20Components%20%26%20Changes(03-14-2022).csv")

# convert each comma-separated string to a list of tickers
df['tickers'] = df['tickers'].str.split(',')

# explode list to rows
df = df.explode("tickers")

# make multi index, order levels and sort
df = df.set_index(['tickers', 'date']).sort_index()

# create random col
df['random value'] = 'x'

Output:

                   random value
tickers date                   
A       2000-06-05            x
        2000-06-06            x
        2000-06-07            x
        2000-06-08            x
        2000-06-09            x
...                         ...
ZTS     2022-01-20            x
        2022-02-02            x
        2022-02-15            x
        2022-02-17            x
        2022-03-02            x

[1315027 rows x 1 columns]
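
The steps above can also be tried without downloading the CSV. The following is only a minimal sketch on a small made-up frame: the name `toy` and the tickers and dates in it are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical miniature of the real data: one comma-separated string per date
toy = pd.DataFrame({
    "date": ["2022-01-03", "2022-01-04"],
    "tickers": ["AAPL,MSFT", "AAPL,ZTS"],
})

# split each string into a list, then emit one row per list element
long = toy.assign(tickers=toy["tickers"].str.split(",")).explode("tickers")

# move both columns into the index (ticker first) and sort
result = long.set_index(["tickers", "date"]).sort_index()

# placeholder value column, as in the question
result["random value"] = "x"

print(result)
```

Each (ticker, date) pair becomes one row, so the two input rows with two tickers each turn into four rows.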

You can try:

import pandas as pd
import numpy as np

# df = pd.read_csv('http://...')
out = df.assign(tickers=df['tickers'].str.split(',')).explode('tickers')
out = pd.DataFrame({'random': np.random.normal(50, 20, len(out))}, 
                   index=pd.MultiIndex.from_frame(out).swaplevel().sort_values())

Output:

>>> out
                       random
tickers date                 
A       2000-06-05  49.576047
        2000-06-06  80.663479
        2000-06-07  67.021320
        2000-06-08  39.380321
        2000-06-09  39.732465
...                       ...
ZTS     2022-01-20  39.031418
        2022-02-02  49.697928
        2022-02-15  23.545380
        2022-02-17  44.048933
        2022-03-02  41.444091

[1315027 rows x 1 columns]

Update

A one-liner version:

out = (df.assign(tickers=df['tickers'].str.split(',')).explode('tickers')
         .set_index(['tickers', 'date']).sort_index()
         .assign(random=lambda x: np.random.normal(50, 20, len(x))))
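
As a possible follow-up, here is a hedged sketch of the same chain on a small made-up frame. It renames the first index level to `ticker` to match the layout asked for in the question, and then pulls out the rows for one ticker. The `toy` data and the seeded generator `rng` are assumptions added for illustration. They are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real CSV
toy = pd.DataFrame({
    "date": ["2022-01-03", "2022-01-04"],
    "tickers": ["AAPL,MSFT", "AAPL,ZTS"],
})

# seeded generator so repeated runs produce the same numbers
rng = np.random.default_rng(0)

out = (toy.assign(tickers=toy["tickers"].str.split(",")).explode("tickers")
          .set_index(["tickers", "date"]).sort_index()
          .rename_axis(["ticker", "date"])  # level name used in the question
          .assign(random=lambda x: rng.normal(50, 20, len(x))))

# all rows for a single ticker; the remaining index is the date level
aapl = out.loc["AAPL"]
print(aapl)
```

Because the index is sorted, selecting a ticker with `.loc` is a simple lookup on the first level.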