Reshape data with double-string headers in pandas
I have the following DataFrame:
AAPL,Price AAPL,Volume ... GOOG,Ivol GOOG,Shares
Date
0 2019-12-25 21.5 1879 20.0 2010
1 2019-12-26 22.1 1887 19.9 2000
2 2019-12-27 23.0 1888 19.9 2045
3 2019-12-30 22.3 1887 NaN 2050
4 2019-12-31 22.4 1900 20.1 1998
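I need to run regressions on each feature of these stocks, and I have to attach a specific set of dummy variables. My goal is therefore to reshape the data so that it has a two-level index, with the date first and the stock name second, i.e.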
Date Stock Price Volume ... Ivol Shares
2019-12-25 AAPL 21.5 1879 ... 22.1 3121
... ... ... ... ... ... ...
2019-12-25 GOOG 45.8 NaN ... 20.0 2000
...
2019-12-25 VER NaN NaN ... NaN NaN
2019-12-26 AAPL ...
...
2019-12-31 VER 42.4 1900 ... 50.1 1998
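My problem is that I don't know how to handle the column names: they are strings of the form 'Stock1,Feature1', so I don't know how to build a proper two-level index. Can anyone help? Note that the features are the same for every stock, i.e. the list of features does not vary by stock.
Of course, I'm open to other kinds of solutions for reshaping the data.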
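You can use pd.wide_to_long to reshape it. To get the right format, I would first modify the column labels so that 'AAPL,Price' -> 'Price,AAPL'. The feature name then comes first and can serve as the stub name.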
df.columns = [','.join(x[::-1]) for x in df.columns.str.split(',')]
stubs = df.columns.str.split(',').str[0].unique().tolist()
#['Price', 'Volume', 'Ivol', 'Shares']
df = df.reset_index('Date') # Need it out of index
df = pd.wide_to_long(df, i='Date', j='Stock', stubnames=stubs, sep=',', suffix='.*')
#df, based on the 4 columns provided:
Price Volume Ivol Shares
Date Stock
2019-12-25 AAPL 21.5 1879.0 NaN NaN
GOOG NaN NaN 20.0 2010.0
2019-12-26 AAPL 22.1 1887.0 NaN NaN
GOOG NaN NaN 19.9 2000.0
2019-12-27 AAPL 23.0 1888.0 NaN NaN
GOOG NaN NaN 19.9 2045.0
2019-12-30 AAPL 22.3 1887.0 NaN NaN
GOOG NaN NaN NaN 2050.0
2019-12-31 AAPL 22.4 1900.0 NaN NaN
GOOG NaN NaN 20.1 1998.0
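To try the wide_to_long steps end to end, here is a minimal sketch. It assumes a made-up two-stock, two-feature sample with Date as the index; the values and the names `wide` and `tidy` are illustrative and not taken from the question's full data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's layout: 'Stock,Feature' column
# labels, with every stock carrying the same features.
wide = pd.DataFrame(
    {
        "AAPL,Price": [21.5, 22.1],
        "AAPL,Volume": [1879, 1887],
        "GOOG,Price": [45.8, np.nan],
        "GOOG,Volume": [2010, 2000],
    },
    index=pd.Index(["2019-12-25", "2019-12-26"], name="Date"),
)

# Swap 'AAPL,Price' -> 'Price,AAPL' so the feature becomes the stub name.
wide.columns = [",".join(parts[::-1]) for parts in wide.columns.str.split(",")]
stubs = wide.columns.str.split(",").str[0].unique().tolist()

# 'Date' must be a regular column for wide_to_long, hence reset_index().
# suffix='.*' lets the suffix (the ticker) be any string, not just digits.
tidy = pd.wide_to_long(
    wide.reset_index(), stubnames=stubs, i="Date", j="Stock", sep=",", suffix=".*"
).sort_index()
print(tidy)
```

Because every stock has the same features here, each (Date, Stock) row is filled in, apart from values that were already missing in the wide frame.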
Alternatively, and perhaps more simply, you can create a column MultiIndex and then stack:
df.columns = pd.MultiIndex.from_tuples([tuple(x) for x in df.columns.str.split(',')])
df = df.stack(level=0) # Moves the stock level into the index; that index level won't have a name.
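The MultiIndex-then-stack route can be sketched the same way. This assumes a small made-up sample with Date as the index; the level names "Stock" and "Feature" and the variable names `wide` and `panel` are my own choices, not part of the question.

```python
import pandas as pd

# Hypothetical sample with 'Stock,Feature' column labels.
wide = pd.DataFrame(
    {
        "AAPL,Price": [21.5, 22.1],
        "AAPL,Volume": [1879, 1887],
        "GOOG,Price": [45.8, 46.0],
        "GOOG,Volume": [2010, 2000],
    },
    index=pd.Index(["2019-12-25", "2019-12-26"], name="Date"),
)

# Split each label into a (stock, feature) tuple and name the two levels.
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(label.split(",")) for label in wide.columns],
    names=["Stock", "Feature"],
)

# Move the 'Stock' column level into the row index -> (Date, Stock) index.
panel = wide.stack(level="Stock")
print(panel)
```

Passing `names=` when building the MultiIndex is one way to end up with a named stock level after stacking. Depending on the pandas version, `stack` may emit a FutureWarning about its changing implementation, and the two implementations differ in whether rows that are entirely NaN are kept.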