Pandas melt data based on two or more binary columns

I have a dataframe that looks like this, containing price, side and volume parameters from multiple exchanges.

df = pd.DataFrame({
    'price_ex1' : [9380.59650, 9394.85206, 9397.80000],
    'side_ex1' : ['bid', 'bid', 'ask'],
    'size_ex1' : [0.416, 0.053, 0.023],
    'price_ex2' : [9437.24045, 9487.81185, 9497.81424],
    'side_ex2' : ['bid', 'bid', 'ask'],
    'size_ex2' : [10.0, 556.0, 23.0]
})

df


       price_ex1     side_ex1      size_ex1  price_ex2   side_ex2    size_ex2
0     9380.59650          bid         0.416  9437.24045       bid        10.0
1     9394.85206          bid         0.053  9487.81185       bid       556.0
2     9397.80000          ask         0.023  9497.81424       ask        23.0

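For each exchange (I have more than two), I want the index to be the union of all prices from all exchanges (i.e. the union of price_ex1, price_ex2, etc.), sorted from highest to lowest. Then I want to create two size columns per exchange, based on that exchange's side parameter. The output should look like this, where empty cells are NaN.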

I'm not sure which pandas function is best for this, whether it is pivot or melt, or how to use it when I have more than one binary column to flatten.

Thanks for your help!

You could try something like this.

First, save the data you showed us to a file named 'example.csv'.

import pandas as pd
import numpy as np

df = pd.read_csv('example.csv')

df1 = df[['price_ex1','side_ex1','size_ex1']]
df2 = df[['price_ex2','side_ex2','size_ex2']]

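# DataFrame.append was removed in pandas 2.0; sort=True orders the columns
# alphabetically, which the renaming below relies on
df3 = pd.concat([df1, df2], sort=True)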
df4 = df3[['price_ex1','price_ex2']]
arr = df4.values
df3['price_ex1'] = arr[~np.isnan(arr)].astype(float)
df3.drop(columns=['price_ex2'], inplace=True)
df3.columns = ['price', 'bid_ex1', 'ask_ex1', 'bid_ex2', 'ask_ex2']

def change(side_ex1, side_ex2, size_ex1, size_ex2, col_name):
    # With the columns sorted alphabetically, the temporary names bid_ex1, ask_ex1,
    # bid_ex2, ask_ex2 hold side_ex1, side_ex2, size_ex1, size_ex2 respectively.
    # Return the size when the side matches the target column; otherwise return the
    # side label (or NaN), which is replaced with NaN at the end.
    # The original `x != np.nan` checks were always True, so they are dropped.
    if col_name == 'bid_ex1_col':
        return size_ex1 if side_ex1 == 'bid' else side_ex1

    if col_name == 'ask_ex1_col':
        return size_ex1 if side_ex1 == 'ask' else side_ex2

    if col_name == 'ask_ex2_col':
        return size_ex2 if side_ex2 == 'ask' else side_ex2

    if col_name == 'bid_ex2_col':
        return size_ex2 if side_ex2 == 'bid' else side_ex2

df3['bid_ex1_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'bid_ex1_col'), axis=1)
df3['ask_ex1_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'ask_ex1_col'), axis=1)

df3['ask_ex2_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'ask_ex2_col'), axis=1)
df3['bid_ex2_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'bid_ex2_col'), axis=1)

df3.drop(columns=['bid_ex1', 'ask_ex1', 'bid_ex2', 'ask_ex2'], inplace=True)

df3.replace(to_replace='ask', value=np.nan,inplace=True)
df3.replace(to_replace='bid', value=np.nan,inplace=True)
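
For comparison, here is a shorter sketch of the same split-per-exchange idea. It is not the code above, just one way it might be condensed. The `exchanges` list and the names `pieces`, `long_df` and `result` are assumptions made for illustration, and it assumes every price is unique (otherwise `pivot` raises and `pivot_table` with an aggregation would be needed).

```python
import pandas as pd

df = pd.DataFrame({
    'price_ex1': [9380.59650, 9394.85206, 9397.80000],
    'side_ex1': ['bid', 'bid', 'ask'],
    'size_ex1': [0.416, 0.053, 0.023],
    'price_ex2': [9437.24045, 9487.81185, 9497.81424],
    'side_ex2': ['bid', 'bid', 'ask'],
    'size_ex2': [10.0, 556.0, 23.0],
})

# Hypothetical list of exchange suffixes; extend it for more exchanges.
exchanges = ['ex1', 'ex2']

pieces = []
for ex in exchanges:
    # Take one exchange's three columns and give them common names.
    part = df[[f'price_{ex}', f'side_{ex}', f'size_{ex}']].copy()
    part.columns = ['price', 'side', 'size']
    # Tag the side with the exchange, e.g. 'bid' -> 'bid_ex1'.
    part['side'] = part['side'] + '_' + ex
    pieces.append(part)

# Stack the per-exchange pieces, then spread size across the side columns.
long_df = pd.concat(pieces, ignore_index=True)
result = (long_df
          .pivot(index='price', columns='side', values='size')
          .sort_index(ascending=False))
print(result)
```

Cells with no matching side/exchange come out as NaN, and the index is the union of all prices sorted from high to low.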

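This is a three-step process. After converting the columns to a MultiIndex, you stack the dataset and then pivot it.

First, turn the columns into a MultiIndex so that they are easier to reshape: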

df.columns = pd.MultiIndex.from_product([['1', '2'], [col[:-4] for col in df.columns[:3]]], names=['exchange', 'params'])

exchange           1                       2            
params         price side   size       price side   size
0         9380.59650  bid  0.416  9437.24045  bid   10.0
1         9394.85206  bid  0.053  9487.81185  bid  556.0
2         9397.80000  ask  0.023  9497.81424  ask   23.0
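
Since the question mentions more than two exchanges, the MultiIndex can probably be derived from the column names themselves rather than hard-coding ['1', '2']. The following is a sketch under the assumption that every column is named `<param>_ex<number>`; it should give the same columns as the from_product call above for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'price_ex1': [9380.59650, 9394.85206, 9397.80000],
    'side_ex1': ['bid', 'bid', 'ask'],
    'size_ex1': [0.416, 0.053, 0.023],
    'price_ex2': [9437.24045, 9487.81185, 9497.81424],
    'side_ex2': ['bid', 'bid', 'ask'],
    'size_ex2': [10.0, 556.0, 23.0],
})

# Split each name once on '_ex': 'price_ex1' -> ('price', '1'),
# then store it as (exchange, param) to match the level names.
tuples = []
for name in df.columns:
    param, exchange = name.rsplit('_ex', 1)
    tuples.append((exchange, param))

df.columns = pd.MultiIndex.from_tuples(tuples, names=['exchange', 'params'])
print(df)
```

This does not depend on the column order or on the number of exchanges, only on the naming pattern.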

Then stack, and append the exchange number to the bid/ask values:

df = df.swaplevel(axis=1).stack()
df['side'] = df.apply(lambda row: row.side + '_ex' + row.name[1], axis=1)       

params           price     side     size
  exchange                              
0 1         9380.59650  bid_ex1    0.416
  2         9437.24045  bid_ex2   10.000
1 1         9394.85206  bid_ex1    0.053
  2         9487.81185  bid_ex2  556.000
2 1         9397.80000  ask_ex1    0.023
  2         9497.81424  ask_ex2   23.000

Finally, pivot and sort by price:

df.pivot_table(index=['price'], values=['size'], columns=['side']).sort_values('price', ascending=False) 

params        size                        
side       ask_ex1 ask_ex2 bid_ex1 bid_ex2
price                                     
9497.81424     NaN    23.0     NaN     NaN
9487.81185     NaN     NaN     NaN   556.0
9437.24045     NaN     NaN     NaN    10.0
9397.80000   0.023     NaN     NaN     NaN
9394.85206     NaN     NaN   0.053     NaN
9380.59650     NaN     NaN   0.416     NaN

One option is to flip to long form with pivot_longer, then flip back to wide form with pivot_wider, both from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(names_to = ('ex1', 'ex2', 'ex'), 
              values_to=('price','side','size'), 
              names_pattern=['price', 'side', 'size'])
.loc[:, ['price', 'side','ex','size']]
.assign(ex = lambda df: df.ex.str.split('_').str[-1])
.pivot_wider('price', ('side', 'ex'), 'size')
.sort_values('price', ascending = False)
)

        price  bid_ex1  ask_ex1  bid_ex2  ask_ex2
5  9497.81424      NaN      NaN      NaN     23.0
4  9487.81185      NaN      NaN    556.0      NaN
3  9437.24045      NaN      NaN     10.0      NaN
2  9397.80000      NaN    0.023      NaN      NaN
1  9394.85206    0.053      NaN      NaN      NaN
0  9380.59650    0.416      NaN      NaN      NaN
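
If installing pyjanitor is not an option, a similar long-then-wide reshape can be sketched with plain pandas using `pd.wide_to_long` and `pivot`. This is an alternative technique, not what pyjanitor does internally. The names `long_df` and `result` are made up for illustration, and the sketch assumes columns named `<stub>_ex<number>` and unique prices.

```python
import pandas as pd

df = pd.DataFrame({
    'price_ex1': [9380.59650, 9394.85206, 9397.80000],
    'side_ex1': ['bid', 'bid', 'ask'],
    'size_ex1': [0.416, 0.053, 0.023],
    'price_ex2': [9437.24045, 9487.81185, 9497.81424],
    'side_ex2': ['bid', 'bid', 'ask'],
    'size_ex2': [10.0, 556.0, 23.0],
})

# Long form: one row per (original row, exchange) with price, side, size columns.
# reset_index() supplies the row id column ('index') that wide_to_long requires.
long_df = pd.wide_to_long(
    df.reset_index(),
    stubnames=['price', 'side', 'size'],
    i='index',
    j='ex',
    sep='_',
    suffix=r'ex\d+',
).reset_index()

# Tag each side with its exchange, e.g. 'bid' -> 'bid_ex1'.
long_df['side'] = long_df['side'] + '_' + long_df['ex']

# Wide form: prices as the index, one size column per side/exchange pair.
result = (long_df
          .pivot(index='price', columns='side', values='size')
          .sort_index(ascending=False))
print(result)
```

Here price ends up as the index rather than a regular column, and the columns are in alphabetical order; call `reset_index()` or reorder the columns if the exact layout above is needed.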