Multi-index dataframe causes wide separation between plotted data

Question

I have the following plot:

My pandas dataset uses a MultiIndex, like this:

Below is my code:

ax = plt.gca()

df['adjClose'].plot(ax=ax, figsize=(12,4), rot=9, grid=True, label='price', color='orange')
df['ma5'].plot(ax=ax, label='ma5', color='yellow')
df['ma100'].plot(ax=ax, label='ma100', color='green')

# df.plot.scatter(x=df.index, y='buy')
x = pd.to_datetime(df.unstack(level=0).index, format='%Y/%m/%d')

# plt.scatter(x, df['buy'].values)
ax.scatter(x, y=df['buy'].values, label='buy', marker='^', color='red')
ax.scatter(x, y=df['sell'].values, label='sell', marker='v', color='green')

plt.show()

The data comes from a `.csv` file:

symbol,date,close,high,low,open,volume,adjClose,adjHigh,adjLow,adjOpen,adjVolume,divCash,splitFactor,ma5,ma100,buy,sell
601398,2020-01-01 00:00:00+00:00,5.88,5.88,5.88,5.88,0,5.2991971571,5.2991971571,5.2991971571,5.2991971571,0,0.0,1.0,,,,
601398,2020-01-02 00:00:00+00:00,5.97,6.03,5.91,5.92,234949400,5.3803073177,5.4343807581,5.3262338773,5.3352461174,234949400,0.0,1.0,,,,
601398,2020-01-03 00:00:00+00:00,5.99,6.02,5.96,5.97,152213050,5.3983317978,5.425368518,5.3712950777,5.3803073177,152213050,0.0,1.0,,,,
601398,2020-01-06 00:00:00+00:00,5.97,6.05,5.95,5.96,226509710,5.3803073177,5.4524052382,5.3622828376,5.3712950777,226509710,0.0,1.0,,,,

The data above is what I see after saving to csv, but after reloading it, the dataframe loses its original structure, as shown below:

Answer 1

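As the plot shows, the problem is that the first 3 lines are plotted against the dataframe index, which is displayed as tuples. The scatter plots are plotted against the datetime values x, which are not values on the x-axis of ax, so they end up far to the right.
- The axis ticks are a bunch of stacked tuples, one per (symbol, date) index entry.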
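A minimal sketch of why the two sets of points end up far apart, assuming a (symbol, date) MultiIndex like the one in the question. The sample values are rounded from the question's data, the Agg backend is chosen only so the sketch runs without a display, and the names `s`, `line_x` and `scatter_x` are made up for illustration. It compares the x positions used by the pandas line plot with the numbers matplotlib uses for the same dates:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: one symbol, four dates, indexed by (symbol, date)
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"])
idx = pd.MultiIndex.from_product([[601398], dates], names=["symbol", "date"])
s = pd.Series([5.30, 5.38, 5.40, 5.38], index=idx, name="adjClose")

fig, ax = plt.subplots()
s.plot(ax=ax)  # line plot against the MultiIndex

# x positions pandas used for the line (one integer position per row)
line_x = ax.get_lines()[0].get_xdata()

# x positions matplotlib uses for the same dates (days since its date epoch)
scatter_x = mdates.date2num(dates.to_pydatetime())

print("line x positions:   ", list(line_x))
print("scatter x positions:", list(scatter_x))
plt.close(fig)
```

The line sits at small integer positions, while the dates map to day counts that are far larger, which is consistent with the scatter markers landing far to the right of the lines.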
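Don't convert the dataframe to a MultiIndex. If some step in your workflow creates the MultiIndex, run df.reset_index(level=x, inplace=True), where x is the level of 'symbol' in the MultiIndex.
- After removing 'symbol' from the index, use df.index = pd.to_datetime(df.index).date so that the index holds plain date values.
Presumably there is more than one 'symbol' in the dataframe, so a separate plot should be drawn for each.
Tested in pandas 1.3.1, python 3.8, and matplotlib 3.4.2: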

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# load the data from the csv
df = pd.read_csv('file.csv')

# convert date to a datetime format and extract only the date component
df.date = pd.to_datetime(df.date).dt.date

# set date as the index
df.set_index('date', inplace=True)

# this is what the dataframe should look like before plotting
            symbol  close  high   low  open     volume  adjClose  adjHigh  adjLow  adjOpen  adjVolume  divCash  splitFactor  ma5  ma100  buy  sell
date                                                                                                                                              
2020-01-01  601398   5.88  5.88  5.88  5.88          0      5.30     5.30    5.30     5.30          0      0.0          1.0  NaN    NaN  NaN   NaN
2020-01-02  601398   5.97  6.03  5.91  5.92  234949400      5.38     5.43    5.33     5.34  234949400      0.0          1.0  NaN    NaN  NaN   NaN
2020-01-03  601398   5.99  6.02  5.96  5.97  152213050      5.40     5.43    5.37     5.38  152213050      0.0          1.0  NaN    NaN  NaN   NaN
2020-01-06  601398   5.97  6.05  5.95  5.96  226509710      5.38     5.45    5.36     5.37  226509710      0.0          1.0  NaN    NaN  NaN   NaN

# extract the unique symbols
symbols = df.symbol.unique()

# get the number of unique symbols
sym_len = len(symbols)

# create a number of subplots based on the number of unique symbols in df
fig, axes = plt.subplots(nrows=sym_len, ncols=1, figsize=(12, 4*sym_len))

# if there's only 1 symbol, axes won't be iterable, so we put it in a list
if not isinstance(axes, np.ndarray):
    axes = [axes]

# iterate through each symbol and plot the relevant data to an axes
for ax, sym in zip(axes, symbols):
    
    # select the data for the relevant symbol
    data = df[df.symbol.eq(sym)]
    
    # plot data
    data[['adjClose', 'ma5', 'ma100']].plot(ax=ax, title=f'Data for Symbol: {sym}', ylabel='Value')
    ax.scatter(data.index, y=data['buy'], label='buy', marker='^', color='red')
    ax.scatter(data.index, y=data['sell'], label='sell', marker='v', color='green')
    ax.legend(bbox_to_anchor=(1, 1.02), loc='upper left')
    
fig.tight_layout()

data.high and data.low were plotted as the scatter plots, because data.buy and data.sell are np.nan in the test data.

df can conveniently be created with:

sample = {'symbol': [601398, 601398, 601398, 601398], 'date': ['2020-01-01 00:00:00+00:00', '2020-01-02 00:00:00+00:00', '2020-01-03 00:00:00+00:00', '2020-01-06 00:00:00+00:00'], 'close': [5.88, 5.97, 5.99, 5.97], 'high': [5.88, 6.03, 6.02, 6.05], 'low': [5.88, 5.91, 5.96, 5.95], 'open': [5.88, 5.92, 5.97, 5.96], 'volume': [0, 234949400, 152213050, 226509710], 'adjClose': [5.2991971571, 5.3803073177, 5.3983317978, 5.3803073177], 'adjHigh': [5.2991971571, 5.4343807581, 5.425368518, 5.4524052382], 'adjLow': [5.2991971571, 5.3262338773, 5.3712950777, 5.3622828376], 'adjOpen': [5.2991971571, 5.3352461174, 5.3803073177, 5.3712950777], 'adjVolume': [0, 234949400, 152213050, 226509710], 'divCash': [0.0, 0.0, 0.0, 0.0], 'splitFactor': [1.0, 1.0, 1.0, 1.0], 'ma5': [np.nan, np.nan, np.nan, np.nan], 'ma100': [np.nan, np.nan, np.nan, np.nan], 'buy': [np.nan, np.nan, np.nan, np.nan], 'sell': [np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(sample)
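For a dataframe that already carries a (symbol, date) MultiIndex, the reset_index step recommended in this answer might look like the sketch below. The frame `multi` and its values are illustrative assumptions (a small subset of the question's columns), and the level is addressed by its name 'symbol' rather than by number:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a (symbol, date) MultiIndex, as in the question
dates = pd.to_datetime([
    "2020-01-01 00:00:00+00:00",
    "2020-01-02 00:00:00+00:00",
    "2020-01-03 00:00:00+00:00",
    "2020-01-06 00:00:00+00:00",
])
idx = pd.MultiIndex.from_product([[601398], dates], names=["symbol", "date"])
multi = pd.DataFrame(
    {"adjClose": [5.30, 5.38, 5.40, 5.38], "buy": [np.nan] * 4},
    index=idx,
)

# Move 'symbol' out of the index and back into a regular column
flat = multi.reset_index(level="symbol")

# Keep only the date component, so lines and scatter points share one x-axis
flat.index = pd.to_datetime(flat.index).date

print(flat)
```

After this, `flat` has the same shape as the dataframe built from the csv above: 'symbol' is an ordinary column and the index holds plain dates, so the per-symbol plotting loop can be applied to it unchanged.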

Answer 2

I figured out another way to solve my problem:

df = df.unstack(level=0)

I tested this and it works.

I think it is similar to @Trenton's last suggestion, shown below:

df.reset_index(level=0, inplace=True)
df.index = df.index.date
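As a rough illustration of what the unstack approach does, assuming 'symbol' is level 0 of a (symbol, date) MultiIndex, the sketch below shows that 'symbol' moves from the index into the columns, leaving the dates as the only index level. The frame `multi` and its values are made up for the example:

```python
import pandas as pd

# Hypothetical frame with a (symbol, date) MultiIndex
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"])
idx = pd.MultiIndex.from_product([[601398], dates], names=["symbol", "date"])
multi = pd.DataFrame(
    {"adjClose": [5.30, 5.38, 5.40, 5.38], "adjHigh": [5.30, 5.43, 5.43, 5.45]},
    index=idx,
)

# Pivot level 0 ('symbol') from the row index into the columns
wide = multi.unstack(level=0)

print(type(wide.index).__name__)  # the rows are now indexed by date only
print(wide.columns.tolist())      # each column is keyed by (field, symbol)
print(wide["adjClose"])           # one column per symbol
```

With the dates as the sole index, line plots and date-based scatter points share the same x-axis. Unlike reset_index, this layout yields one column per symbol for each field, so selecting `wide["adjClose"]` returns a dataframe rather than a series.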


Tags: python, plot, scatter, matplotlib, pandas
