Visualise missing values in a time series heatmap

Question

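I am really new to big data analysis. Suppose I have a large dataset with the following features. I want to visualise the percentage of missing values (None values) of the fuel parameter for every id in a specific hour. I want to draw a chart where the x-axis is the time series (time column), the y-axis is the 'id', and the colour indicates the percentage of missing fuel values. I grouped the data by 'id' and 'hour'.

I don't know how to visualise the missing values in a good way for all ids. For example, if the percentage of missing fuel values for a specific id in a specific hour is 100%, the colour for that hour and 'id' can be gray. If the percentage of missing fuel values is 50%, the colour can be light green. If it is 0%, the colour can be dark green. After grouping by id and time, the colour must be based on the percentage of missing fuel values.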

    id    time                   fuel
0   1     2022-02-26 19:08:33    100
2   1     2022-02-26 20:09:35    None
3   2     2022-02-26 21:09:35    70
4   3     2022-02-26 21:10:55    60
5   4     2022-02-26 21:10:55    None
6   5     2022-02-26 22:12:43    50
7   6     2022-02-26 23:10:50    None

For example, in the following code I calculate the percentage of missing values per hour for each id:

df.set_index('time').groupby(['id', pd.Grouper(freq='H')])['fuel'].apply(lambda x: x.isnull().mean() * 100)

Is there any solution?

Answer 1

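There is no single right answer for visualising missing values; I guess it depends on your use case, your habits, and so on.

But first, to make it work, we need to preprocess your dataframe and make it analysable, which means making sure of its dtypes.

First, let's build our data: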

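import numpy as np
import pandas as pd
from io import StringIO

csvfile = StringIO(
"""id,time,fuel
1,2022-02-26 19:08:33,100
2,2022-02-26 19:09:35,70
3,2022-02-26 19:10:55,60
4,2022-02-26 20:10:55,None
5,2022-02-26 21:12:43,50
6,2022-02-26 22:10:50,None""")
# keep_default_na=False keeps 'None' as a literal string, as in the output below
# (recent pandas versions may otherwise parse it as NaN while reading)
df = pd.read_csv(csvfile, sep=',', keep_default_na=False)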

df
Out[65]: 
   id                 time  fuel
0   1  2022-02-26 19:08:33   100
1   2  2022-02-26 19:09:35    70
2   3  2022-02-26 19:10:55    60
3   4  2022-02-26 20:10:55  None
4   5  2022-02-26 21:12:43    50
5   6  2022-02-26 22:10:50  None

At this stage, almost all the data in our dataframe is string-typed, so you need to convert fuel and time to non-object dtypes.

df.dtypes
Out[66]: 
id       int64
time    object
fuel    object
dtype: object

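time should be converted to datetime, id to int, and fuel to float. Indeed, None should be converted to np.nan for numeric values, which requires a float dtype.

With a map, we can easily change all 'None' values to np.nan. I won't go deeper here, but for simplicity's sake I will use a custom subclass of dict with a __missing__ implementation: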

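# the format must match the data, which separates date parts with dashes
df.time = pd.to_datetime(df.time, format = "%Y-%m-%d %H:%M:%S")

class dict_with_missing(dict):
    # keys that are not in the dict are returned unchanged instead of becoming NaN
    def __missing__(self, key):
        return key
map_dict = dict_with_missing({'None' : np.nan})
df.fuel = df.fuel.map(map_dict).astype(np.float32)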
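
As a minimal sketch of why the dict subclass is needed (the series `raw` and the class name `DictWithMissing` are made up for illustration): `Series.map` with a plain dict turns every value that is not a key into NaN, whereas a dict subclass defining `__missing__` lets unmatched values pass through unchanged.

```python
import numpy as np
import pandas as pd

class DictWithMissing(dict):
    # called by pandas for every value that is not a key of the dict
    def __missing__(self, key):
        return key

# hypothetical raw column, stored as strings like the fuel column after read_csv
raw = pd.Series(['100', 'None', '70'], dtype=object)

# plain dict: values that are not keys ('100', '70') are lost and become NaN
plain = raw.map({'None': np.nan})

# dict subclass: only 'None' is replaced, everything else is kept as-is
passthrough = raw.map(DictWithMissing({'None': np.nan}))
fuel = passthrough.astype(np.float32)

print(plain.isna().tolist())
print(fuel.tolist())
```

Only the second variant keeps the numeric strings, so it is the one that can be cast to a float dtype afterwards.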

Then we have a clean dataframe:

df
Out[68]: 
   id                time   fuel
0   1 2022-02-26 19:08:33  100.0
1   2 2022-02-26 19:09:35   70.0
2   3 2022-02-26 19:10:55   60.0
3   4 2022-02-26 20:10:55    NaN
4   5 2022-02-26 21:12:43   50.0
5   6 2022-02-26 22:10:50    NaN

df.dtypes
Out[69]: 
id               int64
time    datetime64[ns]
fuel           float32
dtype: object

Then you can easily use bar, matrix or heatmap from the missingno module:

import missingno as msno

msno.bar(df)
msno.matrix(df, sparkline=False)
msno.heatmap(df, cmap="RdYlGn")

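As a side note, the heatmap is of limited use here, since it compares columns that have missing values and you only have one such column. For a bigger dataframe (around 5 or 6 columns with missing values) it can be useful.

For a quick and dirty visualisation, you can also print the number of missing values (aka np.nan, in pandas/numpy terms):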

df.isna().sum()
Out[72]: 
id      0
time    0
fuel    2
dtype: int64
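
If a percentage is more convenient than a count, the same idea can be sketched with `isna().mean()`. The block below is only an illustration: it rebuilds the cleaned sample frame by hand so that it runs on its own, and the variable name `missing_pct` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# hand-built copy of the cleaned sample dataframe shown above
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'time': pd.to_datetime([
        '2022-02-26 19:08:33', '2022-02-26 19:09:35', '2022-02-26 19:10:55',
        '2022-02-26 20:10:55', '2022-02-26 21:12:43', '2022-02-26 22:10:50']),
    'fuel': [100, 70, 60, np.nan, 50, np.nan],
})

# the mean of a boolean mask is the fraction of True values, i.e. of missing entries
missing_pct = df.isna().mean() * 100
print(missing_pct.round(1))
```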

Answer 2

Update: The heatmap now plots id vs time vs percentage of nulls in fuel. I have kept my original answer for id vs time vs fuel at the end of the post.

I want something almost like a github style calendar.

To mimic the GitHub contribution matrix, reset the grouped null percentages into a dataframe and pivot into 1 id per row and 1 hour per column. Then use sns.heatmap to colour each cell based on its percentage of null fuel.

import pandas as pd
import seaborn as sns

# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# compute null percentage per (id, hour)
nulls = (df.set_index('time')
           .groupby(['id', pd.Grouper(freq='H')])['fuel']
           .apply(lambda x: x.isnull().mean() * 100))

# pivot into id vs time matrix
matrix = (nulls.reset_index(name='null (%)')
               .pivot(index='id', columns='time', values='null (%)'))

# plot time series heatmap
sns.heatmap(matrix, square=True, vmin=0, vmax=100, cmap='magma_r', cbar_kws={'label': 'null (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))
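
The question asks for dark green at 0%, light green at 50% and gray at 100%. One possible way to get such a scale, sketched here as an assumption rather than as part of the original answer, is a custom matplotlib colormap. The name `missing_fuel` and the exact colour names are illustrative choices.

```python
from matplotlib.colors import LinearSegmentedColormap, to_rgba

# three anchor colours, spread evenly over the normalised range [0, 1]
missing_cmap = LinearSegmentedColormap.from_list(
    'missing_fuel', ['darkgreen', 'lightgreen', 'gray'])

# with vmin=0 and vmax=100, percentages are normalised to [0, 1] before lookup
for pct in (0, 50, 100):
    r, g, b, _ = missing_cmap(pct / 100)
    print(f'{pct:3d}% -> rgb({r:.2f}, {g:.2f}, {b:.2f})')
```

The resulting object could then be passed as `cmap=missing_cmap` in the `sns.heatmap` call above, keeping `vmin=0, vmax=100` so that the three anchors line up with 0%, 50% and 100%.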

Original: This is for visualising id by time by fuel:

Pivot into an id vs time matrix. Normally pivot is fine, but since your real data contains duplicate indexes, use pivot_table.
resample the time columns into hourly means.
Plot the time series matrix using sns.heatmap.
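
The difference between pivot and pivot_table mentioned in the first step can be sketched on a tiny made-up frame in which one (id, time) pair occurs twice. The data and the flag `pivot_failed` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# made-up data: id 1 has two readings with the same timestamp
df = pd.DataFrame({
    'id': [1, 1, 2],
    'time': pd.to_datetime(['2022-02-26 19:00:00',
                            '2022-02-26 19:00:00',
                            '2022-02-26 20:00:00']),
    'fuel': [100.0, 50.0, 70.0],
})

# pivot cannot reshape duplicate (index, column) pairs and raises ValueError
try:
    df.pivot(index='id', columns='time', values='fuel')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (mean by default)
matrix = df.pivot_table(index='id', columns='time', values='fuel')
print(pivot_failed)
print(matrix)
```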

# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# pivot into id vs time matrix
matrix = df.pivot_table(index='id', columns='time', values='fuel', dropna=False)

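# resample columns into hourly means (transpose so that the timestamps form the
# index, which avoids the axis=1 argument that newer pandas versions deprecate)
matrix = matrix.T.resample('H').mean().T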

# plot time series heatmap
sns.heatmap(matrix, square=True, cmap='plasma_r', vmin=0, vmax=100, cbar_kws={'label': 'fuel (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))

Tags: python, heatmap, missing-data, dataframe, pandas