Plotting a column with millions of rows

Question

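I have a data frame with millions of rows (nearly 8 million). I need to see the distribution of the values in one of its columns, called 'price_per_mile'. I also have a column called 'Borough'. The end goal is to run t-tests. First I want to look at the distribution of 'price_per_mile' to check whether the data is normally distributed and whether any data cleaning is needed. Then I want to group the rows by the five categories in the 'borough' column and run a t-test on every possible pair of boroughs.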

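I tried to plot the distribution with sns.distplot(), but it did not give me a clear plot, because it appears to rescale the values on the y-axis. The range of values in 'price_per_mile' is also very large.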

I then tried plotting only a subset of the values, but the plot was still neither clear nor informative. The y-axis was rescaled again.

result.drop(result[(result.price_per_mile <1) | (result.price_per_mile>200)].index, inplace=True)

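What do I need to do to get a better plot that shows the actual count in each bin rather than a normalized value? I read the documentation for sns.distplot() but found nothing useful.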

Answer 1

According to the documentation for distplot (emphasis mine):

norm_hist : bool, optional

If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.

This means that if you want a non-normalized histogram, you must make sure seaborn does not draw a KDE at the same time:

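sns.distplot(a, kde=True, norm_hist=False)   # a KDE is drawn, so the heights are still densities

sns.distplot(a, kde=False, norm_hist=False)  # no KDE, so the heights are raw counts per bin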
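If the goal is only to see raw counts per bin, the same result can also be computed without seaborn. The following is a minimal sketch, not the asker's actual code: the `prices` array is synthetic stand-in data for `result["price_per_mile"]`, and the choice of 50 log-spaced bins and the output file name are assumptions made purely for illustration. Log-spaced bins are one way to cope with the wide range of values mentioned in the question.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for result["price_per_mile"] (assumption: the real data is not available here).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=1.5, sigma=0.8, size=1_000_000)

# Same filter as in the question: keep only values between 1 and 200.
prices = prices[(prices >= 1) & (prices <= 200)]

# 50 log-spaced bins from 1 to 200 (the bin count is an arbitrary choice).
edges = np.geomspace(1, 200, 51)

# np.histogram returns plain counts per bin unless density=True is passed.
counts, edges = np.histogram(prices, bins=edges)

fig, ax = plt.subplots()
ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge", edgecolor="black")
ax.set_xscale("log")
ax.set_xlabel("price_per_mile")
ax.set_ylabel("number of rows")
fig.savefig("price_per_mile_hist.png")  # hypothetical output file name
```

Computing the counts once with `np.histogram` and drawing them with `ax.bar` keeps the plotting step cheap, since only the 50 bin heights are drawn rather than millions of points. Note also that in seaborn 0.11 and later `distplot` is deprecated; its replacement `sns.histplot` shows counts by default.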


Tags: python, scaling, matplotlib, python-3.x, seaborn