pandas: pandas.DataFrame.describe returns information on only one column

For a particular Kaggle dataset (the rules forbid me from sharing the data here, but it is easy to access), when I run:

import pandas
df_train = pandas.read_csv(
    "01 - Data/act_train.csv.zip"
)
df_train.describe()

I get:

>>> df_train.describe()
            outcome
count  2.197291e+06
mean   4.439544e-01
std    4.968491e-01
min    0.000000e+00
25%    0.000000e+00
50%    0.000000e+00
75%    1.000000e+00
max    1.000000e+00

whereas for the same dataset, df_train.columns gives me:

>>> df_train.columns
Index(['people_id', 'activity_id', 'date', 'activity_category', 'char_1',
       'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8',
       'char_9', 'char_10', 'outcome'],
      dtype='object')

and df_train.dtypes gives me:

>>> df_train.dtypes
people_id            object
activity_id          object
date                 object
activity_category    object
char_1               object
char_2               object
char_3               object
char_4               object
char_5               object
char_6               object
char_7               object
char_8               object
char_9               object
char_10              object
outcome               int64
dtype: object

Am I missing some reason why pandas describes only one column of the dataset?

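By default, describe only works on columns with a numeric dtype. Here, outcome (int64) is the only numeric column, so it is the only one summarized. To summarize every column, add the keyword argument include='all'. From the documentation: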

If include is the string ‘all’, the output column-set will match the input one.
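
As a minimal sketch of the difference, the snippet below uses a small made-up DataFrame (the values are hypothetical and only borrow column names from the question) to compare the default call with include='all':

```python
import pandas as pd

# Hypothetical toy data standing in for the Kaggle file; only the column
# names are borrowed from the question.
df = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2", "ppl_2", "ppl_3"],
    "char_1": ["type 1", "type 2", "type 2", "type 2"],
    "outcome": [0, 1, 1, 0],
})

# Default: only numeric columns are summarized.
default_summary = df.describe()

# include="all": every input column appears in the output.
full_summary = df.describe(include="all")

print(list(default_summary.columns))
print(list(full_summary.columns))
```

In the include='all' output, statistics that do not apply to a column (for example mean for a string column) show up as NaN.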

To clarify, the default arguments to describe are include=None, exclude=None, which results in this behavior:

None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
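
The fallback described in that quote can be checked with a frame that has no numeric columns at all. This is only a sketch on hypothetical data, not the Kaggle file:

```python
import pandas as pd

# Hypothetical frame containing only string columns.
text_only = pd.DataFrame({
    "activity_category": ["type 1", "type 2", "type 2"],
    "char_1": ["a", "b", "b"],
})

# With no numeric columns present, the default call falls back to
# summarizing the non-numeric columns instead of returning nothing.
fallback_summary = text_only.describe()
print(fallback_summary)
```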

Also, from the Notes section:

The output DataFrame index depends on the requested dtypes:

For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles.

For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items.
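
To see the non-numeric index in practice, one option is to exclude the numeric columns from a mixed frame. The data below is hypothetical, and exclude=[np.number] is just one way to select the string columns:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed frame: one string column, one integer column.
df = pd.DataFrame({
    "char_1": ["type 1", "type 2", "type 2", "type 2"],
    "outcome": [0, 1, 1, 0],
})

# Drop numeric columns so only the string column is described.
non_numeric_summary = df.describe(exclude=[np.number])

# "top" is the most common value and "freq" is how often it occurs.
top_value = non_numeric_summary.loc["top", "char_1"]
top_count = non_numeric_summary.loc["freq", "char_1"]
print(non_numeric_summary)
```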

Try the code below:

import pandas
df_train = pandas.read_csv(
    "01 - Data/act_train.csv.zip"
)

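def describe_categorical(df):
    # The package name is IPython (capital I and P); "Ipython" raises ImportError.
    from IPython.display import display, HTML
    # Keep only the object-dtype (string) columns.
    object_columns = df.columns[df.dtypes == "object"]
    # Summarize them and render the result as an HTML table (for Jupyter/IPython).
    display(HTML(df[object_columns].describe().to_html()))

describe_categorical(df_train)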