pandas: pandas.DataFrame.describe returns information on only one column
For a certain Kaggle dataset (the rules prohibit me from sharing the data here, but it is readily accessible here), when I run
import pandas
df_train = pandas.read_csv(
"01 - Data/act_train.csv.zip"
)
df_train.describe()
I get:
>>> df_train.describe()
            outcome
count  2.197291e+06
mean   4.439544e-01
std    4.968491e-01
min    0.000000e+00
25%    0.000000e+00
50%    0.000000e+00
75%    1.000000e+00
max    1.000000e+00
whereas for the same dataset, df_train.columns gives me:
>>> df_train.columns
Index(['people_id', 'activity_id', 'date', 'activity_category', 'char_1',
       'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8',
       'char_9', 'char_10', 'outcome'],
      dtype='object')
and df_train.dtypes gives me:
>>> df_train.dtypes
people_id            object
activity_id          object
date                 object
activity_category    object
char_1               object
char_2               object
char_3               object
char_4               object
char_5               object
char_6               object
char_7               object
char_8               object
char_9               object
char_10              object
outcome               int64
dtype: object
Is there a reason I am missing for why pandas describes only one column of the dataset?
By default, describe only operates on numeric dtype columns. Add the keyword argument include='all' to describe every column. From the documentation:
If include is the string ‘all’, the output column-set will match the
input one.
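A minimal sketch of the difference, using a small made-up DataFrame as a stand-in for the Kaggle data (the values below are hypothetical and only mimic the column names in the question):

```python
import pandas as pd

# Hypothetical stand-in data: two string columns and one integer column.
df = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2", "ppl_2", "ppl_3"],
    "char_1": ["type 1", "type 2", "type 2", "type 2"],
    "outcome": [0, 1, 1, 0],
})

default_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # every column of the input

print(list(default_summary.columns))  # ['outcome']
print(list(full_summary.columns))     # ['people_id', 'char_1', 'outcome']
```

With include='all', statistics that do not apply to a column (for example, mean on a string column) show up as NaN in the combined result.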
To clarify, the default arguments to describe are include=None, exclude=None. The resulting behavior is:
None to both (default). The result will include only numeric-typed
columns or, if none are, only categorical columns.
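The fallback described in that quote can be checked with a frame that has no numeric columns at all. This is only a sketch on invented data; the column names are borrowed from the question, the values are not:

```python
import pandas as pd

# Hypothetical data with string columns only, so there is nothing numeric to summarize.
df_strings = pd.DataFrame({
    "activity_category": ["type 1", "type 2", "type 2"],
    "char_1": ["a", "b", "b"],
})

# With the default include=None, exclude=None, describe falls back to
# summarizing the non-numeric columns instead of returning nothing.
summary = df_strings.describe()

print(list(summary.columns))  # ['activity_category', 'char_1']
print(list(summary.index))    # ['count', 'unique', 'top', 'freq']
```

In the question's dataset the fallback never triggers, because the single int64 column (outcome) is enough for describe to restrict itself to numeric columns.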
Also, from the Notes section:
The output DataFrame index depends on the requested dtypes:
For numeric dtypes, it will include: count, mean, std, min, max, and
lower, 50, and upper percentiles.
For object dtypes (e.g. timestamps or strings), the index will include
the count, unique, most common, and frequency of the most common.
Timestamps also include the first and last items.
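To see the two index layouts side by side, the sketch below describes the numeric and the non-numeric columns of a small invented frame separately. The data is hypothetical; include=[np.number] and exclude=[np.number] are the selectors used here to split the columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one string column and one integer column.
df = pd.DataFrame({
    "char_1": ["type 1", "type 2", "type 2", "type 2"],
    "outcome": [0, 1, 1, 0],
})

numeric_summary = df.describe(include=[np.number])      # only 'outcome'
non_numeric_summary = df.describe(exclude=[np.number])  # only 'char_1'

print(list(numeric_summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(list(non_numeric_summary.index))
# ['count', 'unique', 'top', 'freq']
```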
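Try the code below, which describes only the object-dtype columns and renders the result as an HTML table (this assumes an IPython/Jupyter session):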
import pandas
from IPython.display import display, HTML

df_train = pandas.read_csv(
    "01 - Data/act_train.csv.zip"
)

def describe_categorical(df):
    # Keep only the object-dtype (string) columns, then show their summary as HTML.
    object_columns = df.columns[df.dtypes == "object"]
    display(HTML(df[object_columns].describe().to_html()))

describe_categorical(df_train)
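Outside a notebook, the same idea can be sketched without IPython by returning the summary instead of displaying it. The helper name describe_object_columns and the sample data are hypothetical; select_dtypes is used here in place of the boolean mask on dtypes, and the sample columns are built with dtype=object explicitly so the selection does not depend on how a given pandas version infers string columns:

```python
import pandas as pd

def describe_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return describe() output restricted to the object-dtype columns."""
    return df.select_dtypes(include=["object"]).describe()

# Hypothetical stand-in for df_train.
df = pd.DataFrame({
    "activity_category": pd.Series(["type 1", "type 2", "type 2"], dtype=object),
    "char_1": pd.Series(["type 3", "type 3", "type 5"], dtype=object),
    "outcome": [0, 1, 1],
})

summary = describe_object_columns(df)
print(summary)
```

Returning the DataFrame leaves the choice of rendering (print, to_html, or a notebook display) to the caller.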