Filtering on a categorical binned variable in pandas
I have a DataFrame named stroke_data_complete, in which I binned a variable using the following code:
#Cut into 4 bins of equal frequency counts
stroke_data_complete['glucose_level_quartile'] = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4)
stroke_data_complete['glucose_level_quartile'].value_counts();
When I check the data type of this new column:
stroke_data_complete['glucose_level_quartile'].dtypes
I get:
CategoricalDtype(categories=[(55.119, 77.245], (77.245, 91.885], (91.885, 114.09], (114.09, 271.74]],
ordered=True)
Next, I need to filter on one of the values of this new variable. This is my code:
stroke_data_complete.loc[stroke_data_complete.glucose_level_quartile==(114.09, 271.74]]
But I get the following error:
SyntaxError: closing parenthesis ']' does not match opening parenthesis '('
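If I wrap the value in quotes when filtering, I get empty output. Could I get some help on how to filter on this newly defined binned variable? Thanks.
Try this: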
stroke_data_complete['glucose_level_quartile'] = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4, labels=False)
stroke_data_complete.loc[stroke_data_complete.glucose_level_quartile==3]
Passing labels=False makes the column hold the integer index of each quartile bin (0 to 3) rather than the interval itself.
Edit
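To see what labels=False does, here is a minimal sketch on made-up data. The small DataFrame df and its eight glucose values are assumptions standing in for stroke_data_complete; only pd.qcut and the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for stroke_data_complete (values are invented)
df = pd.DataFrame(
    {"avg_glucose_level": [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 150.0, 250.0]}
)

# labels=False stores the bin number (0, 1, 2, 3) instead of an Interval
df["glucose_level_quartile"] = pd.qcut(df["avg_glucose_level"], q=4, labels=False)

# Plain integer comparison now works; 3 is the highest quartile
top_quartile = df.loc[df["glucose_level_quartile"] == 3]
print(top_quartile)
```

The trade-off is that the column no longer carries the bin edges, so keep the intervals separately if you need to report them.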
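Without labels=False, qcut returns a categorical Series. The underlying array is a Categorical (pandas.Categorical). That array is accessible through the Series.array attribute, and its API is described in the pandas documentation for pandas.Categorical.
In your case: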
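# Bin without labels=False, so the result is a categorical Series of intervals
quartiles = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4)
# Get the underlying Categorical array
quartiles = quartiles.array
# Codes run from 0 to 3; code 3 marks the rows in the top quartile
stroke_data_q_3 = stroke_data_complete.loc[quartiles.codes == 3]
# The Interval that defines that bin
avg_glucose_level_interval_q_3 = quartiles.categories[3]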
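If you would rather keep the interval column as it is, you can also compare against an actual pd.Interval object. The categories are Interval objects, not tuples or strings, which is probably why the quoted version matched nothing and the bare (114.09, 271.74] is not valid Python. The sketch below again uses an invented df in place of stroke_data_complete, and takes the interval from the categories rather than retyping its bounds, to avoid any float mismatch.

```python
import pandas as pd

# Hypothetical stand-in for stroke_data_complete (values are invented)
df = pd.DataFrame(
    {"avg_glucose_level": [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 150.0, 250.0]}
)
df["glucose_level_quartile"] = pd.qcut(df["avg_glucose_level"], q=4)

# Pull the last (highest) Interval out of the ordered categories
top_interval = df["glucose_level_quartile"].cat.categories[-1]

# Filter by comparing the column with the Interval object itself
by_interval = df.loc[df["glucose_level_quartile"] == top_interval]

# Equivalent filter through the integer codes of the categorical
by_code = df.loc[df["glucose_level_quartile"].cat.codes == 3]
print(by_interval)
```

Both filters select the same rows; the .cat accessor gives the same codes and categories as going through Series.array.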
Hope this helps.