Filtering on a categorical binned variable in pandas

Question

I have a DataFrame named stroke_data_complete in which I binned a variable using the following code:

#Cut into 4 bins of equal frequency counts
stroke_data_complete['glucose_level_quartile'] = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4)
stroke_data_complete['glucose_level_quartile'].value_counts();

When I check the data type of this new column:

stroke_data_complete['glucose_level_quartile'].dtypes

I get:

CategoricalDtype(categories=[(55.119, 77.245], (77.245, 91.885], (91.885, 114.09], (114.09, 271.74]],
          ordered=True)

Next, I need to filter on one of the values of this new variable. This is my code:

stroke_data_complete.loc[stroke_data_complete.glucose_level_quartile==(114.09, 271.74]]

But I get the following error:

SyntaxError: closing parenthesis ']' does not match opening parenthesis '('

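If I wrap the interval in quotes when filtering, I get an empty result. Could I get some help on how to filter on this newly defined binned variable? Thanks.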

Answer 1

Try this:

stroke_data_complete['glucose_level_quartile'] = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4, labels=False)
stroke_data_complete.loc[stroke_data_complete.glucose_level_quartile==3]

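labels=False makes the column hold the integer index of each quartile (0 to 3) rather than the interval itself, so the top quartile can be selected with == 3.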
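One drawback of labels=False is that the interval bounds are no longer stored in the column. A minimal sketch of one way to keep them, assuming a small synthetic frame in place of stroke_data_complete (the name df and its data are made up for illustration), is to also pass retbins=True:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for stroke_data_complete: 100 evenly spaced values.
df = pd.DataFrame({"avg_glucose_level": np.arange(1, 101, dtype=float)})

# labels=False yields integer bin indices 0..3;
# retbins=True additionally returns the array of bin edges.
df["glucose_level_quartile"], bin_edges = pd.qcut(
    df["avg_glucose_level"], q=4, labels=False, retbins=True
)

# Rows in the highest quartile.
top_quartile = df.loc[df["glucose_level_quartile"] == 3]

# Bounds of the highest quartile, taken from the returned edges.
top_lower, top_upper = bin_edges[3], bin_edges[4]
```

With q=4 there are five edges, so bin i spans bin_edges[i] to bin_edges[i + 1].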

Edit

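Without labels=False, qcut returns a categorical Series. The underlying array is a pandas Categorical, which is accessible through the Series.array attribute; its API is described in the pandas documentation.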

In your case:

quartiles = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4)
quartiles = quartiles.array
stroke_data_q_3 = stroke_data_complete.loc[quartiles.codes == 3]
avg_glucose_level_interval_q_3 = quartiles.categories[3]
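If the binned column already exists in the DataFrame, the same idea can be applied through the .cat accessor without recomputing the quartiles. The sketch below again uses a made-up frame named df as a stand-in for stroke_data_complete. It also shows why the original attempts failed: (114.09, 271.74] is not valid Python syntax, and a quoted string is not one of the categories, because the categories are pd.Interval objects.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for stroke_data_complete.
df = pd.DataFrame({"avg_glucose_level": np.arange(1, 101, dtype=float)})
df["glucose_level_quartile"] = pd.qcut(df["avg_glucose_level"], q=4)

# Option 1: compare the integer category codes (0..3; -1 marks missing values).
top_by_code = df.loc[df["glucose_level_quartile"].cat.codes == 3]

# Option 2: take the Interval object from the categories instead of typing it,
# then compare the column against it.
top_interval = df["glucose_level_quartile"].cat.categories[-1]
top_by_interval = df.loc[df["glucose_level_quartile"] == top_interval]
```

Both options select the same rows; taking the interval from .cat.categories avoids retyping the rounded bounds by hand.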

Hope this helps.
