How do I successfully create ordered categorical data from an existing column?

Question

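Here is a DataFrame, 'dist_copy', whose values I want to convert to categorical. I've included only one column here, but I want to convert other columns as well.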

state	dist_id	pct_free_reduced_lunch
Illinois	1111	80% - 100%
Illinois	1112	0 - 20%
Illinois	2365	40% - 60%

dist_copy.pct_free_reduced_lunch.unique()

returns

array(['80% - 100%', '60% - 80%', '0 - 20%', '20% - 40%', '40% - 60%'], dtype=object)

Previously, I used pd.Categorical to convert all the values in the 'pct_free_reduced_lunch' column to categorical and to set their order, with this code:

dist_copy['pct_free_reduced_lunch'] = pd.Categorical(dist_copy['pct_free_reduced_lunch'], 
     categories=['0 — 20%','20% — 40%', '40% — 60%', '60% — 80%',  '80% - 100%'], ordered=True)

Today this code does not work: it keeps only the first value and changes all the others to NaN.

state	dist_id	pct_free_reduced_lunch
Illinois	1111	80% - 100%
Illinois	1112	NaN
Illinois	2365	NaN

What am I doing wrong, or misunderstanding?

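Update: the code above started working after I copied each category value from the array returned by unique() and pasted it, in the desired order, into the categories list passed to pd.Categorical.

When I simply typed them from scratch, NaNs were created.

Why? I'd really like to know!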

Answer 1

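In the categories argument you typed em dashes (U+2014) instead of hyphens in every entry except '80% - 100%'. The data contains only hyphens, so those entries match nothing in the column, and every value other than '80% - 100%' becomes NaN. pd.Categorical sets any value that is not listed in categories to NaN without raising an error.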
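To make the difference visible, here is a minimal sketch that compares one typed category with the matching value from the data. The helper punctuation_names is a hypothetical name introduced for illustration, and the em dash is written as the escape \u2014 so it cannot be confused with a hyphen.

```python
import unicodedata

typed = '0 \u2014 20%'   # as typed in the categories list (em dash)
in_data = '0 - 20%'      # as it appears in the column (hyphen-minus)

# The two strings look alike on screen but are not equal
print(typed == in_data)

def punctuation_names(s):
    """Hypothetical helper: Unicode names of the non-alphanumeric, non-space characters."""
    return [unicodedata.name(ch) for ch in s if not ch.isalnum() and not ch.isspace()]

print(punctuation_names(typed))    # ['EM DASH', 'PERCENT SIGN']
print(punctuation_names(in_data))  # ['HYPHEN-MINUS', 'PERCENT SIGN']
```

Printing repr() of the values, or their Unicode names as above, is usually enough to spot look-alike characters that an editor or word processor substituted automatically.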

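Try categories=sorted(dist_copy.pct_free_reduced_lunch.unique()) to avoid this kind of typo. Note that sorted() orders the strings lexicographically; for these five labels that happens to match the intended order, but it will not in general.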
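When the desired order cannot be obtained by sorting, one option is to keep an explicit list and check it against the data before converting. The following is a sketch under assumptions: the three-row frame only mirrors the sample in the question, and the names order, unexpected, and n_missing are introduced here for illustration.

```python
import pandas as pd

# Small frame mirroring the sample rows from the question
dist_copy = pd.DataFrame({
    'state': ['Illinois'] * 3,
    'dist_id': [1111, 1112, 2365],
    'pct_free_reduced_lunch': ['80% - 100%', '0 - 20%', '40% - 60%'],
})

# Explicit order, typed with plain hyphens
order = ['0 - 20%', '20% - 40%', '40% - 60%', '60% - 80%', '80% - 100%']

# Fail early if any value in the column is missing from the category list,
# instead of letting pd.Categorical silently turn it into NaN
unexpected = set(dist_copy['pct_free_reduced_lunch'].dropna()) - set(order)
if unexpected:
    raise ValueError(f'values not covered by categories: {unexpected}')

dist_copy['pct_free_reduced_lunch'] = pd.Categorical(
    dist_copy['pct_free_reduced_lunch'], categories=order, ordered=True
)

# Count of NaN values introduced by the conversion; should be zero here
n_missing = int(dist_copy['pct_free_reduced_lunch'].isna().sum())
print(n_missing)
```

The set difference check turns a silent NaN into an explicit error, which makes a mistyped category show up immediately.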

python

pandas

categorical-data