How are levels of a categorical variable defined in Python?

Question

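I know that logistic regression uses 0 and 1 for the dependent variable. But when the variable is defined as a category, how are 0 and 1 assigned to "Healthy" and "Sick"? In other words, what is the reference level? Is "Healthy" assigned 0 because H comes first in the alphabet?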

Testing CSV

import pandas as pd
import numpy as np
import os
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# index_col=0 eliminates the dumb index column
baseball_train = pd.read_csv(r"baseball_train.csv",index_col=0,
                             dtype={'Opp': 'category', 'Result': 'category', 
                                    'Name': 'category'}, header=0)
baseball_test = pd.read_csv(r"baseball_test.csv",index_col=0,
                            dtype={'Opp': 'category', 'Result': 'category', 
                                   'Name': 'category'}, header=0)

# take all independent variables
X = baseball_train.iloc[:,:-1]
# drop opp and result because I don't want them
X = X.drop(['Opp','Result'],axis=1)
# dependent variable
y = baseball_train.iloc[:,-1]

# Create logistic regression
logit = LogisticRegression(fit_intercept=True)

model = logit.fit(X,y)

Here, Name is the dependent variable, with the categories "Nolan" and "Tom" rather than 0s and 1s.

Answer 1

You just need to know a priori how to interpret 1 and 0.

The following tutorial walks through a good example of working with categorical data: https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
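Rather than guessing which label is treated as 1, you can ask the fitted model. The following is a minimal sketch with made-up toy data (the feature values and labels are assumptions for illustration, not the asker's baseball data). It relies on the fact that LogisticRegression accepts string labels directly and stores them, sorted, in classes_.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two string-valued classes.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array(["Sick", "Sick", "Sick", "Healthy", "Healthy", "Healthy"])

model = LogisticRegression().fit(X, y)

# classes_ is sorted, so "Healthy" is index 0 and "Sick" is index 1,
# regardless of the order in which the labels appear in y.
print(model.classes_)

# For a binary problem, coef_ describes the log-odds of classes_[1] ("Sick").
# Larger x goes with "Healthy" here, so the coefficient comes out negative.
print(model.coef_)

# The columns of predict_proba follow the order of classes_.
proba = model.predict_proba([[0.0]])
print(dict(zip(model.classes_, proba[0])))
```

In other words, the class at classes_[0] plays the role of the reference level (0), and the class at classes_[1] is the one whose probability the coefficients model.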

Answer 2

If you read and encode your data with pandas, the categories are sorted (just as in sklearn, see below).

import pandas as pd
import io

txt = """
HR,HBP,Name
0,0,Tommy
0,1,Nolan
0,2,Tommy
1,1,Nolan"""

df = pd.read_csv(io.StringIO(txt), dtype={'Name': 'category'})
print(df)

   HR  HBP   Name
0   0    0  Tommy
1   0    1  Nolan
2   0    2  Tommy
3   1    1  Nolan

If you look at the codes, you will see that although Tommy is listed first, it is encoded as 1, while Nolan gets 0.

print(df.Name.cat.codes)

0    1
1    0
2    1
3    0
dtype: int8

If you want to get the whole mapping as a dictionary:

encoded_categories = dict(enumerate(df.Name.cat.categories))
print(encoded_categories)

{0: 'Nolan', 1: 'Tommy'}
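If the sorted order is not the one you want, one option is to declare the category order yourself. This is a sketch that assumes you want Tommy as code 0. The name name_dtype is made up for this example, and the data is the same four-row sample as above.

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

txt = """HR,HBP,Name
0,0,Tommy
0,1,Nolan
0,2,Tommy
1,1,Nolan"""

# Explicit category order: Tommy becomes code 0, Nolan becomes code 1.
name_dtype = CategoricalDtype(categories=["Tommy", "Nolan"])
df = pd.read_csv(io.StringIO(txt), dtype={"Name": name_dtype})

print(list(df.Name.cat.categories))  # ['Tommy', 'Nolan']
print(df.Name.cat.codes.tolist())    # [0, 1, 0, 1]
```

Note that this only controls the pandas codes. A scikit-learn estimator that is handed the string labels sorts them on its own, so the custom order only takes effect if you pass df.Name.cat.codes as the target.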

Initial answer

You tagged the question with scikit-learn, so I assume you are using LabelEncoder from sklearn.preprocessing. In that case, the values are indeed sorted.

Simple example

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

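fit calls _encode, which sorts the values when given a Python list or tuple (or anything other than a numpy array). A numpy array ends up sorted as well, via numpy.unique. (_encode is an internal helper, so its name may differ between scikit-learn versions.)

You can check this with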

print(le.classes_)
>> ['amsterdam' 'paris' 'tokyo']
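The numpy.unique behavior mentioned above can be seen without scikit-learn at all. This small sketch shows that it returns the distinct values in sorted order, and that return_inverse=True yields the integer code of each original element.

```python
import numpy as np

values = np.array(["paris", "paris", "tokyo", "amsterdam"])

# classes holds the sorted distinct values; codes maps each input to its index.
classes, codes = np.unique(values, return_inverse=True)

print(classes.tolist())  # ['amsterdam', 'paris', 'tokyo']
print(codes.tolist())    # [1, 1, 2, 0]
```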

So in your case

np.array_equal(le.fit(["healthy", "sick"]).classes_, 
               le.fit(["sick", "healthy"]).classes_)
>> True

np.array_equal(le.fit(["healthy", "sick"]).classes_, 
               le.fit(["sick", "healthy", "unknown"]).classes_)
>> False
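To see the actual integers rather than just the class list, a short sketch along the same lines (the variable names codes and mapping are chosen for this example):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# "sick" is listed first, but the sorted order still puts "healthy" at 0.
codes = le.fit_transform(["sick", "healthy", "sick", "healthy"])

print(le.classes_.tolist())  # ['healthy', 'sick']
print(codes.tolist())        # [1, 0, 1, 0]

# Build a label -> code lookup, and map codes back to labels.
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
print(mapping)               # {'healthy': 0, 'sick': 1}
print(le.inverse_transform([0, 1]).tolist())  # ['healthy', 'sick']
```

So with LabelEncoder, "healthy" gets 0 because it sorts first, not because of where it appears in the data.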

python

scikit-learn

logistic-regression