Identifying the categorical columns of a dataframe

Question

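I am trying to identify the categorical columns of my dataset so that I can convert them to numerical columns. I have looked at several related posts, but I still seem to be doing something wrong.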


My code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


# Read the Churn data into a dataset (pandas) from the csv file
dataset = pd.read_csv(r'C:\Users\Amalie\IdeaProjects\INFO284\src\Lab2.csv')
print(dataset.head())

# Remove missing values (NaN's) from the dataset
ds = dataset.dropna()
columns = ds.columns.tolist()
# print(ds.dtypes())
print("\nColumns: {}".format(columns))

# Numerical columns
numericCols = ds._get_numeric_data().columns
print("Numerical: {}".format(numericCols))                  # 'SeniorCitizen', 'tenure', 'MonthlyCharges'

# Categorical columns
categorical = ds.select_dtypes(include=['category'])
print("Categorical: {}".format(categorical))

y = ds['Churn']          # Target
X = ds.drop('Churn', 1)  # Features ( all other than target column 'Churn' )

# Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)  # Split into test/training sets
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logReg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logReg.score(X_test, y_test)))

It gives me this output:

   customerID  gender  SeniorCitizen  ... MonthlyCharges TotalCharges  Churn
0  7590-VHVEG  Female              0  ...          29.85        29.85     No
1  5575-GNVDE    Male              0  ...          56.95       1889.5     No
2  3668-QPYBK    Male              0  ...          53.85       108.15    Yes
3  7795-CFOCW    Male              0  ...          42.30      1840.75     No
4  9237-HQITU  Female              0  ...          70.70       151.65    Yes

[5 rows x 21 columns]
C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py:26: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only
  X = ds.drop('Churn', 1)  # Features ( all other than target column 'Churn' )

Columns: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
Numerical: Index(['SeniorCitizen', 'tenure', 'MonthlyCharges'], dtype='object')
Categorical: Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[7043 rows x 0 columns]
Traceback (most recent call last):
  File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 30, in <module>
    logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\linear_model\_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 976, in check_X_y
    estimator=estimator,
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 746, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
    return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: '3428-MMGUB'

Process finished with exit code 1

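This means I get an empty dataframe at the line categorical = ds.select_dtypes(include=['category']), but I know there are categorical columns, because I get an error when I try to run logistic regression with the fit() method. Like this: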

# Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)  # Split into test/training sets
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)

The error I get:

Traceback (most recent call last):
  File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 30, in <module>
    logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\linear_model\_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 976, in check_X_y
    estimator=estimator,
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 746, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
    return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: '3428-MMGUB'

If I try to include print(ds.dtypes()) on line 14, I get the following output:

   customerID  gender  SeniorCitizen  ... MonthlyCharges TotalCharges  Churn
0  7590-VHVEG  Female              0  ...          29.85        29.85     No
1  5575-GNVDE    Male              0  ...          56.95       1889.5     No
2  3668-QPYBK    Male              0  ...          53.85       108.15    Yes
3  7795-CFOCW    Male              0  ...          42.30      1840.75     No
4  9237-HQITU  Female              0  ...          70.70       151.65    Yes

[5 rows x 21 columns]
Traceback (most recent call last):
  File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 14, in <module>
    print(ds.dtypes())
TypeError: 'Series' object is not callable

Process finished with exit code 1

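How can I solve this? What exactly am I doing wrong? I just want to do logistic regression, but I seem to be stuck at the first step of organizing the data.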

Answer 1

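Your independent features include categorical data. The error occurs because some of your columns contain strings, which cannot be interpreted as float when training the model.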
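To see which columns hold strings, one option is to inspect `dtypes` and split the columns by dtype. Note that `dtypes` is an attribute, so calling it as `ds.dtypes()` raises the `TypeError` shown in the question. Also, `select_dtypes(include=['category'])` only matches columns that were explicitly cast to the pandas `category` dtype, which would explain the empty result. The sketch below uses a small made-up frame (the values are hypothetical stand-ins for the CSV in the question):

```python
import pandas as pd

# Hypothetical toy data standing in for the churn CSV from the question.
ds = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "gender": ["Female", "Male", "Male"],
    "SeniorCitizen": [0, 0, 0],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["No", "No", "Yes"],
})

# dtypes is an attribute, so it is accessed without parentheses.
print(ds.dtypes)

# Columns that pandas parsed as numbers.
numeric_cols = ds.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric (string columns end up here).
categorical_cols = ds.select_dtypes(exclude="number").columns.tolist()

# Only columns explicitly cast to the category dtype would show up here.
explicit_category_cols = ds.select_dtypes(include="category").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
print("Explicit category dtype:", explicit_category_cols)
```

Selecting by `exclude="number"` rather than by a specific string dtype is a design choice here, so the sketch does not depend on how a given pandas version stores text columns.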

My suggestion is to use get_dummies.

This example may help you:

import pandas as pd

r = pd.DataFrame(['France', 'Japan', 'Spain', 'France', 'USA'], columns=['Country'])
r['gender'] = ['male', 'female', 'female', 'female', 'male']
# dtype=int keeps the 0/1 output on pandas >= 2.0, where the default dummy dtype is bool
r = pd.get_dummies(r, dtype=int)
print(r.head())

   Country_France  Country_Japan  ...  gender_female  gender_male
0               1              0  ...              0            1
1               0              1  ...              1            0
2               0              0  ...              1            0
3               1              0  ...              1            0
4               0              0  ...              0            1

[5 rows x 6 columns]

All categorical columns are converted automatically using one-hot encoding.

Once the categorical data has been converted, you can fit LogisticRegression.
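Putting the pieces together might look like the sketch below. It makes a few assumptions that are not confirmed by the question: the toy rows are made up, `customerID` is treated as a unique identifier and dropped rather than encoded (it is the value named in the `ValueError`), and `TotalCharges` is assumed to be numeric text with occasional blanks, since the question's output does not list it among the numerical columns.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the real frame would come from pd.read_csv.
ds = pd.DataFrame({
    "customerID": [f"id-{i}" for i in range(12)],
    "gender": ["Female", "Male"] * 6,
    "Contract": ["Monthly", "Yearly", "Monthly"] * 4,
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16],
    "TotalCharges": ["29.85", "1889.5", "108.15", "1840.75", "151.65", " ",
                     "1949.4", "301.9", "3046.05", "3487.95", "587.45", "326.8"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes",
              "No", "No", "Yes", "No", "No", "Yes"],
})

# An identifier is unique per row, so it is dropped rather than encoded.
ds = ds.drop(columns=["customerID"])

# Assumption: TotalCharges is numeric text; unparseable entries become NaN.
ds["TotalCharges"] = pd.to_numeric(ds["TotalCharges"], errors="coerce")
ds = ds.dropna()

y = ds["Churn"]                                            # Target
X = pd.get_dummies(ds.drop(columns=["Churn"]), dtype=int)  # One-hot encoded features

# stratify=y keeps both classes in the (tiny) training split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=20, stratify=y
)
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_score = log_reg.score(X_train, y_train)
test_score = log_reg.score(X_test, y_test)
print("Training set score: {:.3f}".format(train_score))
print("Test set score: {:.3f}".format(test_score))
```

Passing `columns=` to `drop` also avoids the `FutureWarning` about positional arguments that appears in the question's output.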

Tags: python, pandas, scikit-learn, categorical-data