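Identifying the categorical columns of a dataframe
I am trying to identify the categorical columns of a dataset so that I can convert them to numerical columns. I have already looked at several related questions, but I still seem to be doing something wrong.
Edited
My code: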
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
# Read the Churn data into a dataset (pandas) from the cvs file
dataset = pd.read_csv(r'C:\Users\Amalie\IdeaProjects\INFO284\src\Lab2.csv')
print(dataset.head())
# Remove missing values (NaN's) from the dataset
ds = dataset.dropna()
columns = ds.columns.tolist()
# print(ds.dtypes())
print("\nColumns: {}".format(columns))
# Numerical columns
numericCols = ds._get_numeric_data().columns
print("Numerical: {}".format(numericCols)) # 'SeniorCitizen', 'tenure', 'MonthlyCharges'
# Categorical columns
categorical = ds.select_dtypes(include=['category'])
print("Categorical: {}".format(categorical))
y = ds['Churn'] # Target
X = ds.drop('Churn', 1) # Features ( all other than target column 'Churn' )
# Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20) # Split into test/training sets
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logReg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logReg.score(X_test, y_test)))
It gives me this output:
customerID gender SeniorCitizen ... MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 ... 29.85 29.85 No
1 5575-GNVDE Male 0 ... 56.95 1889.5 No
2 3668-QPYBK Male 0 ... 53.85 108.15 Yes
3 7795-CFOCW Male 0 ... 42.30 1840.75 No
4 9237-HQITU Female 0 ... 70.70 151.65 Yes
[5 rows x 21 columns]
C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py:26: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only
X = ds.drop('Churn', 1) # Features ( all other than target column 'Churn' )
Columns: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
Numerical: Index(['SeniorCitizen', 'tenure', 'MonthlyCharges'], dtype='object')
Categorical: Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[7043 rows x 0 columns]
Traceback (most recent call last):
File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 30, in <module>
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\linear_model\_logistic.py", line 1514, in fit
accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\base.py", line 581, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 976, in check_X_y
estimator=estimator,
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 746, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: '3428-MMGUB'
Process finished with exit code 1
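This means I get an empty dataframe at the line categorical = ds.select_dtypes(include=['category']), but I know there are categorical columns, because I get an error when I try to run logistic regression with the fit() method.
Like this: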
# Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20) # Split into test/training sets
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
The error I get:
Traceback (most recent call last):
File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 30, in <module>
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\linear_model\_logistic.py", line 1514, in fit
accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\base.py", line 581, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 976, in check_X_y
estimator=estimator,
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 746, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: '3428-MMGUB'
If I try to include print(ds.dtypes()) on line 14, I get the following output:
customerID gender SeniorCitizen ... MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 ... 29.85 29.85 No
1 5575-GNVDE Male 0 ... 56.95 1889.5 No
2 3668-QPYBK Male 0 ... 53.85 108.15 Yes
3 7795-CFOCW Male 0 ... 42.30 1840.75 No
4 9237-HQITU Female 0 ... 70.70 151.65 Yes
[5 rows x 21 columns]
Traceback (most recent call last):
File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 14, in <module>
print(ds.dtypes())
TypeError: 'Series' object is not callable
Process finished with exit code 1
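How can I solve this? What exactly am I doing wrong? I only want to do logistic regression, but I seem to be stuck at the first step of organizing the data.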
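Your independent features include categorical data. The error occurs because some of your columns contain strings, which cannot be interpreted as float when training the model.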
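To see which columns those are, it may help to list the dtypes and split the columns by type. The following is only a minimal sketch on a small made-up frame (the values are copied from the head() output in the question, and the names numeric_cols and categorical_cols are just illustrative). Text columns read from a CSV are stored as strings, not as pandas 'category' dtype, which is why include=['category'] matched nothing. Note also that dtypes is an attribute, so it is written without parentheses.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the Churn data.
ds = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "gender": ["Female", "Male", "Male"],
    "SeniorCitizen": [0, 0, 0],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["No", "No", "Yes"],
})

# dtypes is an attribute, not a method: no parentheses.
print(ds.dtypes)

# Split columns into numeric and everything else (here: the text columns).
numeric_cols = ds.select_dtypes(include="number").columns.tolist()
categorical_cols = ds.select_dtypes(exclude="number").columns.tolist()

print("Numerical: {}".format(numeric_cols))
print("Categorical: {}".format(categorical_cols))
```

Selecting with exclude="number" is one possible choice here; it avoids depending on how a particular pandas version labels string columns.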
My suggestion is to use get_dummies.
This example may help you:
import pandas as pd
r = pd.DataFrame(['France','Japan','Spain','France','USA'],columns= ['Country'])
r['gendor'] = ['male','female','female','female','male']
r = pd.get_dummies(r)
r.head()
Country_France Country_Japan ... gendor_female gendor_male
0 1 0 ... 0 1
1 0 1 ... 1 0
2 0 0 ... 1 0
3 1 0 ... 1 0
4 0 0 ... 0 1
[5 rows x 6 columns]
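All categorical columns are converted automatically using one-hot encoding.
Once the categorical data has been converted, you can fit LogisticRegression.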
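Applied to data shaped like yours, the whole flow could look roughly like the sketch below. It uses a small invented frame rather than your CSV, so the values and the Contract categories are assumptions for illustration only. Two choices in it are mine and not part of the suggestion above: customerID is dropped before encoding, since it is an identifier (the '3428-MMGUB' in your traceback is one), and dtype=float is passed to get_dummies so that every feature column is numeric.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the Churn data (invented values).
ds = pd.DataFrame({
    "customerID": ["id-{}".format(i) for i in range(12)],
    "gender": ["Female", "Male"] * 6,
    "Contract": ["Monthly", "Yearly", "Monthly"] * 4,
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65,
                       89.10, 29.75, 104.80, 56.15, 49.95, 18.95],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes",
              "No", "No", "Yes", "No", "No", "No"],
})

y = ds["Churn"]  # Target
# Keyword form of drop avoids the FutureWarning; customerID is an ID, not a feature.
X = ds.drop(columns=["Churn", "customerID"])

# One-hot encode the remaining text columns; numeric columns pass through unchanged.
X = pd.get_dummies(X, dtype=float)
print(X.columns.tolist())

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logReg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logReg.score(X_test, y_test)))
```

One thing to check on the real data: TotalCharges is missing from your list of numerical columns, so it was presumably read as text. If so, converting it first, for example with pd.to_numeric(ds['TotalCharges'], errors='coerce'), would keep get_dummies from creating one column per distinct value.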