Python: ValueError: could not convert string to float: 'Isolated' when reading input file for applying random forest

I am trying to apply a random forest to the following input file:

gold,Program,Requirement,MethodType,Top,Side,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU
T,chess,1,Inner,T,T,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,
N,chess,2,Inner,N,N,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,
N,chess,3,Inner,N,N,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,
N,chess,4,Root,N,N,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,
N,chess,5,Inner,N,N,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,
N,chess,6,Root,N,N,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,
N,chess,7,Inner,N,N,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,
N,chess,8,Inner,N,N,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,
N,chess,1,Leaf,NU,N,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,
N,chess,2,Leaf,NU,N,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,
N,chess,3,Leaf,NU,N,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,
N,chess,4,Root,NU,N,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,
N,chess,5,Isolated,NU,N,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,
T,chess,6,Inner,TU,T,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,
T,chess,7,Isolated,TU,T,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,
N,chess,8,Inner,NU,N,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,
N,chess,1,Inner,TNU,N,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,
N,chess,2,Inner,NU,N,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,
N,chess,3,Inner,NU,N,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,
T,chess,4,Inner,NU,N,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,
N,chess,5,Leaf,NU,N,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,

This is the code I use to apply the random forest:

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
# Feature Scaling
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

X_train={}
X_test={}
y_train={}
y_test={}
dataset = pd.read_csv('dataExtended2.txt', sep=',')
# convert T into 1 and N into 0
dataset['gold'] = dataset['gold'].astype('category').cat.codes
dataset['Program'] = dataset['Program'].astype('category').cat.codes
dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
dataset['Top'] = dataset['Top'].astype('category').cat.codes
dataset['Side'] = dataset['Side'].astype('category').cat.codes
dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes
pd.set_option('display.max_columns', None)

print(dataset.head())
row_count, column_count = dataset.shape
   
X = dataset.iloc[:, 1:column_count].values
y = dataset.iloc[:, 0].values
Xcol = dataset.iloc[:, 1:column_count]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
   
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

I get the error ValueError: could not convert string to float: 'Isolated' when executing the last line of the code (X_train = sc.fit_transform(X_train)), even though I convert MethodType from strings to numeric codes with the line dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes. How can I fix this?

Here is the traceback of the error:

Traceback (most recent call last):

  File "<ipython-input-38-d7fe5c294c10>", line 1, in <module>
    runfile('C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python/RandomForestSimplified.py', wdir='C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python')

  File "C:\Users\mouna\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)

  File "C:\Users\mouna\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python/RandomForestSimplified.py", line 43, in <module>
    X_train = sc.fit_transform(X_train)

  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\base.py", line 517, in fit_transform
    return self.fit(X, **fit_params).transform(X)

  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 590, in fit
    return self.partial_fit(X, y)

  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 612, in partial_fit
    warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES)

  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: could not convert string to float: 'Isolated'

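OK, when you look at the output of print(dataset.head()), you can see that the 'gold' labels (T/N) have become the row index and that the 'Requirement' column still contains strings such as 'Inner'. This happens because pandas uses the first column as the index: every data row ends with a trailing comma, so each row has one more field than the header has names, and read_csv then treats the first field as the index. All remaining values shift one column to the left, so the MethodType strings end up under 'Requirement', which your code never converts.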

     gold  Program Requirement  MethodType  Top  Side  CallersT  CallersN  \
T     0        0       Inner           2    1     1         0         0
N     0        1       Inner           0    0     0         1         0
N     0        2       Inner           0    0     0         1         0
N     0        3        Root           0    0     0         1         0
N     0        4       Inner           0    0     0         1         0

   CallersU  CallersCallersT  CallersCallersN  CallersCallersU  CalleesT  \
T         1                0                0                1         0
N         0                1                0                0         1
N         0                1                0                0         1
N         0                1                0                0         1
N         0                1                0                0         1

   CalleesN  CalleesU  CalleesCalleesT  CalleesCalleesN  CalleesCalleesU
T         0         0                0                1               -1
N         0         0                0                1               -1
N         0         0                0                1               -1
N         0         0                0                1               -1
N         0         0                0                1               -1
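The shift can be reproduced without the real file. The following is a minimal sketch that assumes a made-up three-column sample (the `sample` string is hypothetical, not taken from dataExtended2.txt) whose rows end with a trailing comma like the rows in the question. It also lists the columns that still hold text, which is a quick way to find what StandardScaler will reject.

```python
import io

import pandas as pd

# Hypothetical three-column sample, not the real file. Each data row ends
# with a trailing comma, just like the rows in the question.
sample = "gold,Program,MethodType\nT,chess,Inner,\nN,chess,Isolated,\n"

shifted = pd.read_csv(io.StringIO(sample))

# The gold labels have become the index, and every value sits one column
# to the left of its header.
print(shifted.index.tolist())
print(shifted)

# Listing the non-numeric columns shows which ones still hold strings.
text_columns = [
    col for col in shifted.columns
    if not pd.api.types.is_numeric_dtype(shifted[col])
]
print(text_columns)
```

In this sample the last header (MethodType) receives only the empty field after the trailing comma, which matches the -1 codes shown for CalleesCalleesU in the output above.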

Solution:

dataset = pd.read_csv( 'dataExtended2.txt', sep= ',', index_col=False) 

The output will then be:

  gold  Program  Requirement  MethodType  Top  Side  CallersT  CallersN  \
0     1        0            1           0    2     1         1         0
1     0        0            2           0    0     0         0         1
2     0        0            3           0    0     0         0         1
3     0        0            4           3    0     0         0         1
4     0        0            5           0    0     0         0         1

   CallersU  CallersCallersT  CallersCallersN  CallersCallersU  CalleesT  \
0         0                1                0                0         1
1         0                0                1                0         0
2         0                0                1                0         0
3         0                0                1                0         0
4         0                0                1                0         0

   CalleesN  CalleesU  CalleesCalleesT  CalleesCalleesN  CalleesCalleesU
0         0         0                0                0                1
1         1         0                0                0                1
2         1         0                0                0                1
3         1         0                0                0                1
4         1         0                0                0                1
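As a rough end-to-end check, the sketch below applies index_col=False to a made-up three-column sample (again hypothetical, not the real file), encodes every non-numeric column in a loop instead of one line per column, and then converts the features to a float array. The float conversion stands in for the check that StandardScaler performs internally, so scikit-learn is not needed to run it; the loop is just one possible way to avoid missing a column such as Requirement.

```python
import io

import pandas as pd

# Hypothetical sample with trailing commas, standing in for the real file.
sample = "gold,Program,MethodType\nT,chess,Inner,\nN,chess,Isolated,\n"

# index_col=False stops pandas from using the first column as the index,
# so every value stays under its own header.
fixed = pd.read_csv(io.StringIO(sample), index_col=False)

# Encode every column that is not already numeric as integer category codes.
for col in fixed.columns:
    if not pd.api.types.is_numeric_dtype(fixed[col]):
        fixed[col] = fixed[col].astype("category").cat.codes

# Everything except the label column now converts to float without error.
features = fixed.iloc[:, 1:].to_numpy(dtype=float)
print(fixed)
print(features)
```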

More details are in the pandas documentation for CSV import: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html