What to pass to clf.predict()?

Question

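I have recently started playing with decision trees and wanted to train my own simple model on some manufactured data. I wanted to use this model to predict some further mock data, just to get a feel for how it works, but then I got stuck. Once the model is trained, how do you pass data to predict()?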

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

The documentation states: clf.predict(X)

Parameters: X : array-like or sparse matrix of shape [n_samples, n_features]

But when I try to pass an np.array, np.ndarray, list, tuple, or DataFrame, it just throws an error. Can you help me understand why?

The code is as follows:

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import graphviz
import pandas as pd
import numpy as np
import random
from sklearn import tree

pd.options.display.max_seq_items=5000
pd.options.display.max_rows=20
pd.options.display.max_columns=150

length = 50000

miles_commuting = [random.choice([2,3,4,5,7,10,20,25,30]) for x in range(length)]
salary = [random.choice([1300,1600,1800,1900,2300,2500,2700,3300,4000]) for x in range(length)]
full_time = [random.choice([1,0,1,1,0,1]) for x in range(length)]

DataFrame = pd.DataFrame({'CommuteInMiles':miles_commuting,'Salary':salary,'FullTimeEmployee':full_time})

DataFrame['Moving'] = np.where((DataFrame.CommuteInMiles > 20) & (DataFrame.Salary > 2000) & (DataFrame.FullTimeEmployee == 1),1,0)
DataFrame['TargetLabel'] = np.where((DataFrame.Moving == 1),'Considering move','Not moving')

target = DataFrame.loc[:,'Moving']
data = DataFrame.loc[:,['CommuteInMiles','Salary','FullTimeEmployee']]
target_names = DataFrame.TargetLabel
features = data.columns.values

clf = tree.DecisionTreeClassifier()
clf = clf.fit(data, target)

clf.predict(?????) #### <===== What should go here?

clf.predict([30,4000,1])

ValueError: Expected 2D array, got 1D array instead:
array=[3.e+01 4.e+03 1.e+00].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

clf.predict(np.array(30,4000,1))

ValueError: only 2 non-keyword arguments accepted

Answer 1

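Where is the "mock data" you want to predict?

Your data must have the same number of columns (features) as the data used in the call to fit(). From the code above, I see that your X has three columns, ['CommuteInMiles','Salary','FullTimeEmployee']. Your prediction data needs exactly those columns, in the same order; the number of rows can be arbitrary.

Now when you do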

clf.predict([30,4000,1])

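the model cannot tell whether these values are the columns of a single row or data from several different rows.

So you need to convert it to a two-dimensional array, where each inner array represents a single row.

Do this: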

clf.predict([[30,4000,1]])     #<== Observe the two square brackets

You can predict multiple rows at once, with each row in its own inner list. Like this:

X_test = [[30,4000,1],
          [35,15000,0],
          [40,2000,1],]
clf.predict(X_test)
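The same idea can be sketched end to end. The tiny training set and labels below are made up for illustration (they are not the 50,000-row frame from the question), and the names `train`, `labels`, `single`, and `batch` are hypothetical. Because the model in the question was fitted on a DataFrame, recent scikit-learn releases may warn about missing feature names when given plain lists, so this sketch wraps the prediction inputs in DataFrames with the same columns.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Made-up training data using the same three feature columns as the question.
columns = ['CommuteInMiles', 'Salary', 'FullTimeEmployee']
train = pd.DataFrame(
    [[30, 4000, 1], [25, 3300, 1], [5, 1300, 0],
     [10, 2500, 1], [30, 1600, 1], [25, 2700, 0]],
    columns=columns,
)
labels = [1, 1, 0, 0, 0, 0]  # 1 = considering a move, 0 = not moving

clf = tree.DecisionTreeClassifier(random_state=0).fit(train, labels)

# One sample held in a 1-D array: reshape it to (1, n_features) first.
sample = np.array([30, 4000, 1])
single = pd.DataFrame(sample.reshape(1, -1), columns=columns)

# Several samples: one row per sample, same column order as in training.
batch = pd.DataFrame([[30, 4000, 1], [5, 1300, 0]], columns=columns)

single_pred = clf.predict(single)  # array with one prediction
batch_pred = clf.predict(batch)    # array with one prediction per row
print(single_pred, batch_pred)
```

In both calls the input has shape (n_samples, 3), and predict() returns one label per row.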

As for your last error, clf.predict(np.array(30,4000,1)), it has nothing to do with predict(). You are using np.array() incorrectly.

According to the documentation, the signature of np.array is:

(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

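Apart from the first one (object), the others are intended to be passed as keyword arguments, and NumPy accepts at most two positional arguments, as the error message says. When you write np.array(30,4000,1), each value is treated as a separate argument: object=30, dtype=4000, copy=1. That is not allowed, hence the error. If you want to create a numpy array from a list, you need to pass the list itself.

Like this: np.array([30,4000,1]). Now this will be correctly treated as the input for the object parameter.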
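A short sketch of the difference, using only NumPy. The variable names are hypothetical, and the exact exception class raised by the incorrect call may differ between NumPy versions, so the sketch catches both TypeError and ValueError.

```python
import numpy as np

# A list passed as the single `object` argument gives a 1-D array.
row = np.array([30, 4000, 1])
print(row.shape)            # (3,)

# reshape(1, -1) turns it into one sample with three features.
as_one_sample = row.reshape(1, -1)
print(as_one_sample.shape)  # (1, 3)

# A nested list produces the 2-D shape directly.
nested = np.array([[30, 4000, 1]])
print(nested.shape)         # (1, 3)

# Passing the values as separate positional arguments is rejected.
try:
    np.array(30, 4000, 1)
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)             # True
```

Either `as_one_sample` or `nested` has the two-dimensional shape that predict() expects for a single row.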

classification

python-3.x

scikit-learn

sklearn-pandas