What does X stand for in imputer = imputer.fit(X[:,1:3]), and what does imputer.fit(X[:,1:3]) mean?

Question

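I am preprocessing a dataset and I get an error on the line imputer = imputer.fit(X[:,1:3]), which I don't understand. I understand that imputer = Imputer(missing_values = "NaN", strategy = "mean") means the missing values are replaced with the mean of the column. But are we then trying to fit the data into a model? That is the part I don't understand.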


import pandas as pd
from sklearn import svm
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import Imputer
import seaborn as sns; sns.set(font_scale=1.2)

stock=pd.read_csv("C:/Users/Dulangi/Downloads/winequality-red.csv")
stock.head()

g=sns.lmplot('alcohol','quality',data=stock,height=7, truncate=True, scatter_kws={"s":100})
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)

imputer = imputer.fit(X[:,1:3])

The error I get:


NameError                                 Traceback (most recent call last)
<ipython-input-4-620c08822929> in <module>
     14 imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
     15 
---> 16 imputer = imputer.fit(X[:,1:3])

NameError: name 'X' is not defined

Answer 1

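What do these mean?
+ SimpleImputer.fit(X_train),
+ SimpleImputer.transform(X_valid) or SimpleImputer.transform(X_test)

Let me try to answer that first:

The imputer essentially finds the missing values and replaces them according to the chosen strategy. In the code sample below I use strategy='mean', which means that, given some data X_train, it computes the mean of each column and replaces the missing values in each column with that column's mean.

Once you have run SimpleImputer.fit(X_train), the mean values used for imputation have already been computed and stored. When you then apply SimpleImputer.transform(X_test), the missing values in X_test are filled with those previously computed means.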
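The split between fit and transform can be sketched as follows. This is only a minimal illustration: the small arrays are made up and are not taken from the question's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two columns, with missing values in each split
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [5.0, 30.0]])
X_test = np.array([[np.nan, 40.0],
                   [7.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() only learns the per-column means from the training data
imputer.fit(X_train)
train_means = imputer.statistics_  # learned column means

# transform() fills gaps in new data with the stored training means,
# not with means recomputed from X_test
X_test_filled = imputer.transform(X_test)
```

Inspecting train_means after fit shows exactly which values transform will later write into the gaps.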

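Solution

You appear to be importing Imputer from sklearn.preprocessing. According to the sklearn 0.21.3 documentation, sklearn.preprocessing.Imputer should no longer be used: the class was deprecated in version 0.20 and removed in 0.22.

Use this instead: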

sklearn.impute.SimpleImputer

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Imputation
my_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

Some helpful resources:

I encourage you to check out these resources.

Answer 2

The problem was solved by defining X as follows. y can also be introduced alongside it for the imputation step:

X = stock.iloc[:, 0:5].values
y = stock.iloc[:, 5].values
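As a rough sketch of what defining X and y this way produces, the snippet below uses a made-up three-column DataFrame in place of the wine-quality file. The column names and values are illustrative assumptions, and the slice positions differ from the answer's 0:5 and 5 only because the toy frame is smaller.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from the CSV
stock = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, np.nan],
    "alcohol": [9.4, np.nan, 9.8],
    "quality": [5, 5, 6],
})

# Features: every column except the last; target: the last column
X = stock.iloc[:, :-1].values  # 2-D NumPy array, one row per sample
y = stock.iloc[:, -1].values   # 1-D NumPy array of labels
```

Once X exists as a NumPy array, an expression such as X[:, 1:3] is valid and the NameError from the question no longer occurs.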

Answer 3

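We use the imputer from the scikit-learn library to fill in missing values, using the mean or mode of the column in question.

In [:,1:3], the part to the left of the comma selects rows: a bare : selects all rows in the dataset. You can also give a range instead of :. For example, 1:10 selects the rows at indices 1 through 9, because the end index is excluded.

The part to the right of the comma selects columns: 1:3 selects the columns at indices 1 and 2, that is, the second and third columns. A bare : here would select all columns.

fit then computes the mean or mode values on the training dataset, according to the strategy we chose, and stores them. transform later uses those stored values to fill in missing values, including in the test data.

Refer to these for a better idea: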
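A small NumPy sketch may make the slicing concrete. The array is made up for illustration, with each value encoding its row and column. Note that the end index of a slice is exclusive.

```python
import numpy as np

# Hypothetical 4x4 array; each value is row * 10 + column
X = np.array([[ 0,  1,  2,  3],
              [10, 11, 12, 13],
              [20, 21, 22, 23],
              [30, 31, 32, 33]])

all_rows_cols_1_2 = X[:, 1:3]  # all rows, columns at index 1 and 2 (3 is excluded)
rows_1_2_all_cols = X[1:3, :]  # rows at index 1 and 2, all columns
```

So X[:, 1:3] in the question passes two columns, not three, to the imputer.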
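Putting the slice and the imputer together, one possible way to fill only columns 1 and 2 and write the result back into X is sketched below. The array is invented for illustration, and SimpleImputer is used as an assumption in place of the older Imputer class from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical array: column 0 is left alone, columns 1 and 2 have gaps
X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 4.0, 8.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(X[:, 1:3])          # learn the means of columns 1 and 2
X[:, 1:3] = imputer.transform(X[:, 1:3])  # write the filled values back into X
```

fit returns the imputer itself, which is why the assignment imputer = imputer.fit(...) from the question works.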

https://www.youtube.com/watch?v=fCMrO_VzeL8&t=515s

https://www.youtube.com/watch?v=oH3wYKvwpJ8&t=1s

https://medium.com/@kanchanardj/jargon-in-python-used-in-data-science-to-laymans-language-part-two-98787cce0928

Answer 4

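You have to assign the dataset values to the variable X, as shown below.

If you are running a notebook kernel, make sure the variables have not been reset (for example, by a kernel restart) before you run this line.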

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(font_scale=1.2)
from sklearn import svm
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in sklearn 0.22
from sklearn.impute import SimpleImputer

%matplotlib inline

stock = pd.read_csv("C:/Users/Dulangi/Downloads/winequality-red.csv")
stock.head()

g = sns.lmplot(x='alcohol', y='quality', data=stock, height=7, truncate=True, scatter_kws={"s": 100})

# Define X before using it: every column except the last, as a NumPy array
X = stock.iloc[:, :-1].values
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Learn the means of the columns at index 1 and 2
imputer = imputer.fit(X[:, 1:3])

python-3.x

pandas

data-science

sklearn-pandas
