Is there a way to use lists as values in a DataFrame?
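I am working on the well-known Kaggle challenge "House Prices".
I want to train on my dataset with LinearRegression from sklearn.linear_model.
After reading the following article:
https://developers.google.com/machine-learning/crash-course/representation/feature-engineering
I wrote a function that converts all string values in my training DataFrame into lists.
For example, a feature whose original values are [Ex, Gd, Ta, Po] ends up with the values [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1] respectively.
When I try to train on my data, I get the following error: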
Traceback (most recent call last):
  File "C:/Users/Owner/PycharmProjects/HousePrices/main.py", line 27, in <module>
    linereg.fit(train_df, target)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\linear_model\base.py", line 458, in fit
    y_numeric=True, multi_output=True)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
    estimator=estimator)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\utils\validation.py", line 567, in check_array
    array = array.astype(np.float64)
ValueError: setting an array element with a sequence.
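This only happens when I convert some of the columns as described above.
Is there any way to train a linear regression model with vectors as feature values?
Here is my conversion function: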
def feature_to_boolean_vector(df, feature_name, new_name):
    vectors_list = []  # each tuple will represent an option
    feature_options = df[feature_name].unique()
    feature_options_length = len(feature_options)
    # creating a list the size of feature_options_length, all 0's
    list_to_be_vector = [0 for i in range(feature_options_length)]
    for i in range(feature_options_length):
        list_to_be_vector[i] = 1  # inserting 1 representing option number i
        vectors_list.append(list_to_be_vector.copy())
        list_to_be_vector[i] = 0
    mapping = dict(zip(feature_options, vectors_list))  # dict from values to vectors
    df[new_name] = df[feature_name].map(mapping)
    df.drop([feature_name], axis=1, inplace=True)
This is my training attempt (after preprocessing):
from sklearn.linear_model import LinearRegression

linereg = LinearRegression()
linereg.fit(train_df, target)
Thanks in advance.
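LinearRegression does not support list-valued features. As the traceback shows, check_array tries to cast the whole input to np.float64, and a cell that holds a list cannot be cast to a single float. I see you are doing one-hot encoding, so each dimension of the vector should be its own feature column. A simpler way to get that is pd.get_dummies in pandas.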
print(df['feature'])
0    Ex
1    Gd
2    Ta
3    Po
Name: feature, dtype: object
df = pd.get_dummies(df['feature'])
print(df)
   Ex  Gd  Po  Ta
0   1   0   0   0
1   0   1   0   0
2   0   0   0   1
3   0   0   1   0
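To tie this back to the training step, here is a minimal sketch of the whole flow. The toy data, the column names LotArea and ExterQual, and the target values are made up for illustration and are not taken from the real dataset. It assumes a pandas version where get_dummies accepts the columns and dtype arguments. The last part shows one possible way to expand a column that already holds lists (such as the output of feature_to_boolean_vector) into one column per position, in case you want to keep your own encoding.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: column names and values are hypothetical, for illustration only.
train_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "ExterQual": ["Gd", "TA", "Gd", "TA", "Ex", "Fa"],
})
target = pd.Series([208500, 181500, 223500, 140000, 250000, 143000])

# One-hot encode the string column: each category becomes its own numeric column.
encoded = pd.get_dummies(train_df, columns=["ExterQual"], dtype=float)

# Every cell is now a single number, so the cast to float64 inside fit succeeds.
linereg = LinearRegression()
linereg.fit(encoded, target)

# Alternative: a column that already holds lists can be expanded into
# one column per list position instead of being re-encoded.
list_df = pd.DataFrame({"vec": [[1, 0, 0], [0, 1, 0], [0, 0, 1]]})
expanded = pd.DataFrame(list_df["vec"].tolist(), index=list_df.index).add_prefix("vec_")
```

Either way the model ends up seeing one numeric column per category, which is what the one-hot representation means for a linear model. Keep in mind that the test set has to be encoded into the same columns as the training set.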