I am stuck at encoding CSV dataset (String columns) to Training data

Question

I am trying to encode the string columns of my dataframe (loaded from a CSV file) so that they match my test_data[features].

My code is as follows:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Set option to display all the rows and columns in the dataset. If there are more rows, adjust number accordingly.
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Pandas needs the date column to be named before the file is imported (via parse_dates),
# hence this step.
date_col = ['Date']
df = pd.read_csv(
    r'C:\Users\harsh\Documents\My Dream\Desktop\Machine Learning\Attempt1\Historical Data\Concat_Cleaned.csv'
    , parse_dates=date_col, skiprows=0, low_memory=False)

# Converting/defining the columns
# Before you define column types, you need to fill all NaN with a value. We will be reconverting them later
df = df.fillna(101)
# Defining column types
convert_dict = {'League_Division': str,
                'HomeTeam': str,
                'AwayTeam': str,
                'Full_Time_Home_Goals': int,
                'Full_Time_Away_Goals': int,
                'Full_Time_Result': str,
                'Half_Time_Home_Goals': int,
                'Half_Time_Away_Goals': int,
                'Half_Time_Result': str,
                'Attendance': int,
                'Referee': str,
                'Home_Team_Shots': int,
                'Away_Team_Shots': int,
                'Home_Team_Shots_on_Target': int,
                'Away_Team_Shots_on_Target': int,
                'Home_Team_Hit_Woodwork': int,
                'Away_Team_Hit_Woodwork': int,
                'Home_Team_Corners': int,
                'Away_Team_Corners': int,
                'Home_Team_Fouls': int,
                'Away_Team_Fouls': int,
                'Home_Offsides': int,
                'Away_Offsides': int,
                'Home_Team_Yellow_Cards': int,
                'Away_Team_Yellow_Cards': int,
                'Home_Team_Red_Cards': int,
                'Away_Team_Red_Cards': int,
                'Home_Team_Bookings_Points': float,
                'Away_Team_Bookings_Points': float,
                }

df = df.astype(convert_dict)

# Reverting the fill step to get the original dataframe back, now with the defined column types
# (np.nan is used because the np.NAN alias was removed in NumPy 2.0)
df = df.replace('101', np.nan, regex=True)
df = df.replace(101, np.nan, regex=True)

# Clean dataset by dropping null rows
data = df.dropna(axis=0)

# Column that you want to predict = y
y = data.Full_Time_Home_Goals

# Columns that are fed into the model to make predictions (independent variables); cannot include column y
features = ['HomeTeam', 'AwayTeam', 'Full_Time_Away_Goals', 'Full_Time_Result']
# Create X
X = data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
soccer_model = DecisionTreeRegressor(random_state=1)

# Define and fit OneHotEncoder to transform categorical (string) data to a numeric array
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_X, train_y)

transformed_train_X = enc.transform(train_X)
transformed_val_X = enc.transform(val_X)

# Fit Model
soccer_model.fit(transformed_train_X, train_y)

#  Make validation predictions and calculate mean absolute error
val_predictions = soccer_model.predict(transformed_val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE when not specifying max_leaf_nodes : {:,.5f}".format(val_mae))

# Using best value for max_leaf_nodes
data_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
data_model.fit(transformed_train_X, train_y)
val_predictions = data_model.predict(transformed_val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE for best value of max_leaf_nodes : {:,.5f}".format(val_mae))

# Build a Random Forest model and train it on all of X and y.
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor()
# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(transformed_train_X, train_y)

# path to file you will use for predictions
date_col_n = ['Date']
test_data = pd.read_csv(
    r'C:\Users\harsh\Documents\My Dream\Desktop\Machine Learning\Attempt1\EPL_2021_Timetable.csv'
    , parse_dates=date_col_n, skiprows=0, low_memory=False)
# Define columns we want to use for prediction
columns = ['Home_Team', 'Away_Team']
test_data = test_data[columns]
# Renaming Column Names to match with training dataset
test_data = test_data.rename({'Home_Team': 'HomeTeam', 'Away_Team': 'AwayTeam'}, axis=1)
# Adding NaN columns to dataset to match the training dataset
test_data['Full_Time_Result'] = np.nan
test_data['Full_Time_Away_Goals'] = np.nan

# Encoding the string columns
enc.fit('HomeTeam', 'AwayTeam')
HomeTeam = enc.transform(HomeTeam)
AwayTeam = enc.transform(AwayTeam)


# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features]
# make predictions which we will submit.
test_preds = rf_model_on_full_data.predict(test_X)

P.S. I have included some extra code just to show the direction of what I am trying to achieve.

I get an error at enc.fit('HomeTeam', 'AwayTeam'), and I am not sure how to include these columns in my [features] dataframe.

The error I get is:

ValueError: Expected 2D array, got scalar array instead: array=HomeTeam. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Please find my sample training dataset here and my dataset for prediction here.

Answer 1

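The mistake I made was trying to fit the encoder on the columns directly, without first splitting the data into training and validation sets. This is the code that can be used for fitting: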

test_data_features = test_data[features]
# Filling all NA values as Encoder cannot handle nan values
df = test_data.fillna(1)

# Define Y for Fitting
Y = df

# We need nY as that would be the column used for splitting
ny = df.Full_Time_Home_Goals

# The dataset has to be encoded and transformed, which is why all NaN values were converted to 1 above.
# To avoid confusion with the earlier val_X variables, the new splits use an n_ infix.
train_n_X, val_n_X, train_n_y, val_n_y = train_test_split(Y, ny, random_state=1)

# Since we have text again, we will need fitting and transforming the data
enc.fit(train_n_X, train_n_y)
transformed_train_n_X = enc.transform(train_n_X)
transformed_val_n_X = enc.transform(val_n_X)

# Fitting and then we will be using predict
rf_model_on_full_data.fit(transformed_train_n_X, train_n_y)
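
For comparison, here is a minimal sketch of the same idea on made-up data. It assumes that only HomeTeam and AwayTeam are used as features, since those are the only columns known before a match is played. The team names, goal counts, the fixtures variable and the n_estimators value are hypothetical and not taken from the original files. The point it illustrates is that OneHotEncoder.fit expects a 2D table such as a DataFrame rather than column names passed as strings, and that the encoder fitted on the training data can then be reused to transform the prediction data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the historical results CSV (made-up rows).
train = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Everton", "Arsenal"],
    "AwayTeam": ["Chelsea", "Everton", "Arsenal", "Everton"],
    "Full_Time_Home_Goals": [2, 1, 0, 3],
})

# Hypothetical stand-in for the fixture list; "Leeds" never appears in training.
fixtures = pd.DataFrame({
    "HomeTeam": ["Chelsea", "Leeds"],
    "AwayTeam": ["Arsenal", "Everton"],
})

# Assumption: only columns that are known before kick-off are used as features.
features = ["HomeTeam", "AwayTeam"]

# fit() needs a 2D table (here a DataFrame), not the column names as strings.
enc = OneHotEncoder(handle_unknown="ignore")
train_encoded = enc.fit_transform(train[features])

model = RandomForestRegressor(n_estimators=10, random_state=1)
model.fit(train_encoded, train["Full_Time_Home_Goals"])

# Reuse the already fitted encoder; fit() is not called again on the fixtures.
# Teams unseen during training are encoded as all zeros because of handle_unknown="ignore".
fixtures_encoded = enc.transform(fixtures[features])
test_preds = model.predict(fixtures_encoded)
print(test_preds)
```

Because the encoder is fitted only once, the encoded fixture list has the same columns as the encoded training data, which is what the model's predict call requires.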

string

predict

python-3.x

sklearn-pandas

one-hot-encoding