Converting a Pandas DataFrame to a NumPy array gives a number-of-features error when trying to predict?
I have a TPOT regressor set up to predict stock prices on a dataset (after some feature engineering). When a run involved the XGBoost regressor, I got an error message saying:
feature_names mismatch:
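...followed by a list of my dataset's column names. A GitHub issue about this problem suggested a workaround: convert the X features and Y labels dataframes to NumPy arrays during train_test_split. That is what I did, but now I get a different error: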
X_train, X_test, Y_train, Y_test = train_test_split(X.values, Y.values, test_size = test_size, random_state = seed)
print('[INFO] Printing the shapes of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
[INFO] Printing the shapes of the training/testing feature/label sets...
(1374, 68)
(459, 68)
(1374,)
(459,)
Best pipeline: ExtraTreesRegressor(DecisionTreeRegressor(input_matrix, max_depth=1, min_samples_leaf=9, min_samples_split=11), bootstrap=False, max_features=0.8500000000000001, min_samples_leaf=1, min_samples_split=9, n_estimators=100)
Traceback (most recent call last):
  File "main2.py", line 656, in <module>
    predictions = best_model.predict(X_test)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\tpot\base.py", line 921, in predict
    return self.fitted_pipeline_.predict(features)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\metaestimators.py", line 116, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\pipeline.py", line 422, in predict
    return self.steps[-1][-1].predict(Xt, **predict_params)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\ensemble\forest.py", line 693, in predict
    X = self._validate_X_predict(X)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\ensemble\forest.py", line 359, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\tree\tree.py", line 402, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 68 and input n_features is 69
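The GitHub issue is now closed, but I'm hoping someone here can explain what I'm missing. As you can see, there are 68 feature columns and 1 label column. You'll also notice that this time the model doesn't even use XGBoost, but I'd like to be able to call .predict() on whatever model TPOT comes up with.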
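One possible reading of the 68-versus-69 mismatch: in TPOT's pipeline notation, `ExtraTreesRegressor(DecisionTreeRegressor(input_matrix, ...), ...)` indicates that the inner model's predictions are added to the feature matrix as one extra column before the outer model sees it. The following is a minimal pure-Python sketch of that idea only. `append_prediction_column` and `toy_predict` are hypothetical names, not TPOT or scikit-learn APIs, and the position of the extra column is illustrative.

```python
def append_prediction_column(rows, predict):
    """Return a copy of rows with predict(row) appended as an extra feature."""
    return [list(row) + [predict(row)] for row in rows]


def toy_predict(row):
    # Hypothetical stand-in for the inner DecisionTreeRegressor's prediction.
    return sum(row) / len(row)


# Three rows with 68 raw feature columns, mirroring the shapes printed above.
raw_rows = [[float(i + j) for j in range(68)] for i in range(3)]
stacked_rows = append_prediction_column(raw_rows, toy_predict)

print(len(raw_rows[0]))      # width of the raw features
print(len(stacked_rows[0]))  # width seen by the final estimator
```

The shapes printed for `X_train` and `X_test` describe the raw inputs, so they can all read 68 while the last step of such a pipeline still expects 69.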
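Updated code
OK, I'm really stuck here. I've posted working code below that reproduces the error. Let me know what you see. Enter the stock ticker CLVS. I've added prints of the dataframe and array shapes throughout, and they all say the shapes are fine, so what am I not seeing? You'll need Pandas 0.23 (yes, an old version) with TPOT and Dask installed. Thanks: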
def main():
    # 1. Input a stock ticker
    ticker_input = input('Which stock ticker would you like to predict?') # Start with CLVS for testing
    print('Getting the historical data for: ',ticker_input)
    # 2. Download the historical daily data
    # Import dependencies
    from datetime import datetime
    from pandas_datareader import data as web
    import pandas as pd
    pd.options.display.float_format = '{:,.2f}'.format
    import seaborn as sns
    import matplotlib.pyplot as plt
    import random
    import os
    import numpy as np
    import time
    # Downloading historical data as dataframe
    ex = 'yahoo'
    start = datetime(2000, 1, 1)
    end = datetime.now()
    dataset = web.DataReader(ticker_input, ex, start, end) #.reset_index()
    # 3. Construct the dataframe from the historical data
    # Only use the Adj Close, and use the open price
    # of the current day. Then shift all the other
    # data 1 day to make the dataset include the
    # previous day's values for each.
    # (This is because on the trading day, we won't know what the
    # High or Low or Close or Volume is, but we would
    # know the Open.)
    dataset = dataset.drop(['Close'],axis=1)
    dataset['PrevOpen'] = dataset['Open'].shift(1)
    dataset['PrevHigh'] = dataset['High'].shift(1)
    dataset['PrevLow'] = dataset['Low'].shift(1)
    dataset['PrevAdjClose'] = dataset['Adj Close'].shift(1)
    dataset['PrevVol'] = dataset['Volume'].shift(1)
    dataset = dataset.drop(['High'],axis=1)
    dataset = dataset.drop(['Low'],axis=1)
    dataset = dataset.drop(['Volume'],axis=1)
    # Add in moving averages based on Opening prices
    dataset['9MA'] = dataset['Open'].rolling(window=9).mean()
    dataset['20MA'] = dataset['Open'].rolling(window=20).mean()
    # Get which industry the stock is in to get the industry performance data
    from bs4 import BeautifulSoup
    import requests
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    # Get the industry name of the stock
    url = 'https://finance.yahoo.com/quote/' + ticker_input + '/profile'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    table = soup.find('p', {'class' :'D(ib) Va(t)'})
    industry = table.findAll('span')
    indust = industry[3].text
    print(indust)
    print('Getting Industry ETF historical data...')
    # Then get historical data for that industry's ETF
    if indust == "Biotechnology":
        etf_ticker = "IBB"
    elif indust == "Specialty Retail":
        etf_ticker = "XRT"
    elif indust == "Oil & Gas E&P":
        etf_ticker = "XOP"
    ex = 'yahoo'
    etf_df = web.DataReader(etf_ticker, ex, start, end)
    dataset['PrevIndOpen'] = etf_df['Open'].shift(1)
    dataset['PrevIndHigh'] = etf_df['High'].shift(1)
    dataset['PrevIndLow'] = etf_df['Low'].shift(1)
    dataset['PrevIndClose'] = etf_df['Adj Close'].shift(1)
    dataset['PrevIndVol'] = etf_df['Volume'].shift(1)
    # Reshape the dataframe to put Adj Close at the far right
    # so when we export the predictions dataset, the predictions
    # column will be right next to it for easier analysis
    dataset = dataset[['Open','9MA','20MA','PrevOpen','PrevHigh','PrevLow','PrevAdjClose','PrevVol','PrevIndOpen','PrevIndHigh','PrevIndLow','PrevIndClose','PrevIndVol','Adj Close']]
    # Disable the Future Warnings that repeat "needlessly" (for now)
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    warnings.filterwarnings("ignore")
    # 5. Explore the initial dataset
    # Show the shape of the dataset
    print("[INFO] features shape : {}".format(dataset.shape))
    # Print the feature names
    print("[INFO] dataset names : {}".format(dataset.columns))
    # Convert the dataframe into a Pandas dataframe and print the first 5 rows
    df = pd.DataFrame(dataset)
    print("[INFO] df type : {}".format(type(df)))
    print("[INFO] df shape: {}".format(df.shape))
    print(df.head())
    # Specify the column names and print
    df.columns = dataset.columns
    #print('[INFO] df shape with features:')
    #print(df.head())
    # This prints the same as above
    # Find any columns with missing values? If you find them, you either have to:
    # 1. Replace the missing value with a large negative number (e.g. -999).
    # 2. Replace the missing value with mean of the column.
    # 3. Replace the missing value with median of the column.
    # Because of our 1 day shift, the first row will have empty values,
    # so we'll drop them as one day won't make much difference in our entire model
    print('[INFO] Checking for any columns with no values...')
    df = df.dropna(how='any')
    print(pd.isnull(df).any())
    # Ensure numeric datatypes of the dataframe.
    # If a column has different datatype such as string or character,
    # we need to map that column to a numeric datatype such as integer
    # or float. For this dataset, the Date index column is one.
    print('[INFO] Feature types:')
    print(df.dtypes)
    # Print a statistical summary of the dataset for reference
    print('[INFO] Print a statistical summary of dataset:')
    print(df.describe())
    # # Reset the index column for FeatureTools to use Date as the index, then it'll revert it back after feature stuff is done
    # df = df.reset_index()
    # This is not a good way to drop the rows here because if there are any
    # nan values in the middle of the dataset, those will get lost too.
    # Need to work with this
    df = df.dropna()
    print(df)
    # 4. Hold out a prediction dataset to predict on later
    prediction_df = df.tail(90).copy()
    df = df.iloc[:-90,:].copy() # subtracting 90 rows/days from the dataset to use as the predictions dataset later
    # 7. Split the dataset into features (X) and target (Y)
    # Split into features (x) and target (y) and print the shapes of them
    X = df.drop("Adj Close", axis=1)
    Y = df["Adj Close"]
    print('Shape of features: ', X.shape)
    print('Shape of target: ', Y.shape)
    # Standardize the data. Commenting this out until you can figure out how to
    # unscale the prediction dataset for review
    #from sklearn.preprocessing import StandardScaler, MinMaxScaler
    #scaler = MinMaxScaler().fit(X)
    #scaled_X = scaler.transform(X)
    print('Printing X and Y shape :')
    print(X.shape)
    print(Y.shape)
    # 8. Split dataset into training and validation data
    # Split the data into training and testing data and print their shapes
    from sklearn.model_selection import train_test_split
    seed = 9
    test_size = 0.25
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
    print('[INFO] Printing the shapes of the training/testing feature/label sets...')
    print(X_train.shape)
    print(X_test.shape)
    print(Y_train.shape)
    print(Y_test.shape)
    X_train=X_train.values
    X_test=X_test.values
    Y_train=Y_train.values
    Y_test=Y_test.values
    print('[INFO] Printing the arrays of the training/testing feature/label sets...')
    print(X_train.shape)
    print(X_test.shape)
    print(Y_train.shape)
    print(Y_test.shape)
    # 9. Start a TPOT Auto Regression to find the best Regression model and export feature importances
    from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
    from tpot import TPOTRegressor
    import os
    # Create a custom config dictionary for TPOT to use.
    # I've made this list full of Regressors that use the
    # .feature_importances_ attribute. How to implement XGBoost
    # into the plotting of feature importances below? IF XGBOOST is
    # present in the final model, then plot one way, ELSE, plot the
    # way it is now?
    tpot_config = {
        'sklearn.ensemble.ExtraTreesRegressor': {
            'n_estimators': [100],
            'max_features': np.arange(0.05, 1.01, 0.05),
            'min_samples_split': range(2, 21),
            'min_samples_leaf': range(1, 21),
            'bootstrap': [True, False]
        },
        'sklearn.tree.DecisionTreeRegressor': {
            'max_depth': range(1, 11),
            'min_samples_split': range(2, 21),
            'min_samples_leaf': range(1, 21)
        },
        'sklearn.ensemble.RandomForestRegressor': {
            'n_estimators': [100],
            'max_features': np.arange(0.05, 1.01, 0.05),
            'min_samples_split': range(2, 21),
            'min_samples_leaf': range(1, 21),
            'bootstrap': [True, False]
        },
        # Preprocessors
        'sklearn.preprocessing.Binarizer': {
            'threshold': np.arange(0.0, 1.01, 0.05)
        },
        'sklearn.decomposition.FastICA': {
            'tol': np.arange(0.0, 1.01, 0.05)
        },
        'sklearn.cluster.FeatureAgglomeration': {
            'linkage': ['ward', 'complete', 'average'],
            'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']
        },
        'sklearn.preprocessing.MaxAbsScaler': {
        },
        'sklearn.preprocessing.MinMaxScaler': {
        },
        'sklearn.preprocessing.Normalizer': {
            'norm': ['l1', 'l2', 'max']
        },
        'sklearn.kernel_approximation.Nystroem': {
            'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'],
            'gamma': np.arange(0.0, 1.01, 0.05),
            'n_components': range(1, 11)
        },
        'sklearn.decomposition.PCA': {
            'svd_solver': ['randomized'],
            'iterated_power': range(1, 11)
        },
        'sklearn.preprocessing.PolynomialFeatures': {
            'degree': [2],
            'include_bias': [False],
            'interaction_only': [False]
        },
        'sklearn.kernel_approximation.RBFSampler': {
            'gamma': np.arange(0.0, 1.01, 0.05)
        },
        'sklearn.preprocessing.RobustScaler': {
        },
        'sklearn.preprocessing.StandardScaler': {
        },
        'tpot.builtins.ZeroCount': {
        },
        'tpot.builtins.OneHotEncoder': {
            'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
            'sparse': [False],
            'threshold': [10]
        },
        # Selectors
        'sklearn.feature_selection.SelectFwe': {
            'alpha': np.arange(0, 0.05, 0.001),
            'score_func': {
                'sklearn.feature_selection.f_regression': None
            }
        },
        'sklearn.feature_selection.SelectPercentile': {
            'percentile': range(1, 100),
            'score_func': {
                'sklearn.feature_selection.f_regression': None
            }
        },
        'sklearn.feature_selection.VarianceThreshold': {
            'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]
        },
        'sklearn.feature_selection.SelectFromModel': {
            'threshold': np.arange(0, 1.01, 0.05),
            'estimator': {
                'sklearn.ensemble.ExtraTreesRegressor': {
                    'n_estimators': [100],
                    'max_features': np.arange(0.05, 1.01, 0.05)
                }
            }
        }
    }
    # Cross Validation folds to run
    folds = 10
    # Start the TPOT regression
    best_model = TPOTRegressor(use_dask=True,n_jobs=-1,config_dict=tpot_config, cv=folds,
                               generations=5, population_size=20, verbosity=2, random_state=seed) #memory='./PipelineCache', memory='auto',
    best_model.fit(X_train, Y_train)
    # Export the TPOT pipeline if you want to use it for anything later
    if os.path.exists('./Exported Pipelines'):
        pass
    else:
        os.mkdir('./Exported Pipelines')
    best_model.export('./Exported Pipelines/' + ticker_input + '-prediction-pipeline.py')
    # Extract what the best pipeline was and fit it to the training set
    # to get an idea of the most important features used by the model
    exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1]
    # Train the `exctracted_best_model` using the training/validation set.
    # You need to use the whole dataset in order to get feature importance for all the
    # features in your dataset.
    exctracted_best_model.fit(X_train, Y_train)
    # plot model's feature importance and save the plot for later
    feature_importance = exctracted_best_model.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.barh(pos, feature_importance[sorted_idx], align='center')
    plt.yticks(pos, df.columns[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    plt.savefig("feature_importance.png")
    plt.clf()
    plt.close()
    print(X_test.shape)
    # 10. See the stats of the validation predictions from the tuned model and export more plots
    # Make predictions using the tuned model and display error metrics
    # R2 and Explained Variance, best is 1
    predictions = best_model.predict(X_test)
    print('=============================')
    print("TPOT's final score on testing dataset is : ", best_model.score(X_test, Y_test))
    print('=============================')
    print("[INFO] MSE on test set : {}".format(round(mean_squared_error(Y_test, predictions), 3)))
    print('[INFO] R2 Score on test set : {}'.format(round(r2_score(Y_test, predictions), 3)))
    print('[INFO] Explained Variance Score on test set : {}'.format(round(explained_variance_score(Y_test, predictions), 3)))
    # Plot between predictions and Y_test
    x_axis = np.array(range(0, predictions.shape[0]))
    plt.plot(x_axis, predictions, linestyle="--", marker="o", alpha=0.7, color='r', label="predictions")
    plt.plot(x_axis, Y_test, linestyle="--", marker="o", alpha=0.7, color='g', label="Y_test")
    plt.xlabel('Row number')
    plt.ylabel('PRICE')
    plt.title('Predictions vs Y_test')
    plt.legend(loc='lower right')
    plt.savefig("predictions_vs_ytest.png")
    plt.clf()
    plt.close()
    # 11. Use the model on the held-out prediction dataset
    # Now, run the model on the prediction dataset
    features = prediction_df.drop(['Adj Close'], axis=1)
    labels = prediction_df['Adj Close']
    # Fit the model to the prediction_df and predict the labels
    #tpot.fit(features, labels)
    results = best_model.predict(features)
    predictions_list = []
    for preds in results:
        predictions_list.append(preds)
    prediction_df['Predictions'] = predictions_list
    prediction_df.to_csv('Final Predictions Performance.csv', index=True)
    print('============================')
    print("[INFO] MSE on prediction set : {}".format(round(mean_squared_error(labels, results), 3)))
    print('[INFO] R2 Score on prediction set : {}'.format(round(r2_score(labels, results), 3)))
    print('[INFO] Explained Variance Score on prediction set : {}'.format(round(explained_variance_score(labels, results), 3)))
    # 12. Review the exported .csv file of the predictions, and review all your plots
    print('DONE!')

if __name__ == "__main__":
    main()
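It looks like I found a solution. I've run several models that use XGBRegressor and RandomDecisionTrees, and it seems to be working.
Just enable "X_train=X_train.values" and "X_test=X_test.values", but leave Y alone as a dataframe, because I got an error when I changed both sets. So I'm keeping it that way for now.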
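A further possible cause, offered as a guess rather than a confirmed diagnosis: `best_model.fitted_pipeline_.steps[-1][1]` returns the final estimator object itself, not a copy. Calling `.fit(X_train, Y_train)` on it for the feature-importance plot would then refit that step on the raw columns, while an earlier step of the pipeline may still hand it one extra column at predict time. The toy classes below (`ToyStacker`, `ToyEstimator`, and `ToyPipeline` are hypothetical, pure-Python stand-ins, not TPOT or scikit-learn classes) reproduce that pattern and show refitting a `copy.deepcopy` instead; in scikit-learn, `sklearn.base.clone` serves a similar purpose.

```python
import copy


class ToyStacker:
    """Stand-in for a stacking step: appends one synthetic column per row."""

    def transform(self, rows):
        return [list(row) + [sum(row)] for row in rows]


class ToyEstimator:
    """Stand-in for a final estimator that remembers its fitted width."""

    def fit(self, rows, targets):
        self.n_features_ = len(rows[0])
        return self

    def predict(self, rows):
        width = len(rows[0])
        if width != self.n_features_:
            raise ValueError(
                "Model n_features is %d and input n_features is %d"
                % (self.n_features_, width)
            )
        return [0.0 for _ in rows]


class ToyPipeline:
    """Two-step pipeline: stacker first, final estimator second."""

    def __init__(self):
        self.steps = [("stack", ToyStacker()), ("final", ToyEstimator())]

    def fit(self, rows, targets):
        self.steps[-1][1].fit(self.steps[0][1].transform(rows), targets)
        return self

    def predict(self, rows):
        return self.steps[-1][1].predict(self.steps[0][1].transform(rows))


X = [[1.0] * 68 for _ in range(4)]
y = [0.0] * 4

# Safe: refit a copy, leaving the pipeline's own final step untouched.
safe = ToyPipeline().fit(X, y)
copy.deepcopy(safe.steps[-1][1]).fit(X, y)
print(len(safe.predict(X)))

# Unsafe: refit the pipeline's final step in place on the raw columns.
broken = ToyPipeline().fit(X, y)
broken.steps[-1][1].fit(X, y)
try:
    broken.predict(X)
except ValueError as err:
    print(err)
```

If this is what happened, fitting a copy for the importance plot keeps `.predict()` usable on whatever pipeline TPOT selects.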