After running a model, how do I save an Isolation Forest and a Local Outlier Factor as two different models?
I have been trying to write a machine learning program that uses the Isolation Forest and Local Outlier Factor methods from sklearn and pandas to detect credit card fraud.
I have the code running and making predictions, but I can't figure out how to save each one as a separate model. I have been following some examples, but I don't know where or how to save them. I think it is something like .save('Isolation.h5') and .save('Outlier.h5'), but I'm not sure what to put in front of .save.
Any help on how to save each model would be greatly appreciated.
My current code:
import numpy
import pandas
import matplotlib
import seaborn
import scipy
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset from the csv file using pandas
data = pd.read_csv('C:/Users/super/OneDrive/Documents/School/Spring 2020/CS 657/Final Project/creditcard.csv')
# Start exploring the dataset
print(data.columns)
data = data.sample(frac=0.1, random_state = 1)
print(data.shape)
print(data.describe())
# V1 - V28 are the results of a PCA Dimensionality reduction to protect user identities and sensitive features
# Plot histograms of each parameter
data.hist(figsize = (20, 20))
plt.show()
# Determine number of fraud cases in dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]
outlier_fraction = len(Fraud)/float(len(Valid))
print(outlier_fraction)
print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))
print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))
# Correlation matrix
corrmat = data.corr()
fig = plt.figure(figsize = (12, 9))
sns.heatmap(corrmat, vmax = .8, square = True)
plt.show()
# Get all the columns from the dataFrame
columns = data.columns.tolist()
# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we'll be predicting on
target = "Class"
X = data[columns]
Y = data[target]
# Print shapes
print(X.shape)
print(Y.shape)
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
# define random states
state = 1
# define outlier detection tools to be compared
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)}
# Fit the model
plt.figure(figsize=(9, 7))
n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    # Reshape the prediction values to 0 for valid, 1 for fraud.
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
Since you loop over all the classifiers and train them / make predictions, you can simply save each model inside that same loop.
For example, using pickle:
import pickle
def save_model(clf, filename):
    with open(filename, 'wb') as f:
        pickle.dump(clf, f)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
        save_model(clf, 'Outlier.pkl')  # Saving the LOF
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
        save_model(clf, 'Isolation.pkl')  # Saving the isolation forest
    ...
You can then load a model with:
def load_model(filename):
    with open(filename, 'rb') as f:
        clf = pickle.load(f)
    return clf
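As a quick sanity check after loading (a hypothetical usage sketch, reusing the X feature matrix from the code above): the reloaded IsolationForest can score data directly, while the pickled LocalOutlierFactor was fitted without novelty=True, so it only exposes what it learned on its training data.
# Hypothetical usage sketch: reload the isolation forest and score data.
iso_forest = load_model('Isolation.pkl')
scores = iso_forest.decision_function(X)   # higher score = more normal
labels = iso_forest.predict(X)             # 1 = inlier, -1 = outlier
# The pickled LOF (fitted without novelty=True) has no predict() for new
# data; you can only read back the scores it computed on the training set.
lof = load_model('Outlier.pkl')
print(lof.negative_outlier_factor_[:5])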
You could also save them in another format; whichever package you use, the idea is the same.
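For instance, here is a minimal sketch using joblib, which is a common choice for scikit-learn estimators (assuming the two classifiers have already been fitted in the loop above; the .joblib filenames are just placeholders):
import joblib
# Same idea as the pickle helpers above, just using joblib instead.
joblib.dump(classifiers["Isolation Forest"], 'Isolation.joblib')
joblib.dump(classifiers["Local Outlier Factor"], 'Outlier.joblib')
# Load the two models back independently.
iso_forest = joblib.load('Isolation.joblib')
lof = joblib.load('Outlier.joblib')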