Scikit-learn: how to use Locally Linear Embedding on external datasets
Using the following pages:
https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html#sphx-glr-auto-examples-manifold-plot-lle-digits-py
https://scikit-learn.org/stable/auto_examples/manifold/plot_swissroll.html#sphx-glr-auto-examples-manifold-plot-swissroll-py
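I managed to get LLE working on the MNIST dataset and the swiss roll dataset, but I can't figure out how to get it running on an external dataset such as https://www.kaggle.com/manufacturingai/predicting-fraud-w-fast-ai.
My attempt is as follows: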
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from matplotlib import offsetbox
from sklearn import (manifold, datasets)
n_neighbors = 30
f_fontsize = 8
data = np.genfromtxt('../content/creditcard.csv', skip_header=True)
features = data[:, :3]
targets = data[:, 3] # The last column is identified as the target
def plotcreditfraudfig(X, color, X_sr, err):
    fig = plt.figure()
    ax = fig.add_subplot(211, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], cmap=plt.cm.Spectral)
    ax.set_title("Original data")
    ax = fig.add_subplot(212)
    ax.scatter(X_sr[:, 0], X_sr[:, 1], cmap=plt.cm.Spectral)
    plt.axis('tight')
    plt.xticks([]), plt.yticks([])
    plt.title('Projected data')
    plt.show()
clf = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2, method='standard')
clf.fit(X=features, y=targets)
print("Done. Reconstruction error: %g" %clf.reconstruction_error_)
X_llecf=clf.transform(X)
plot_embedding(X_llecf, "Locally Linear Embedding")
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-106-91224a1ba194> in <module>()
1 data = np.genfromtxt('../content/creditcard.csv', skip_header=True)
----> 2 features = data[:, :3]
3 targets = data[:, 3] # The last column is identified as the target
4
5 clf = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2, method='standard')
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
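A likely cause of this error, assuming creditcard.csv is comma-separated: np.genfromtxt splits on whitespace by default, so each line is read as a single field that cannot be parsed as a number, and the result is a 1-dimensional array. The sketch below reproduces this with a small in-memory file (the column names and values are made up for illustration) and shows that passing delimiter="," gives a 2-dimensional array that can be sliced.

```python
import io

import numpy as np

# Hypothetical stand-in for creditcard.csv: comma-separated with one header row.
csv_text = "Time,V1,V2,Class\n0,1.5,2.5,0\n1,3.5,4.5,1\n"

# Default delimiter is whitespace: every line becomes one unparsable field,
# so the array has a single dimension and data[:, :3] cannot work.
bad = np.genfromtxt(io.StringIO(csv_text), skip_header=1)
print(bad.ndim)

# With an explicit comma delimiter the rows are split into columns.
good = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
features = good[:, :3]
targets = good[:, 3]
print(features.shape, targets.shape)
```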
I got it working by changing the features and targets to the following (data is now a pandas DataFrame):
X_features = data.drop('Class', axis=1)
y_targets = data['Class']
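For reference, a minimal sketch of how the pandas-based split might feed into LocallyLinearEmbedding. The DataFrame here is synthetic random data standing in for the real file (an assumption, as are the column names V1 to V5), and fit_transform is used on the features directly because the name X passed to clf.transform in the attempt above is never defined. LLE is unsupervised, so the targets are only useful for coloring a plot.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic stand-in for the credit card table (assumption: numeric
# feature columns plus a 'Class' label column).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 5)),
                    columns=["V1", "V2", "V3", "V4", "V5"])
data["Class"] = rng.integers(0, 2, size=200)

X_features = data.drop("Class", axis=1)
y_targets = data["Class"]

# Fit and embed in one step; y is not needed by LLE.
lle = LocallyLinearEmbedding(n_neighbors=30, n_components=2,
                             method="standard", random_state=0)
X_lle = lle.fit_transform(X_features)

print(X_lle.shape)
print("Reconstruction error: %g" % lle.reconstruction_error_)
```

If the real table is large, fitting on a row sample (for example via data.sample with a fixed random_state) may be more practical, since LLE needs a neighbor search and an eigendecomposition over all rows.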
But I had to do more. Because the matrix was not positive semi-definite, I had to clean out some rows before declaring X_features and y_targets:
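def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # Drop rows with missing values (note: this modifies df in place)
    df.dropna(inplace=True)
    # Keep only rows without NaN or infinite entries; axis must be a keyword in recent pandas
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)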
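A small usage sketch of this cleaning step, under stated assumptions: the toy frame below is made up, and the function is a non-mutating variant that uses np.isfinite instead of dropna plus isin, so it is an alternative formulation rather than the exact code above. It shows the order described in the text: clean first, then declare X_features and y_targets.

```python
import numpy as np
import pandas as pd


def clean_dataset(df):
    """Return a float64 copy of df without rows containing NaN or +/-inf."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # True for rows where every entry is a finite number.
    finite_rows = np.isfinite(df.to_numpy(dtype=np.float64)).all(axis=1)
    return df[finite_rows].astype(np.float64)


# Hypothetical toy frame with one NaN row and one inf row.
raw = pd.DataFrame({
    "V1": [1.0, np.nan, 3.0, 4.0],
    "V2": [0.5, 1.5, np.inf, 2.5],
    "Class": [0, 1, 0, 1],
})

data = clean_dataset(raw)
X_features = data.drop("Class", axis=1)
y_targets = data["Class"]
print(data)
```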