Anomaly Detection with DBSCAN
I am using DBSCAN on my training dataset to find outliers and remove them before training a model. The training set has 7,697 rows and 8 columns. Here is my code:
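from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Standardize the features, then cluster; DBSCAN labels noise points as -1.
X = StandardScaler().fit_transform(X_train[all_features])
model = DBSCAN(eps=0.3, min_samples=10).fit(X)
print(model)
# Drop the rows flagged as noise and renumber the index.
X_train_1 = X_train.drop(X_train[model.labels_ == -1].index).copy()
X_train_1.reset_index(drop=True, inplace=True)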
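Q-1: Of these 7 features, some are discrete and some are continuous. Is it OK to scale both the discrete and the continuous ones, or should I scale only the continuous ones?
Q-2: Do I need to map the clusters learned on the training data onto the test data?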
DBSCAN will handle those outliers for you. That is what it is for. See the example below, and post back if you have further questions.
import seaborn as sns
import pandas as pd
titanic = sns.load_dataset('titanic')
titanic = titanic.copy()
titanic = titanic.dropna()
titanic['age'].plot.hist(
    bins = 50,
    title = "Histogram of the age variable"
)
from scipy.stats import zscore
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].apply(
    lambda x: x <= -2.5 or x >= 2.5
)
titanic[titanic["is_outlier"]]
ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x = "age", y = "fare")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns = ["age", "fare"])
ageAndFare.plot.scatter(x = "age", y = "fare")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
    eps = 0.5,
    metric = "euclidean",
    min_samples = 3,
    n_jobs = -1
)
clusters = outlier_detection.fit_predict(ageAndFare)
clusters
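import matplotlib.pyplot as plt
# cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap works across versions.
cmap = plt.get_cmap('Accent')
# Color each point by its DBSCAN label; noise points have label -1.
ageAndFare.plot.scatter(
    x = "age",
    y = "fare",
    c = clusters,
    cmap = cmap,
    colorbar = False
)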
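To tie this back to the original goal of removing outliers before training, here is a minimal sketch of dropping the rows DBSCAN labels as noise (label -1). The data is synthetic and the names `df`, `f1`, `f2`, `is_noise` and `df_clean`, as well as the `eps` and `min_samples` values, are assumptions chosen for illustration, not values tuned for the asker's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption):
# one dense blob plus three points placed far away from it.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
far = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
df = pd.DataFrame(np.vstack([dense, far]), columns=["f1", "f2"])

# Scale first so both features contribute equally to the distance.
X = StandardScaler().fit_transform(df)

# fit_predict returns one label per row; -1 marks noise points.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Boolean mask of noise rows, then keep everything else.
is_noise = labels == -1
df_clean = df.loc[~is_noise].reset_index(drop=True)

print(f"flagged {is_noise.sum()} of {len(df)} rows as noise")
```

How many rows get flagged depends heavily on `eps` and `min_samples`, so it is worth checking the flagged rows by eye before dropping them.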