Imblearn SMOTE: How to set the sampling_strategy parameter for a multiclass imbalanced dataset?
I am working with a network attack dataset that has the following shape:
df.shape
(1074992, 42)
The labels for attacks and normal behavior have the following counts:
df['Label'].value_counts()
normal 812814
neptune 242149
satan 5019
ipsweep 3723
portsweep 3564
smurf 3007
nmap 1554
back 968
teardrop 918
warezclient 893
pod 206
guesspasswd 53
bufferoverflow 30
warezmaster 20
land 19
imap 12
rootkit 10
loadmodule 9
ftpwrite 8
multihop 7
phf 4
perl 3
spy 2
Name: Label, dtype: int64
Next, I split the dataset into features and labels.
labels = df['Label']
features = df.loc[:, df.columns != 'Label'].astype('float64')
Then I try to balance the dataset.
print("Before UpSampling, counts of label Normal: {}".format(sum(labels == "normal")))
print("Before UpSampling, counts of label Attack: {} \n".format(sum(labels != "normal")))
Before UpSampling, counts of label Normal: 812814
Before UpSampling, counts of label Attack: 262178
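As you can see, the number of attacks is far out of proportion to the number of normal samples.
I tried to use SMOTE to bring the minority (attack) classes up to the same count as the majority (normal) class.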
from imblearn.over_sampling import SMOTE
sm = SMOTE(k_neighbors=1, random_state=42)  # Synthetic Minority Over-sampling Technique
features_res, labels_res = sm.fit_resample(features, labels)
features_res.shape, labels_res.shape
((18694722, 41), (18694722,))
What I don't understand is why I end up with 18694722 rows after applying SMOTE.
print("After UpSampling, counts of label Normal: {}".format(sum(labels_res == "normal")))
print("After UpSampling, counts of label Attack: {} \n".format(sum(labels_res != "normal")))
After UpSampling, counts of label Normal: 812814
After UpSampling, counts of label Attack: 17881908
In my case, is it better to downsample the normal class or to upsample the attack classes?
Any ideas on how to do this properly?
Thanks a lot.
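SMOTE's sampling_strategy defaults to 'auto', which for over-samplers is equivalent to 'not majority':
'not majority': resample all classes but the majority class
So each of the 22 attack classes is oversampled separately until it matches the majority class. With 812814 samples in the majority class and 23 classes in total, you end up with
812814 * 23 = 18694722
samples.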
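The arithmetic above can be checked directly from the label counts in the question. This is only a sketch of the bookkeeping, not a call into imblearn; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output in the question.
counts = {
    "normal": 812814, "neptune": 242149, "satan": 5019, "ipsweep": 3723,
    "portsweep": 3564, "smurf": 3007, "nmap": 1554, "back": 968,
    "teardrop": 918, "warezclient": 893, "pod": 206, "guesspasswd": 53,
    "bufferoverflow": 30, "warezmaster": 20, "land": 19, "imap": 12,
    "rootkit": 10, "loadmodule": 9, "ftpwrite": 8, "multihop": 7,
    "phf": 4, "perl": 3, "spy": 2,
}

majority = max(counts.values())   # size of the largest class ("normal")
n_classes = len(counts)           # 23 distinct labels

# With 'not majority', every other class is raised to the majority size.
total_after = majority * n_classes
attack_after = majority * (n_classes - 1)

print(n_classes, total_after, attack_after)
```

The two products match the shapes reported in the question: 18694722 rows overall, of which 17881908 are attack rows.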
Try passing a dictionary with the desired number of samples for each minority class. From the docs:
When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
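Example
Adapted from the docs. In this example, one of the minority classes (class 0) is upsampled so that it has the same number of samples as the majority class (class 4).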
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# Five classes; class 4 is the majority (weight 0.5).
X, y = make_classification(n_classes=5,
                           class_sep=2,
                           weights=[0.15, 0.15, 0.1, 0.1, 0.5],
                           n_informative=4,
                           n_redundant=1,
                           flip_y=0,
                           n_features=20,
                           n_clusters_per_class=1,
                           n_samples=1000,
                           random_state=10)
# Desired sample count per class: class 0 is raised to 500, the others keep their size.
sampling_strategy = {4: 500, 0: 500, 1: 150, 2: 100, 3: 100}
sm = SMOTE(sampling_strategy=sampling_strategy, random_state=0)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({4: 500, 0: 500, 1: 150, 3: 100, 2: 100})
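For a dataset with many classes, writing the dictionary by hand gets tedious. One possible approach is to build it from the label counts. The helper below is hypothetical (neither `build_sampling_strategy` nor the `floor` value of 5000 comes from imblearn or from the question), and it uses only a handful of the question's counts to keep the sketch short.

```python
def build_sampling_strategy(counts, floor):
    """Return {label: floor} for every class that has fewer than `floor` samples.

    Classes already at or above `floor` are left out of the dict, so an
    over-sampler given this dict would not touch them.
    """
    return {label: floor for label, n in counts.items() if n < floor}


# A few of the label counts from the question (subset, for illustration only).
counts = {"normal": 812814, "neptune": 242149, "satan": 5019, "nmap": 1554, "spy": 2}

# Assumed target: bring every small class up to 5000 samples.
strategy = build_sampling_strategy(counts, floor=5000)
print(strategy)
```

On the real data, the counts could come from `labels.value_counts().to_dict()` and the result could be passed as `SMOTE(sampling_strategy=strategy, k_neighbors=1)`. Keeping `k_neighbors=1` as in the question seems necessary here, since the smallest class has only 2 samples. Whether 5000 is a sensible target is a modeling decision; synthesizing thousands of points from a class with 2 samples may not add much real information.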