Oversampling a sparse dataset in Python
I have a dataset with multi-label data. There are 20 labels (from 0 to 20) which are unevenly distributed among them. Here is a preview of the data:
|id |label|value |
|-----|-----|------------|
|95534|0 |65.250002088|
|95535|18 | |
|95536|0 | |
|95536|0 |100 |
|95536|0 | |
|95536|0 |53.68547236 |
|95536|0 | |
|95537|1 | |
|95538|0 | |
|95538|0 | |
|95538|0 | |
|95538|0 |656.06155202|
|95538|0 | |
|95539|2 | |
|5935 |0 | |
|5935 |0 |150 |
|5935 |0 |50 |
|5935 |0 |24.610985335|
|5935 |0 | |
|5935 |0 |223.81789584|
|5935 |0 |148.1805218 |
|5935 |0 |110.9712538 |
|34147|19 |73.62651909 |
|34147|19 | |
|34147|19 |53.35958016 |
|34147|19 | |
|34147|19 | |
|34147|19 | |
|34147|19 |393.54029411|
I want to oversample the data and balance it across the labels. I have come across methods such as SMOTE and SMOTENC, but they both require splitting the data into training and test sets, and they don't work with sparse data. Is there any way to do this on the whole data as a preprocessing step before splitting?
Actually, in theory, you should not upsample the test set at all.
In a class-imbalance setting, artificially balancing the test/validation set does not make any sense: these sets must remain realistic. You want to test your classifier in a real-world setting where, say, the negative class comprises 99% of the samples, in order to see how well your model does in predicting the 1% positive class of interest without producing too many false positives. Artificially inflating the minority class or deflating the majority one will lead to performance metrics that are unrealistic and have no actual relation to the real-world problem you are trying to solve.
Re-balancing makes sense only in the training set, to prevent the classifier from simply and naively classifying all instances as negative for a perceived accuracy of 99%.
Hence, you can rest assured that in the setting you describe, re-balancing should act on the training set/folds only.
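To make that concrete, a minimal sketch of the split-first workflow (using a hypothetical toy dataset and scikit-learn's `train_test_split`): stratify the split so both sets keep the real-world class ratio, then oversample only the training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 rows of label 0, 10 rows of label 1
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})

# Split FIRST, stratifying so both sets keep the real-world class ratio
train, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=0
)

# Oversample minority classes in the TRAINING set only: sample each class
# with replacement up to the size of the largest class
max_count = train["label"].value_counts().max()
train_balanced = pd.concat(
    g.sample(n=max_count, replace=True, random_state=0)
    for _, g in train.groupby("label")
).reset_index(drop=True)

print(train_balanced["label"].value_counts())  # classes now equal
print(test["label"].value_counts())            # still imbalanced, as it should be
```

The test set is never touched, so any metric computed on it reflects the true class distribution.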
To sample rows so that each label is drawn with equal probability:

- the probability of drawing a row of a given label should be 1/n_labels
- for a given label l, the probability of drawing any given row should be 1/n_rows, where n_rows is the number of rows for that label

So the probability for each row is p_row = 1/(n_labels*n_rows). You can generate these values with a groupby and pass them to df.sample, as follows:
import numpy as np
import pandas as pd
df_dict = {'id': {0: 95535, 1: 95536, 2: 95536, 3: 95536, 4: 95536, 5: 95536, 6: 95537, 7: 95538, 8: 95538, 9: 95538, 10: 95538, 11: 95538, 12: 95539, 13: 5935, 14: 5935, 15: 5935, 16: 5935, 17: 5935, 18: 5935, 19: 5935, 20: 5935, 21: 34147, 22: 34147, 23: 34147, 24: 34147, 25: 34147, 26: 34147, 27: 34147}, 'label': {0: 18, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 2, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 19, 22: 19, 23: 19, 24: 19, 25: 19, 26: 19, 27: 19}, 'value': {0: ' ', 1: ' ', 2: '100 ', 3: ' ', 4: '53.68547236 ', 5: ' ', 6: ' ', 7: ' ', 8: ' ', 9: ' ', 10: '656.06155202', 11: ' ', 12: ' ', 13: ' ', 14: '150 ', 15: '50 ', 16: '24.610985335', 17: ' ', 18: '223.81789584', 19: '148.1805218 ', 20: '110.9712538 ', 21: '73.62651909 ', 22: ' ', 23: '53.35958016 ', 24: ' ', 25: ' ', 26: ' ', 27: '393.54029411'}}
df = pd.DataFrame.from_dict(df_dict)
# per-row count of rows sharing that row's label
n_labels = df.label.nunique()
n_rows = df.groupby("label").id.transform("count")
weights = 1 / (n_rows * n_labels)
# sanity check: the probabilities should sum to 1
assert np.isclose(weights.sum(), 1.0)
df_samples = df.sample(n=40000, weights=weights, replace=True, random_state=19)
Verify that the label draws are roughly uniform:
print(df_samples.label.value_counts()/len(df_samples))
# sampling frequency by group:
# 0 0.203325
# 2 0.201075
# 18 0.200925
# 19 0.198850
# 1 0.195825
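If you want exact balance rather than balance in expectation, an alternative sketch (assuming pandas >= 1.1, which added `GroupBy.sample`) draws a fixed number of rows per label with replacement:

```python
import pandas as pd

# Hypothetical imbalanced toy data
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "label": [0, 0, 0, 0, 1, 2],
})

# Draw exactly n_per_label rows per label (with replacement), so every
# label ends up with the same count instead of being equal on average
n_per_label = 1000
df_balanced = (
    df.groupby("label", group_keys=False)
      .sample(n=n_per_label, replace=True, random_state=19)
      .reset_index(drop=True)
)

print(df_balanced["label"].value_counts())  # every label appears exactly 1000 times
```

This trades the single weighted draw for one draw per group, but guarantees identical label counts in the output.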