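How to keep/extend the index when oversampling
I have a DataFrame like the one below and I want to oversample on the "role" column (in the real case, the number of rows/columns is much larger than in this minimal example):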
role value
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 2
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 2
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 2
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_3 2 1
[...........]
Index: 20 entries, pop_13vdpn1_site_1 to pop_13vdpn1_site_1
Data columns (total 2 columns):
role 20 non-null int64
value 20 non-null int64
This is what I am doing:
X,y = smote.fit_sample(df,df[['role']])
X
role value
0 1 1
1 1 1
2 1 2
3 1 1
4 1 1
5 1 2
6 1 1
7 2 1
8 2 1
[.........]
It works, but the problem is that I need to keep the index (pop_13vdpn1_site_1, etc.). Is this possible?
First, you need to process df and split the features and the target label into X_train and y_train.
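The split described above could look roughly like the following. This is only a sketch: the four-row frame is a made-up miniature of the question's data, and treating "role" as the target is an assumption taken from the question's own call.

```python
import pandas as pd

# Hypothetical miniature of the question's frame, with site names as the index.
df = pd.DataFrame(
    {"role": [1, 1, 2, 2], "value": [1, 2, 1, 1]},
    index=[
        "pop_13vdpn1_site_1",
        "pop_13vdpn1_site_1",
        "pop_13vdpn1_site_2",
        "pop_13vdpn1_site_3",
    ],
)

target = "role"  # assumption: the column to balance on

# Features: every column except the target. The index is carried along.
X_train = df.drop(columns=[target])

# Target kept as a one-column DataFrame so that y_train.columns exists later.
y_train = df[[target]]
```

Both pieces still carry the original index at this point; it is the resampling step that drops it.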
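Now you can oversample (recent imbalanced-learn releases call this method fit_resample; the older fit_sample name has been removed):
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)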
Finally, create DataFrames from the output above. For example:
X = pd.DataFrame(X_train_over, columns=X_train.columns)
y = pd.DataFrame(y_train_over, columns=y_train.columns)
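The rebuilt frames above get a fresh integer index. One way to extend the original index over the new rows is sketched below. It assumes the resampler returns the original rows first and appends the synthetic ones after them, which I believe is what SMOTE in imbalanced-learn does, but it is worth verifying on your version. The helper name extend_index and the "synthetic_" prefix are made up for illustration.

```python
import pandas as pd


def extend_index(original_index, n_resampled, prefix="synthetic_"):
    """Return the original labels followed by placeholder labels for new rows."""
    n_new = n_resampled - len(original_index)
    if n_new < 0:
        raise ValueError("resampled data has fewer rows than the original")
    return pd.Index(list(original_index) + [f"{prefix}{i}" for i in range(n_new)])


# Stand-in for the resampler output: 3 original rows plus 2 synthetic ones.
original_index = pd.Index(
    ["pop_13vdpn1_site_1", "pop_13vdpn1_site_1", "pop_13vdpn1_site_2"]
)
X = pd.DataFrame({"value": [1, 2, 1, 1, 2]})

# Original rows get their labels back; appended rows get placeholder labels.
X.index = extend_index(original_index, len(X))
```

Synthetic rows are interpolated between existing samples, so they have no natural site name; a placeholder label makes that explicit instead of guessing one.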
I finally found a workaround (probably not optimal):
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_tmp = df.reset_index()
df_tmp['index'] = le.fit_transform(df_tmp['index'])
aa,bb = smote.fit_resample(df_tmp,df_tmp[['role']])  # fit_sample in older imbalanced-learn releases
aa['index'] = le.inverse_transform(aa['index'])
aa.set_index('index')
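If exact copies of existing rows are acceptable in place of SMOTE's interpolated rows, the index can be kept without any encoding step. The sketch below is plain random oversampling done with pandas, not SMOTE, and the eight-row frame is a made-up miniature of the question's data.

```python
import pandas as pd

# Hypothetical miniature: role 1 is the minority class (3 rows vs 5).
df = pd.DataFrame(
    {"role": [1, 1, 1, 2, 2, 2, 2, 2], "value": [1, 2, 1, 1, 1, 2, 1, 1]},
    index=["pop_13vdpn1_site_1"] * 4
    + ["pop_13vdpn1_site_2"] * 3
    + ["pop_13vdpn1_site_3"],
)

counts = df["role"].value_counts()
n_target = counts.max()

# For every class below the majority size, draw the missing rows with replacement.
extra = [
    df[df["role"] == cls].sample(n=n_target - n, replace=True, random_state=0)
    for cls, n in counts.items()
    if n < n_target
]

# All original rows are kept; the sampled duplicates carry their own index labels.
balanced = pd.concat([df, *extra])
```

Every row in balanced is a real row from df, so each one keeps a genuine site label. The trade-off is that duplicated rows add no new information, which is the reason SMOTE interpolates instead.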
The following should do it.
import io
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
Example data:
df = pd.read_csv(io.StringIO("""
role value
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 2
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 2
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 2
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_3 2 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 2
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 2
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 2
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_3 2 1
"""), sep=r"\s+", engine="python")
df = df.reset_index()
The shape should be (40, 3):
df.shape
SMOTE accepts arrays, so we need to define the X and y values.
X_train = np.array(df['role']).reshape(-1, 1)  # 2-D feature matrix, one column
y_train = np.array(df['value'])                # 1-D target vector
The actual SMOTE step:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X,y = sm.fit_resample(X_train,y_train)
Put the resulting X and y into a DataFrame:
ndf = pd.DataFrame({'role': X.ravel(), 'value': y})  # ravel() avoids hard-coding the resampled length (68 here)
Rebuild the original names:
ndf['name'] = ndf['role'].apply(lambda x: 'pop_13vdpn1_site_'+str(x))
Check whether the data is now more balanced:
from collections import Counter
Counter(df['role'])
Counter(ndf['role'])