Pandas oversampling ragged sequential data
I am trying to use pandas to oversample my ragged data (groups of different lengths).
Given the following sample data:
import pandas as pd
x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],'f1':[11,12,13,22,22,33,34,35,36,44,55,66,66]})
y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,0,1,0,0,0]})
The data (groups are separated by ---):
id f1
0 1 11
1 1 12
2 1 13
-----------
3 2 22
4 2 22
-----------
5 3 33
6 3 34
7 3 35
8 3 36
-----------
9 4 44
-----------
10 5 55
-----------
11 6 66
12 6 66
The targets:
id target
0 1 1
1 2 0
2 3 1
3 4 0
4 5 0
5 6 0
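I want to balance the minority class. In the example above, target 1 is the minority class, with 2 samples: ids 1 and 3.
I am looking for a way to oversample the data so that the result would be: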
id f1
0 1 11
1 1 12
2 1 13
-----------
3 2 22
4 2 22
-----------
5 3 33
6 3 34
7 3 35
8 3 36
-----------
9 4 44
-----------
10 5 55
-----------
11 6 66
12 6 66
-----------------
13 7 11
14 7 12 Replica of id 1
15 7 13
-----------------
16 8 33
17 8 34 Replica of id 3
18 8 35
19 8 36
And the targets would be balanced:
id target
0 1 1
1 2 0
2 3 1
3 4 0
4 5 0
5 6 0
6 7 1
7 8 1
There are exactly 4 positive and 4 negative samples.
You can use:
import numpy as np
import pandas as pd

x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],
                  'f1':[11,12,13,22,22,33,34,35,36,44,55,66,66]})
#more general sample
y = pd.DataFrame({'id':[1,2,3,4,5,6,7],'target':[1,0,1,0,0,0,0]})

#count how many rows each class is missing compared with the largest class
s = y['target'].value_counts()
s1 = s.rsub(s.max())
#repeat each class label by its deficit, here [1, 1, 1]
new = s1.index.repeat(s1).tolist()

#create helper df with new ids and add to y
y1 = pd.DataFrame({'id':range(y['id'].max() + 1,y['id'].max() + len(new) + 1),
                   'target':new})
#DataFrame.append was removed in pandas 2.0, so use pd.concat
y2 = pd.concat([y, y1], ignore_index=True)
print (y2)

#filter by first value of new (the minority class)
add = y[y['target'].eq(new[0])]
#cycle through the minority ids with np.tile (np.repeat is also possible),
#add helper column with the new ids from y1 and merge with x
add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
          .head(len(new))
          .assign(new = y1['id'].tolist())
          .merge(x, on='id', how='left')
          .drop('id', axis=1)
          .rename(columns={'new':'id'}))
#add to x
x2 = pd.concat([x, add], ignore_index=True)
print (x2)
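To make the counting step easier to follow, here is a small sketch of what the `value_counts` / `rsub` / `repeat` lines compute for the 7-row `y` used above. The variable names `counts` and `deficit` are mine, not from the original answer.

```python
import pandas as pd

y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                  'target': [1, 0, 1, 0, 0, 0, 0]})

# Rows per class: class 0 has 5 rows, class 1 has 2 rows
counts = y['target'].value_counts()

# rsub computes counts.max() - counts, i.e. how many groups each class lacks
deficit = counts.rsub(counts.max())

# Repeat each class label by its deficit; classes with deficit 0 disappear
new = deficit.index.repeat(deficit).tolist()
print(new)  # [1, 1, 1]
```

So `new` holds one label per replica that has to be created, and `len(new)` is the number of new ids needed.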
The solution above only works for unbalanced data. If the data may sometimes already be balanced, add a check:
import numpy as np
import pandas as pd

#balanced sample (x is the same DataFrame as above)
y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,1,1,0,0,0]})
#repeat values 1 or 0 for balance target
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()
if len(new) > 0:
    #create helper df and add to y
    y1 = pd.DataFrame({'id':range(y['id'].max() + 1,y['id'].max() + len(new) + 1),
                       'target':new})
    #DataFrame.append was removed in pandas 2.0, so use pd.concat
    y2 = pd.concat([y, y1], ignore_index=True)
    print (y2)
    #filter by first value of new
    add = y[y['target'].eq(new[0])]
    #repeat values by np.tile (np.repeat is also possible),
    #add helper column by y1.id and merge to x
    add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
              .head(len(new))
              .assign(new = y1['id'].tolist())
              .merge(x, on='id', how='left')
              .drop('id', axis=1)
              .rename(columns={'new':'id'}))
    #add to x
    x2 = pd.concat([x, add], ignore_index=True)
    print (x2)
else:
    print ('y is already balanced')
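If you need this in more than one place, the same idea could be wrapped in a function. The following is only a sketch under my own assumptions: the name `oversample_groups` and its parameters are hypothetical, it uses `np.resize` to cycle through the minority ids instead of the `np.tile` plus `head` combination above, and it loops over every class with a deficit so it is not limited to two classes. It is run here on the data from the question.

```python
import numpy as np
import pandas as pd


def oversample_groups(x, y, id_col='id', target_col='target'):
    """Copy whole id-groups of smaller classes until every class has as many
    ids as the largest one. Returns new (x, y) DataFrames."""
    counts = y[target_col].value_counts()
    deficit = counts.max() - counts          # groups missing per class
    next_id = y[id_col].max() + 1            # first unused id
    new_x, new_y = [], []
    for cls, n_missing in deficit.items():
        if n_missing == 0:
            continue                         # class is already the largest
        source_ids = y.loc[y[target_col] == cls, id_col].to_numpy()
        # Cycle through the existing ids of this class, e.g. 1, 3, 1, 3, ...
        picked = np.resize(source_ids, n_missing)
        fresh = np.arange(next_id, next_id + n_missing)
        next_id += n_missing
        new_y.append(pd.DataFrame({id_col: fresh, target_col: cls}))
        # Map each picked id to its fresh id, then pull that group's rows from x
        mapping = pd.DataFrame({id_col: picked, '_new_id': fresh})
        copies = (mapping.merge(x, on=id_col, how='left')
                         .drop(columns=id_col)
                         .rename(columns={'_new_id': id_col}))
        new_x.append(copies)
    x_out = pd.concat([x, *new_x], ignore_index=True)
    y_out = pd.concat([y, *new_y], ignore_index=True)
    return x_out, y_out


x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 6],
                  'f1': [11, 12, 13, 22, 22, 33, 34, 35, 36, 44, 55, 66, 66]})
y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'target': [1, 0, 1, 0, 0, 0]})

x_out, y_out = oversample_groups(x, y)
print(y_out)
print(x_out)
```

When `y` is already balanced, the loop adds nothing and the function returns plain copies of `x` and `y`, so no separate `if`/`else` check is needed.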