sklearn train_test_split returns some elements in both test/train
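I have a dataset X containing 260 unique observations.
When I run x_train,x_test,_,_=train_test_split(X,y,test_size=0.2)
I would assume that
[p for p in x_test if p in x_train]
would be empty, but it is not. In fact, only two observations in x_test are not in x_train.
Is this intended, or...?
EDIT (posting the data I'm using):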
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])
x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)
len([p for p in x_test if p in x_train]) #is not 0
EDIT 2.0: showing that the test works
a=np.array([[1,2,3],[4,5,6]])
b=np.array([[1,2,3],[11,12,13]])
len([p for p in a if p in b]) #1
You need to check using the following:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])
x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)
len([p for p in x_test.tolist() if p in x_train.tolist()])
0
With x_test.tolist(), the in operator works as expected.
Reference: testing whether a Numpy array contains a given row
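The list-based check above rebuilds `x_train.tolist()` for every test row and scans it linearly. A minimal sketch of the same exact-equality check using a set of tuples is shown below; the helper name `shared_rows` and the toy arrays are made up for illustration and are not part of sklearn or NumPy.

```python
import numpy as np

def shared_rows(test_rows, train_rows):
    """Return the rows of test_rows that also occur, value for value, in train_rows."""
    # Tuples are hashable, so each membership test is a set lookup
    train_set = {tuple(row) for row in train_rows.tolist()}
    return [row for row in test_rows.tolist() if tuple(row) in train_set]

# Toy data standing in for x_train / x_test
train = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
test = np.array([[4.0, 5.0, 6.0], [1.0, 5.0, 9.0]])

common = shared_rows(test, train)
print(len(common))
```

This compares rows by exact floating-point equality, just like the `tolist()` version, so it only detects rows that are bit-for-bit identical.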
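This is not a bug in sklearn's implementation of train_test_split, but a quirk of how the in operator works on numpy arrays. The in operator first does an elementwise comparison between the two arrays and returns True if any element matches.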
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5]])
a in b # True
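To connect this to the check in the question, where a single row is tested against a 2-D array, here is a small sketch comparing `in` with an explicit row-wise test. The names `matrix` and `row` are illustrative only.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([9, 5, 9])  # not a row of matrix, but shares the value 5 in column 1

# `row in matrix` behaves like an elementwise comparison followed by any()
contains_result = row in matrix
elementwise_hit = bool((matrix == row).any())

# A true row-membership test requires every column of some row to match
rowwise_hit = bool((matrix == row).all(axis=1).any())

print(contains_result, elementwise_hit, rowwise_hit)
```

With real-valued features that repeat across samples, a single coinciding value in the same column is enough to make `p in x_train` come out True, which would explain why almost every test row appears to be "in" the training set.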
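The correct way to test for this kind of overlap is to use the equality operator together with np.all and np.any. As a bonus, you also get the indices of the overlap for free.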
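import numpy as np
# Case 1: no row of a appears in b
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5], [7, 8, 9]])
# `a in b` is not a row-membership test (and shapes (2, 3) and (3, 3) do not broadcast), so skip it
z = np.any(np.all(a == b[:, None, :], -1)) # False
# Case 2: a[0] equals b[1]
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [1, 2, 3], [7, 8, 9]])
overlap = np.all(a == b[:, None, :], -1) # overlap[i, j] is True when b[i] equals a[j]
z = np.any(overlap) # True
indices = np.nonzero(overlap) # (array([1]), array([0])): b[1] matches a[0]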
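The broadcasting test can be wrapped in a small helper so it is easy to apply to a train/test pair. This is only a sketch: `overlapping_rows` is a hypothetical name, and the toy arrays stand in for `x_train` and `x_test`.

```python
import numpy as np

def overlapping_rows(a, b):
    """Return (b_idx, a_idx) index arrays such that b[b_idx[k]] equals a[a_idx[k]]."""
    a = np.atleast_2d(a)
    b = np.atleast_2d(b)
    # Broadcast to shape (len(b), len(a), n_features), then require all features to match
    matches = (a[None, :, :] == b[:, None, :]).all(axis=-1)
    return np.nonzero(matches)

train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
test = np.array([[3.0, 9.0], [5.0, 6.0]])

test_idx, train_idx = overlapping_rows(train, test)
print(test_idx, train_idx)
```

The intermediate boolean array has `len(b) * len(a) * n_features` entries, which is fine for a dataset of this size but may be too large for very big splits; in that case a set-based or chunked comparison is probably a better fit.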