ValueError: n_splits=10 cannot be greater than the number of members in each class
I am trying to run the following code:
from sklearn.model_selection import StratifiedKFold
X = ["hey", "join now", "hello", "join today", "join us now", "not today", "join this trial", " hey hey", " no", "hola", "bye", "join today", "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "r", "n", "n", "n", "r"]
skf = StratifiedKFold(n_splits=10)
for train, test in skf.split(X,y):
    print("%s %s" % (train, test))
But I get the following error:
ValueError: n_splits=10 cannot be greater than the number of members in each class.
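I have looked at "scikit-learn error: The least populated class in y has only 1 member", but I am still not sure what is wrong with my code.
Both of my lists have length 14, which I checked with print(len(X)) and print(len(y)).
Part of my confusion is that I am not sure what "members" means in this context, or what "class" means.
Question: How do I fix the error? What is a member? What is a class? (in this context)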
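Stratification means preserving the proportion of each class in every fold. So if your original dataset has 3 classes in proportions of 60%, 20%, and 20%, stratification will try to keep those proportions in each fold.
In your case,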
X = ["hey", "join now", "hello", "join today", "join us now", "not today",
"join this trial", " hey hey", " no", "hola", "bye", "join today",
"no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]
You have 14 samples (members) in total, distributed as follows:
class   number of members   percentage (rounded)
'n'     9                   64%
'r'     3                   21%
'y'     2                   14%
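The counts in the table can be checked with a few lines of standard-library Python. This is only a sketch; it uses the label list exactly as written in this answer, and the variable names `counts` and `total` are chosen here for illustration.

```python
from collections import Counter

# Labels as listed in this answer (three classes: 'n', 'r', 'y').
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

counts = Counter(y)   # number of members (samples) per class
total = len(y)        # total number of samples

for label, count in sorted(counts.items()):
    # Share of the dataset that belongs to this class, in percent.
    print("%r %d %.1f%%" % (label, count, 100.0 * count / total))
```

Here each distinct value in `y` is a class, and each sample carrying that value is a member of the class.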
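So StratifiedKFold will try to keep those proportions in each fold. You specified 10 folds (n_splits). That means that, to preserve the proportion of class 'y', each fold would need 2 / 10 = 0.2 members of that class. But a fold cannot be given less than 1 member (sample), which is why the error is raised there.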
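The arithmetic behind that argument can be written out as follows. It is a rough illustration of the reasoning in this answer, not a copy of scikit-learn's internal check; `per_fold` is a name made up for this sketch.

```python
from collections import Counter

# Labels as listed in this answer.
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]
n_splits = 10

counts = Counter(y)

# Members of each class that one fold would need in order to
# mirror the class proportions of the whole dataset.
per_fold = {label: count / n_splits for label, count in counts.items()}

for label, share in sorted(per_fold.items()):
    print("%r needs %.1f members per fold" % (label, share))
```

Every value comes out below 1, so no class can supply a whole sample to each of the 10 folds.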
If you set n_splits=2 instead of n_splits=10, it will work, because each fold then gets 2 / 2 = 1 member of 'y'. For n_splits=10 to work properly, each class needs at least 10 samples.
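One possible way to apply this is to cap n_splits at the size of the smallest class before building the splitter. The capping rule and the names `requested_splits` and `smallest_class` are assumptions made for this sketch, not something scikit-learn does for you; the data is the version shown in this answer.

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today",
     "no", "join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

# Assumption for this sketch: never ask for more folds than the
# smallest class has members.
requested_splits = 10
smallest_class = min(Counter(y).values())
n_splits = min(requested_splits, smallest_class)

skf = StratifiedKFold(n_splits=n_splits)
folds = list(skf.split(X, y))
for train, test in folds:
    # train and test are arrays of sample indices into X and y.
    print("%s %s" % (train, test))
```

With this data the cap yields n_splits=2. Note that StratifiedKFold needs at least 2 splits, so a class with a single member would still have to be handled separately, for example by collecting more samples for it.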