sklearn stratified sampling based on a column
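I have a fairly large CSV file of Amazon review data that I read into a pandas DataFrame. I want to split the data 80-20 (train-test), and I want the split to represent the values of one column (Categories) proportionally, i.e. every review category should be present in both the train and the test data in the same proportions.
The data looks like this: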
**ReviewerID**  **ReviewText**   **Categories**  **ProductId**
1212            good product     Mobile          14444425
1233            will buy again   drugs           324532
5432            not recomended   dvd             789654123
I am using the following code to do this:
import pandas as pd
Meta = pd.read_csv('C:\Users\xyz\Desktop\WM Project\Joined.csv')
import numpy as np
from sklearn.cross_validation import train_test_split
train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)
It gives the following error:
NameError: name 'y' is not defined
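Since I'm relatively new to Python, I can't figure out what I'm doing wrong, or whether this code would even stratify by the Categories column. It seems to work fine when I remove the stratify option and the categories column from the train-test split.
Any help would be appreciated.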
sklearn.model_selection.train_test_split
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the class labels.
Going by the API documentation, I think you should try:
X_train, X_test, y_train, y_test = train_test_split(Meta_X, Meta_Y, test_size=0.2, stratify=Meta_Y)
Meta_X and Meta_Y have to be assigned properly by you (I think Meta_Y should be Meta.categories, based on your code).
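To make that suggestion concrete, here is a minimal sketch on a small made-up frame. Only the column names come from the question; the rows, the 10-per-category layout, and the variable name meta are invented for illustration. It assumes the column is spelled Categories, as in the sample data, and it passes the whole frame so that all columns stay together in train and test.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review data: 10 rows per category.
meta = pd.DataFrame({
    "ReviewerID": range(30),
    "ReviewText": ["some review text"] * 30,
    "Categories": ["Mobile", "drugs", "dvd"] * 10,
    "ProductId": range(1000, 1030),
})

# Split the whole frame 80/20, stratifying on the Categories column.
# stratify must be given the actual label values, not an undefined name.
train, test = train_test_split(
    meta, test_size=0.2, random_state=42, stratify=meta["Categories"]
)

print(len(train), len(test))
print(test["Categories"].value_counts())
```

The NameError in the question comes from stratify=y referring to a variable that was never created; pointing stratify at the column itself, as above, avoids that.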
>>> import pandas as pd
>>> # raw string, so the backslashes in the Windows path are not read as escapes
>>> Meta = pd.read_csv(r'C:\Users\*****\Downloads\so\Book1.csv')
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> y = Meta.pop('Categories')
>>> Meta
   ReviewerID      ReviewText  ProductId
0        1212    good product   14444425
1        1233  will buy again     324532
2        5432  not recomended  789654123
>>> y
0    Mobile
1     drugs
2       dvd
Name: Categories, dtype: object
>>> X = Meta
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
>>> X_test
   ReviewerID    ReviewText  ProductId
0        1212  good product   14444425
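A quick way to check that a split really is stratified is to compare the class shares in each part. The sketch below does that on made-up labels (the 60/30/10 mix and all values are assumptions, not the asker's data). It also tries the three-row sample from this thread: as far as I know, recent scikit-learn versions require at least two rows per class for stratification and raise a ValueError otherwise, so the session above may not reproduce as shown on a current install.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% Mobile, 30% drugs, 10% dvd.
y = pd.Series(["Mobile"] * 60 + ["drugs"] * 30 + ["dvd"] * 10, name="Categories")
X = pd.DataFrame({"ReviewerID": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class shares in each part should match the shares in the full data.
train_share = y_train.value_counts(normalize=True).sort_index()
test_share = y_test.value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)

# With a single row per category there is nothing to stratify on;
# the error (if any) is captured instead of crashing the script.
tiny_X = pd.DataFrame({"ReviewerID": [1212, 1233, 5432]})
tiny_y = pd.Series(["Mobile", "drugs", "dvd"], name="Categories")
try:
    train_test_split(tiny_X, tiny_y, test_size=0.33, random_state=42, stratify=tiny_y)
    tiny_split_error = None
except ValueError as exc:
    tiny_split_error = str(exc)
print(tiny_split_error)
```

On real data, rare categories with a single review would need to be merged or dropped before a stratified split can work.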
I'm not sure why nobody has mentioned StratifiedShuffleSplit:
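from sklearn.model_selection import StratifiedShuffleSplit

# df is the DataFrame holding the reviews
split = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
# each iteration overwrites the previous split, so only the last one is kept
for train_index, test_index in split.split(df, df['Categories']):
    # split() yields positional indices, so .iloc is the safe accessor
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]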
For details, see the StratifiedShuffleSplit documentation.
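For a single 80-20 split, a self-contained variant might look like the sketch below. The frame, its contents, and the index starting at 100 are all made up; the odd index is there only to show why positional selection matters. It uses n_splits=1, since only one train/test pair is needed here.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical frame with a non-default index (labels 100..119).
df = pd.DataFrame(
    {
        "ReviewText": ["some review text"] * 20,
        "Categories": ["Mobile"] * 10 + ["dvd"] * 10,
    },
    index=range(100, 120),
)

# One stratified 80/20 split; n_splits > 1 would yield several resamples.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_pos, test_pos = next(splitter.split(df, df["Categories"]))

# split() returns positions (0..19), not index labels, so use .iloc.
# df.loc[train_pos] would fail here because labels 0..19 do not exist.
strat_train_set = df.iloc[train_pos]
strat_test_set = df.iloc[test_pos]

print(strat_train_set["Categories"].value_counts())
print(strat_test_set["Categories"].value_counts())
```

For a one-off split this gives the same kind of result as train_test_split with stratify; StratifiedShuffleSplit is mainly worth it when several resampled splits are wanted.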