train_test_split not splitting data
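I have a dataframe with 14 columns; the last column is the target label, an integer equal to 0 or 1.
I defined:
X = df.iloc[:,1:13]
---- this holds the feature values
y = df.iloc[:,-1]
------ this holds the corresponding labels
Both have the same length. X is a 13-column dataframe with shape (159880, 13), and y is array-like with shape (159880,).
But when I run train_test_split() on X and y, the function does not work as expected.
Here is the simple code:
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state = 0)
After the split, X_train and X_test both have shape (119910,13). y_train has shape (39970,13) and y_test has shape (39970,).
This is strange: even when I set the test_size parameter, the result stays the same.
Please advise on what might be going wrong.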
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from adspy_shared_utilities import plot_feature_importances
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
def model():
    df = pd.read_csv('train.csv', encoding = 'ISO-8859-1')
    df = df[np.isfinite(df['compliance'])]
    df = df.fillna(0)
    df['compliance'] = df['compliance'].astype('int')
    df = df.drop(['grafitti_status', 'violation_street_number','violation_street_name','violator_name',
                  'inspector_name','mailing_address_str_name','mailing_address_str_number','payment_status',
                  'compliance_detail', 'collection_status','payment_date','disposition','violation_description',
                  'hearing_date','ticket_issued_date','mailing_address_str_name','city','state','country',
                  'violation_street_name','agency_name','violation_code'], axis=1)
    df['violation_zip_code'] = df['violation_zip_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['zip_code'] = df['zip_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['non_us_str_code'] = df['non_us_str_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['violation_zip_code'] = pd.to_numeric(df['violation_zip_code'], errors='coerce')
    df['zip_code'] = pd.to_numeric(df['zip_code'], errors='coerce')
    df['non_us_str_code'] = pd.to_numeric(df['non_us_str_code'], errors='coerce')
    #df.violation_zip_code = df.violation_zip_code.replace('-','', inplace=True)
    df['violation_zip_code'] = np.nan_to_num(df['violation_zip_code'])
    df['zip_code'] = np.nan_to_num(df['zip_code'])
    df['non_us_str_code'] = np.nan_to_num(df['non_us_str_code'])
    X = df.iloc[:,0:13]
    y = df.iloc[:,-1]
    X_train, y_train, X_test, y_test = train_test_split(X, y, random_state = 0)
    print(y_train.shape)
You mixed up the order of the values returned by train_test_split. It should be:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)
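To make the ordering concrete, here is a small sketch on synthetic data. The 1000-row frame, the random values, and the variable names a, b, c, d are assumptions for illustration only, not the asker's dataset. It shows that train_test_split returns the train/test pair for each input in turn, so unpacking in the order used in the question puts the test features into y_train.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumed sizes): 1000 rows,
# 13 feature columns, and a binary label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 13)))
y = pd.Series(rng.integers(0, 2, size=1000))

# The return order is X_train, X_test, y_train, y_test.
# With the default test_size (0.25), 250 rows go to the test set.
a, b, c, d = train_test_split(X, y, random_state=0)
print(b.shape)  # second value is X_test, not y_train
print(c.shape)  # third value is y_train, not X_test

# Correct unpacking, here with an explicit 20% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With the unpacking from the question, the name y_train ends up bound to the test features, which would explain a two-dimensional shape such as (39970, 13) for something expected to be a label vector.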
if args.mode == "train":
    # Load Data
    data, labels = load_dataset('C:/Users/PC/Desktop/train/k')
    # Train ML models
    knn(data, labels,'C:/Users/PC/Desktop/train/knn.pkl' )