Divide dataframe into two sets according to a column

Question

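I have a DataFrame df from which I selected some columns, and I want to split them into xtrain and xtest according to a column called Service: rows where Service is 1 or 0 go into xtrain, and rows where it is NaN go into xtest.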

Service
1
0
0
1
NaN
NaN

xtrain = df.loc[df['Service'].notnull(), ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]

Edit:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    ytrain = df['Service'].dropna()
    xtest = df.loc[df['Service'].isnull(), ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
    logistic = LogisticRegression()
    logistic.fit(xtrain, ytrain)
    logistic.predict(xtest)

I get the following error from logistic.predict(xtest):

X has 220 features per sample; expecting 307

Answer 1

I think you need isnull:

xtest = df.loc[df['Service'].isnull(), ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]

Another solution is to invert the boolean mask with ~:

mask = df['Service'].notnull()
xtrain = df.loc[mask, ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
xtest = df.loc[~mask, ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
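As a minimal, self-contained sketch of the mask approach: the toy values below are made up for illustration (only the column names come from the question), and the `features` list is just a helper name introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    'Service': [1, 0, 0, 1, np.nan, np.nan],
    'Age': [22, 38, 26, 35, 54, 2],
    'Fare': [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
})
features = ['Age', 'Fare']  # illustrative subset of the feature columns

# True where Service is known (1 or 0), False where it is NaN.
mask = df['Service'].notnull()

xtrain = df.loc[mask, features]    # rows with a known label
ytrain = df.loc[mask, 'Service']   # the labels for those same rows
xtest = df.loc[~mask, features]    # rows whose label is missing

# Every row lands in exactly one of the two sets.
assert len(xtrain) + len(xtest) == len(df)
print(xtrain.shape, xtest.shape)
```

Taking ytrain through the same mask, rather than a separate dropna() call, keeps its index aligned with xtrain by construction.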

Edit:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Service': [1, 0, np.nan, np.nan],
                   'Age': [4, 5, 6, 5],
                   'Fare': [7, 8, 9, 5],
                   'GSize': [1, 3, 5, 7],
                   'Deck': [5, 3, 6, 2],
                   'Class': [7, 4, 3, 0],
                   'Profession_title': [6, 7, 4, 6]})

print (df)
   Age  Class  Deck  Fare  GSize  Profession_title  Service
0    4      7     5     7      1                 6      1.0
1    5      4     3     8      3                 7      0.0
2    6      3     6     9      5                 4      NaN
3    5      0     2     5      7                 6      NaN

import pandas as pd
from sklearn.linear_model import LogisticRegression

ytrain = df['Service'].dropna()
xtrain = df.loc[df['Service'].notnull(), ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
xtest = df.loc[df['Service'].isnull(), ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
logistic = LogisticRegression()
logistic.fit(xtrain, ytrain)
print(logistic.predict(xtest))
[ 0.  0.]
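The error in the question (X has 220 features per sample; expecting 307) means xtest had fewer columns than the xtrain used for fitting. The question does not show how that happened. One common cause, which is only an assumption here, is one-hot encoding the two sets separately with pd.get_dummies, so each set only gets columns for the categories it happens to contain. A rough sketch with made-up data, encoding before the split so both sets share the same columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Deck' is treated as categorical for illustration.
df = pd.DataFrame({
    'Service': [1, 0, 1, np.nan, np.nan],
    'Deck': ['A', 'B', 'C', 'A', 'A'],
    'Age': [22, 38, 26, 35, 54],
})
mask = df['Service'].notnull()

# Encoding each set on its own: the test rows only contain deck 'A',
# so they end up with fewer dummy columns than the training rows.
bad_train = pd.get_dummies(df.loc[mask, ['Deck', 'Age']])
bad_test = pd.get_dummies(df.loc[~mask, ['Deck', 'Age']])
print(bad_train.shape[1], bad_test.shape[1])

# Encoding the whole frame first, then splitting with the mask,
# gives both sets identical columns.
encoded = pd.get_dummies(df[['Deck', 'Age']])
xtrain = encoded.loc[mask]
xtest = encoded.loc[~mask]
ytrain = df.loc[mask, 'Service']
assert list(xtrain.columns) == list(xtest.columns)
```

If the mismatch in the real data has a different origin, comparing xtrain.columns with xtest.columns should still show which columns are missing.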

Tags: python, pandas, logistic-regression, sklearn-pandas