Extract Optimal Features from Recursive Feature Elimination (RFE)
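I have a dataset consisting of categorical and numerical data with 124 features. To reduce its dimensionality, I want to remove irrelevant features. However, to run the dataset against a feature selection algorithm, I one-hot encoded it with get_dummies, which increased the number of features to 391.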
In[16]:
X_train.columns
Out[16]:
Index([u'port_7', u'port_9', u'port_13', u'port_17', u'port_19', u'port_21',
...
u'os_cpes.1_2', u'os_cpes.1_1'], dtype='object', length=391)
From here, with the resulting data, I can run recursive feature elimination with cross-validation.
Which produces:
Cross Validated Score vs Features Graph
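Given that the optimal number of features identified is 8, how do I identify the feature names? I assume I can extract them into a new DataFrame for use in a classification algorithm?
[Edit]
With some outside help, I achieved this as follows: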
import numpy as np

def column_index(df, query_cols):
    # Return the positional index of each queried column name
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

feature_index = []
features = []
column_index(X_dev_train, X_dev_train.columns.values)

# Collect the positions flagged as selected in the RFECV support mask
for num, i in enumerate(rfecv.get_support(), start=0):
    if i:
        feature_index.append(str(num))

# Map the selected positions back to column names
for num, i in enumerate(X_dev_train.columns.values, start=0):
    if str(num) in feature_index:
        features.append(X_dev_train.columns.values[num])

print("Features Selected: {}\n".format(len(feature_index)))
print("Features Indexes: \n{}\n".format(feature_index))
print("Feature Names: \n{}".format(features))
Which produces:
Features Selected: 8
Features Indexes:
['5', '6', '20', '26', '27', '28', '67', '98']
Feature Names:
['port_21', 'port_22', 'port_199', 'port_512', 'port_513', 'port_514', 'port_3306', 'port_32768']
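As a side note, the two loops above can usually be collapsed by indexing the column labels directly with the boolean support mask. The following is only a minimal sketch: the `columns` and `support` values are hypothetical stand-ins for `X_dev_train.columns` and `rfecv.get_support()`, not data from the original run.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_dev_train.columns and rfecv.get_support()
columns = pd.Index(["port_7", "port_9", "port_21", "port_22"])
support = np.array([False, False, True, True])

# Boolean indexing keeps only the labels where the mask is True
selected = columns[support].tolist()

# Positions of the selected features, equivalent to the feature_index list
indexes = np.flatnonzero(support).tolist()

print("Feature Names: {}".format(selected))
print("Features Indexes: {}".format(indexes))
```

With a fitted selector, `X_dev_train.loc[:, rfecv.get_support()]` should give the reduced DataFrame in one step, assuming the mask length matches the number of columns.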
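Given that one-hot encoding introduces multicollinearity, I don't think the target column selection is ideal, because the features it chose are the non-encoded continuous data features. I tried re-adding the target column unencoded, but RFE throws the following error because the data is categorical: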
ValueError: could not convert string to float: Wireless Access Point
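Do I need to group multiple one-hot encoded feature columns together to act as the target?
[Edit 2]
If I simply LabelEncode the target column, I can use this target as 'y' (see example again). However, the output identifies only a single feature (the target column) as optimal. I think this might be because of the one-hot encoding. Should I be looking at producing a dense array, and if so, can it be run against RFE?
Thanks,
Adam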
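Answering my own question: I found that the issue was related to the way I had one-hot encoded the data. Initially, I ran the one-hot encoding against all the categorical columns as follows: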
ohe_df = pd.get_dummies(df[df.columns]) # One-hot encode all columns
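This introduced a large number of additional features. Taking a different approach, with some outside help, I modified the encoding to encode multiple columns on a per-column/feature basis as follows: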
cf_df = df.select_dtypes(include=[object])  # Get categorical features
nf_df = df.select_dtypes(exclude=[object])  # Get numerical features
ohe_df = nf_df.copy()
for feature in cf_df:
    # Read from cf_df (ohe_df holds only the numerical columns) and store
    # each row's one-hot vector as a list in a single column
    ohe_df[feature] = cf_df.loc[:, feature].str.get_dummies().values.tolist()
Producing:
ohe_df.head(2) # Only showing a subset of the data
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
| | os_name | os_family | os_type | os_vendor | os_cpes.0 |
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
| 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 0, 0, 0] | [1, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... |
| 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 1, 0] | [0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
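Unfortunately, although this was what I was looking for, it did not execute against RFECV. Next I thought perhaps I could take a slice of all the new features and pass them in as the target, but this resulted in an error. Finally, I realised that I would have to iterate through all the target values and take the top outputs from each. The code ended up looking something like this: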
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

for num, feature in enumerate(features, start=0):
    X = X_dev_train
    y = X_dev_train[feature]

    # Create the RFE object and compute a cross-validated score.
    svc = SVC(kernel="linear")
    # The "accuracy" scoring is proportional to the number of correct classifications
    # step is the number of features to remove at each iteration
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(kfold), scoring='accuracy')

    try:
        rfecv.fit(X, y)

        print("Number of observations in each fold: {}".format(len(X)/kfold))
        print("Optimal number of features : {}".format(rfecv.n_features_))

        # grid_scores_ was removed in scikit-learn 1.2; newer releases expose
        # the same values as rfecv.cv_results_['mean_test_score']
        g_scores = rfecv.grid_scores_
        indices = np.argsort(g_scores)[::-1]

        print('Printing RFECV results:')
        for num2, f in enumerate(range(X.shape[1]), start=0):
            if g_scores[indices[f]] > 0.80:
                if num2 < 10:
                    print("{}. Number of features: {} Grid_Score: {:0.3f}".format(f + 1, indices[f]+1, g_scores[indices[f]]))

        print("\nTop features sorted by rank:")
        results = sorted(zip(map(lambda x: round(x, 4), rfecv.ranking_), X.columns.values))
        for num3, i in enumerate(results, start=0):
            if num3 < 10:
                print(i)

        # Plot number of features VS. cross-validation scores
        plt.rc("figure", figsize=(8, 5))
        plt.figure()
        plt.xlabel("Number of features selected")
        plt.ylabel("CV score (of correct classifications)")
        plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
        plt.show()
    except ValueError:
        # Skip targets that RFECV cannot be fitted against
        pass
I'm sure this could be cleaner, and perhaps even plotted in a single graph, but it works for me.
Cheers,
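One possible cleanup, offered only as a sketch and not what I actually ran: keep the encoding flat (so the selector receives plain numeric columns) and recover the per-feature grouping from the column-name prefix that `pd.get_dummies` adds. The tiny DataFrame and the `cat_cols` list below are made up for illustration.

```python
import pandas as pd

# Made-up sample data: two categorical features and one numerical feature
df = pd.DataFrame({
    "os_family": ["Linux", "Windows", "Linux"],
    "os_type": ["general purpose", "WAP", "general purpose"],
    "port_22": [1, 0, 1],
})
cat_cols = ["os_family", "os_type"]  # assumed list of categorical columns

# Encode only the categorical columns; each dummy column is named
# "<original column>_<category>"
encoded = pd.get_dummies(df, columns=cat_cols)

# Group the encoded columns by the original feature they came from
groups = {
    col: [c for c in encoded.columns if c.startswith(col + "_")]
    for col in cat_cols
}

print(encoded.columns.tolist())
print(groups)
```

Prefix matching like this assumes that no original column name is itself a prefix of another one followed by an underscore.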
You can do this:
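`
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)  # keep the 5 top-ranked features
rfe = rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # rank 1 marks a selected feature
f = rfe.get_support(indices=True)  # integer indices of the most important features
X = df[df.columns[f]]  # final features
`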
You can then use X as the input to a neural network or any other algorithm.
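For anyone who wants to try this end to end, here is a self-contained sketch on synthetic data. The dataset from `make_classification`, the `feat_*` column names, and the `max_iter` setting are all assumptions made for the example; nothing here comes from the dataset in the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 rows, 10 numeric features, binary target
X_arr, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
df = pd.DataFrame(X_arr, columns=["feat_{}".format(i) for i in range(10)])

# Recursively drop the weakest feature until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(df, y)

# Translate the boolean support mask into column names
selected_names = df.columns[rfe.get_support()].tolist()

# Reduced DataFrame that can be passed to any downstream model
X_selected = df[selected_names]

print(selected_names)
print(X_selected.shape)
```

Which five columns end up selected depends on the data and the estimator, so the printed names should not be read as meaningful.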