Removing features with low variance using scikit-learn
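scikit-learn provides several ways to remove descriptors (features). The tutorial below covers the basic approach,
http://scikit-learn.org/stable/modules/feature_selection.html
but it does not show any way to obtain the list of features that were removed or kept.
The following code is taken from the tutorial.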
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
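The example above ends up with only two descriptors, shape (6, 2), but in my case I have a huge data frame with shape (51 rows, 9000 columns). Once I have found a suitable model, I want to keep track of the useful and useless features, because when computing the features for the test data set I can save computation time by calculating only the useful ones.
For example, when you do machine learning modelling with WEKA 6.0, it offers remarkable flexibility in feature selection: after removing the useless features, you can get the list of discarded features along with the useful ones.
Thanks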
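Well, if I'm not mistaken, what you can do is this:
With VarianceThreshold, you can call the method fit instead of fit_transform. This fits the data, and the resulting variances are stored in vt.variances_ (assuming vt is your object).
Given a threshold, you can then extract the transformed features just as fit_transform would (X must be a NumPy array here):
X[:, vt.variances_ > threshold]
or get the indices as:
idx = np.where(vt.variances_ > threshold)[0]
or as a mask:
mask = vt.variances_ > threshold
PS: the default threshold is 0.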
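To make the three variants above concrete, here is a minimal sketch on the tutorial's own X. The names threshold, X_reduced and removed_idx are mine, not part of the answer, and X is wrapped in np.array so that boolean column indexing works.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same toy data as in the question, as a NumPy array
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = 0.8 * (1 - 0.8)  # variance of a Bernoulli variable with p = 0.8

# fit only: nothing is transformed yet, but the variances are stored
vt = VarianceThreshold(threshold=threshold)
vt.fit(X)

# per-column variances: 5/36, 2/9 and 1/4
print(vt.variances_)

# boolean mask of the kept columns (only column 0 is below the threshold)
mask = vt.variances_ > threshold
# integer positions of the kept and of the removed columns
idx = np.where(mask)[0]
removed_idx = np.where(~mask)[0]
# same result as vt.transform(X)
X_reduced = X[:, mask]
```

With removed_idx in hand you also know which descriptors you never need to compute again.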
Edit:
A more straightforward way is to use the get_support method of the class VarianceThreshold. From the documentation:
get_support([indices]) Get a mask, or integer index, of the features selected
You should call this method after fit or fit_transform.
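A small sketch of how get_support can give both lists the question asks for. The data frame and its column names are made up for illustration; only VarianceThreshold and get_support come from scikit-learn.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: "const" never changes, "rare" is almost constant
df = pd.DataFrame({
    "const": [1, 1, 1, 1, 1, 1],
    "rare":  [0, 0, 1, 0, 0, 0],
    "mixed": [0, 1, 0, 1, 1, 1],
    "even":  [1, 0, 0, 1, 0, 1],
})

vt = VarianceThreshold(threshold=0.8 * (1 - 0.8))
vt.fit(df)

# boolean mask, one entry per input column (True = kept)
mask = vt.get_support()
kept = df.columns[mask].tolist()
removed = df.columns[~mask].tolist()

# the same information as integer positions
kept_idx = vt.get_support(indices=True)

print("kept:", kept)
print("removed:", removed)
```

Inverting the mask is what gives you the discarded features, since get_support itself only reports the selected ones.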
from sklearn.feature_selection import VarianceThreshold

# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return
def get_low_variance_columns(dframe=None, columns=None,
                             skip_columns=None, thresh=0.0,
                             autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.

    Returns the (optionally reduced) dataframe and the list of
    low-variance column names. `columns` is accepted but not used.
    """
    print("Finding low-variance features.")
    # treat a missing `skip_columns` as "skip nothing"
    skip_columns = [] if skip_columns is None else list(skip_columns)
    # define the result up front so the final return cannot fail
    removed_features = []
    try:
        # get list of all the original df columns
        all_columns = dframe.columns
        # remove `skip_columns`
        remaining_columns = all_columns.drop(skip_columns)
        # get dataframe values
        X = dframe.loc[:, remaining_columns].values
        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)
        # fit vt to data
        vt.fit(X)
        # get the mask of the features that are being kept
        keep_mask = vt.get_support()
        # get the columns to be removed
        removed_features = list(remaining_columns[~keep_mask])
        print("Found {0} low-variance columns."
              .format(len(removed_features)))
        # remove the columns
        if autoremove:
            print("Removing low-variance features.")
            # drop the low-variance columns; `skip_columns` and the kept
            # columns stay where they were, so nothing has to be re-inserted
            dframe = dframe.drop(columns=removed_features)
            print("Successfully removed low-variance columns.")
        # do not remove columns
        else:
            print("No changes have been made to the dataframe.")
    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")
    return dframe, removed_features
If you want to see exactly which columns are kept after thresholding, this worked for me (data is a pandas data frame):
from sklearn.feature_selection import VarianceThreshold
threshold_n=0.95
sel = VarianceThreshold(threshold=(threshold_n* (1 - threshold_n) ))
sel_var=sel.fit_transform(data)
data[data.columns[sel.get_support(indices=True)]]
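Since the question is really about not computing useless descriptors for the test set, here is a rough sketch of carrying the selection over to new data. The train and test frames and the names kept_columns and dropped_columns are invented for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical training and test frames with the same descriptors
train = pd.DataFrame({"a": [0, 0, 0, 0],
                      "b": [0, 1, 0, 1],
                      "c": [1, 2, 3, 4]})
test = pd.DataFrame({"a": [0, 1],
                     "b": [1, 1],
                     "c": [5, 6]})

# fit on the training data only (default threshold: drop constant columns)
sel = VarianceThreshold(threshold=0.0)
sel.fit(train)

# remember the column names, not just the transformed array
support = sel.get_support()
kept_columns = train.columns[support].tolist()
dropped_columns = train.columns[~support].tolist()

# reuse the stored names on the test data; in practice you would
# only compute the descriptors listed in kept_columns
test_reduced = test[kept_columns]
```

Storing kept_columns next to the model (for example in a text or JSON file) is enough to reproduce the selection later.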
While testing features, I wrote this simple function, which tells me which variables remain in the data frame after applying VarianceThreshold.
from sklearn.feature_selection import VarianceThreshold
from itertools import compress

def fs_variance(df, threshold: float = 0.1):
    """
    Return a list of selected variables based on the threshold.
    """
    # The list of columns in the data frame
    features = list(df.columns)

    # Initialize and fit the method
    vt = VarianceThreshold(threshold=threshold)
    _ = vt.fit(df)

    # Get the column names which pass the threshold
    feat_select = list(compress(features, vt.get_support()))

    return feat_select
which returns a list of the selected column names, for example: ['col_2','col_14', 'col_17'].