Generate vectors of a dataset based on a feature list, in Python
I need to generate a vector for each sample in a dataset, based on the full set of features in the dataset.
# Assume the dataset has 6 features
features = ['a', 'b', 'c', 'd', 'e', 'f']
# Examples:
s1 = ['a', 'b', 'c']
# For s1, I want to generate a vector to represent features
r1 = [1, 1, 1, 0, 0, 0]
s2 = ['a', 'c', 'f']
# For s2 then the vector should be
r2 = [1, 0, 1, 0, 0, 1]
Is there a Python library that does this? If not, how should I accomplish it?
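This is probably not the most optimized approach, but if you want a vector for every possible sample of your dataset, you just need to create a binary array for every number from 0 to 2^6 - 1: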
features = ['a', 'b', 'c', 'd', 'e', 'f']
l = len(features)
vectors = [[int(y) for y in f'{x:0{l}b}'] for x in range(2 ** l)]
print(vectors)
This is quite simple and straightforward, and it needs no library.
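To make it easier to see which features each enumerated vector stands for, the vectors can be mapped back to feature subsets. This is only a minimal sketch assuming the same `features` list; `vector_to_sample` is a hypothetical helper name that does not appear in the answer.

```python
from itertools import compress

features = ['a', 'b', 'c', 'd', 'e', 'f']
l = len(features)

# Every possible 0/1 vector of length l, built the same way as in the answer
vectors = [[int(y) for y in f'{x:0{l}b}'] for x in range(2 ** l)]

def vector_to_sample(vector, features):
    # Hypothetical helper: keep the features whose position holds a 1
    return list(compress(features, vector))

print(len(vectors))                                     # 64
print(vector_to_sample([1, 0, 1, 0, 0, 1], features))   # ['a', 'c', 'f']
```

Note that this enumerates all 64 possible feature combinations; it does not encode specific samples such as s1 and s2 from the question.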
Pure Python solution
features = ['a', 'b', 'c', 'd', 'e', 'f']
# Map each feature name to its position in the vector
features_lookup = {feature: index for index, feature in enumerate(features)}
s1 = ['a', 'b', 'c']
s2 = ['a', 'c', 'f']
def create_feature_vector(sample, lookup):
    vec = [0] * len(lookup)
    for value in sample:
        vec[lookup[value]] = 1
    return vec
Output:
>>> create_feature_vector(s1, features_lookup)
[1, 1, 1, 0, 0, 0]
>>> create_feature_vector(s2, features_lookup)
[1, 0, 1, 0, 0, 1]
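If there are many samples, the same lookup idea can be applied to a whole list at once. The following is a sketch under assumptions, not part of the answer: `create_feature_vectors` is a hypothetical name, and raising `ValueError` on an unknown feature is one possible policy (silently ignoring it would be another).

```python
def create_feature_vectors(samples, features):
    # Map each feature name to its column index
    lookup = {feature: i for i, feature in enumerate(features)}
    vectors = []
    for sample in samples:
        vec = [0] * len(features)
        for value in sample:
            if value not in lookup:
                # Assumption: an unknown feature is treated as an error
                raise ValueError(f"unknown feature: {value!r}")
            vec[lookup[value]] = 1
        vectors.append(vec)
    return vectors

features = ['a', 'b', 'c', 'd', 'e', 'f']
print(create_feature_vectors([['a', 'b', 'c'], ['a', 'c', 'f']], features))
# [[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 1]]
```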
Numpy alternative for a single feature vector
If you happen to be using numpy already, this will be much more efficient when your feature set is large:
import numpy as np
features = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
sample_size = 3
def feature_sample_and_vector(sample_size, features):
    n = features.size
    sample_indices = np.random.choice(range(n), sample_size, replace=False)
    sample = features[sample_indices]
    vector = np.zeros(n, dtype="uint8")
    vector[sample_indices] = 1
    return sample, vector
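The function above draws a random sample and builds its vector. For a sample that already exists, such as s2 from the question, one option is `np.isin`, which tests each feature for membership in the sample. This is a small sketch rather than the answer's own method, and it assumes the sample contains only known features.

```python
import numpy as np

features = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
s2 = ['a', 'c', 'f']

# np.isin marks, for each feature, whether it occurs in the sample
r2 = np.isin(features, s2).astype("uint8")
print(r2)  # [1 0 1 0 0 1]
```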
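Numpy alternative for a large number of samples and their feature vectors
Using numpy lets us scale well to large feature sets and/or large sample sets. Note that this approach can produce duplicate samples: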
import random
import numpy as np
# Assumes features is already a numpy array
def generate_samples(features, num_samples, sample_size):
    n = features.size
    vectors = np.zeros((num_samples, n), dtype="uint8")
    idxs = [random.sample(range(n), k=sample_size) for _ in range(num_samples)]
    cols = np.sort(np.array(idxs), axis=1)  # You can remove the sort if having the features in order isn't important
    rows = np.repeat(np.arange(num_samples).reshape(-1, 1), sample_size, axis=1)
    vectors[rows, cols] = 1
    samples = features[cols]
    return samples, vectors
Demo:
>>> generate_samples(features, 10, 3)
(array([['d', 'e', 'f'],
        ['a', 'b', 'c'],
        ['c', 'd', 'e'],
        ['c', 'd', 'f'],
        ['a', 'b', 'f'],
        ['a', 'e', 'f'],
        ['c', 'd', 'f'],
        ['b', 'e', 'f'],
        ['b', 'd', 'f'],
        ['a', 'c', 'e']], dtype='<U1'),
 array([[0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1, 0],
        [0, 0, 1, 1, 0, 1],
        [1, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 1, 1, 0, 1],
        [0, 1, 0, 0, 1, 1],
        [0, 1, 0, 1, 0, 1],
        [1, 0, 1, 0, 1, 0]], dtype=uint8))
A very simple timing benchmark for 100,000 samples of size 12 from a feature set of 26 features:
In [2]: features = np.array(list("abcdefghijklmnopqrstuvwxyz"))
In [3]: num_samples = 100000
In [4]: sample_size = 12
In [5]: %timeit generate_samples(features, num_samples, sample_size)
645 ms ± 9.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
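The only real bottleneck is the list comprehension needed to generate the indices. Unfortunately, there is no 2-dimensional variant of np.random.choice() for generating samples without replacement, so you still have to resort to a relatively slow method for generating the random sample indices.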
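One commonly used workaround for that bottleneck is to argsort a matrix of random numbers: each row's argsort is a random permutation of the column indices, so its first `sample_size` entries form a sample without replacement. The sketch below is an assumption-laden variant and not the answer's method; `generate_samples_vectorized` and its `seed` parameter are hypothetical names. It sorts all n columns per row, so whether it actually beats the list comprehension depends on n and should be measured.

```python
import numpy as np

def generate_samples_vectorized(features, num_samples, sample_size, seed=None):
    rng = np.random.default_rng(seed)
    n = features.size
    # Argsorting each row of random numbers gives a random permutation of range(n);
    # the first sample_size entries are a sample without replacement for that row
    idxs = rng.random((num_samples, n)).argsort(axis=1)[:, :sample_size]
    cols = np.sort(idxs, axis=1)  # keep features in order, as in the original
    rows = np.arange(num_samples).reshape(-1, 1)  # broadcasts against cols
    vectors = np.zeros((num_samples, n), dtype="uint8")
    vectors[rows, cols] = 1
    return features[cols], vectors

features = np.array(list("abcdef"))
samples, vectors = generate_samples_vectorized(features, 10, 3, seed=0)
print(samples.shape, vectors.shape)  # (10, 3) (10, 6)
```

As with the original function, different rows may still contain the same sample; only the indices within a single row are guaranteed to be distinct.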