One-hot encoding, accessing list elements

Question

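I have a .csv file with data, and I would like to convert some of its columns to one-hot encoding. The problem occurs in the second-to-last line, where the one-hot index (e.g. for the first feature) is set in all rows instead of only the row I am currently on. Something seems to be wrong with how I access the 2D list... Any suggestions? Thanks.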

def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    different_elements = []

    for row in data_list[1:]:                  # count different elements
        if row[column] not in different_elements:
            different_elements.append(row[column])

    for i in range(len(different_elements)):   # set variable names
        one_hot_list[0].append(different_elements[i])

    vector = []                              # create list shape with zeroes
    for i in range(len(different_elements)):
        vector.append(0)
    for i in range(1460):
        one_hot_list.append(vector)

    ind_row = 1                                # encode 1 for each sample
    for row in data_list[1:]:
        index = different_elements.index(row[column])
        one_hot_list[ind_row][index] = 1     # mistake!! sets all rows to 1
        ind_row += 1

Answer 1

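Your problem stems from the vector object you create for the one-hot encoding: you create a single object and then build a one_hot_list containing 1460 references to that same object. When you change one of the rows, the change shows up in all of them.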

The quick fix is to create a separate copy of vector for each row (see How to clone or copy a list?):

one_hot_list.append(vector[:])
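To see the aliasing in isolation, here is a minimal sketch with a 3-element vector and 3 rows (these sizes and the names shared_rows, copied_rows, and fresh_vector are made up for illustration; they are not from the question):

```python
# Appending the same list object repeatedly: every row is one and the same list.
vector = [0, 0, 0]
shared_rows = []
for _ in range(3):
    shared_rows.append(vector)

shared_rows[0][1] = 1      # meant to change only the first row
print(shared_rows)         # every row shows the change

# Appending a shallow copy (vector[:]) gives each row its own list.
fresh_vector = [0, 0, 0]
copied_rows = []
for _ in range(3):
    copied_rows.append(fresh_vector[:])

copied_rows[0][1] = 1      # changes only the first row
print(copied_rows)
```

A shallow copy is enough here because the rows hold only integers; nested mutable elements would need copy.deepcopy instead.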

A few other things in your function are somewhat slow or roundabout. I would suggest some changes:

def one_hot_encode(data_list, column):
    one_hot_list = [[]]

    # count different elements
    different_elements = set(row[column] for row in data_list[1:])

    # convert different_elements to a list with a canonical order,
    # store in the first element of one_hot_list
    one_hot_list[0] = sorted(different_elements)

    vector = [0] * len(different_elements)   # create list shape with zeroes
    one_hot_list.extend([vector[:] for _ in range(1460)])

    # build a mapping of different_element values to indices into
    # one_hot_list[0]
    index_lookup = dict((e,i) for (i,e) in enumerate(one_hot_list[0]))
    # encode 1 for each sample
    for rindex, row in enumerate(data_list[1:], 1):
        cindex = index_lookup[row[column]]
        one_hot_list[rindex][cindex] = 1

    return one_hot_list

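This builds different_elements in linear time by using the set data type. It then uses sorted() to produce one_hot_list[0] (the list of element values being one-hot encoded), list multiplication to create the zero vector, and a list comprehension to fill one_hot_list[1:] (the actual one-hot encoded matrix values). In addition, a dict named index_lookup lets you map element values to their integer indices quickly instead of searching for them over and over. Finally, the row index into the one_hot_list matrix is managed for you by enumerate.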
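As a rough end-to-end sketch of the same idea, the version below derives the number of rows from data_list instead of hard-coding 1460 and returns the result. The sample data and the choice to build each row inside the loop are assumptions made for illustration, not part of the answer's code:

```python
def one_hot_encode(data_list, column):
    """Return [categories, row_1, row_2, ...] for one column of data_list.

    data_list[0] is assumed to be a header row and is skipped.
    """
    rows = data_list[1:]
    # Distinct values in a canonical (sorted) order.
    categories = sorted(set(row[column] for row in rows))
    # Map each value to its column position in the encoded rows.
    index_lookup = {value: i for i, value in enumerate(categories)}

    one_hot_list = [categories]
    for row in rows:
        vector = [0] * len(categories)   # a new list for every row
        vector[index_lookup[row[column]]] = 1
        one_hot_list.append(vector)
    return one_hot_list


# Hypothetical sample data: a header row followed by three samples.
sample = [
    ["id", "color"],
    [1, "red"],
    [2, "blue"],
    [3, "red"],
]
encoded = one_hot_encode(sample, 1)
for line in encoded:
    print(line)
```

Because each vector is created inside the loop, no copying is needed and the aliasing problem cannot occur.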

Answer 2

I'm not 100% sure what you are trying to do, but the problem you are seeing is in these lines:

for i in range(1460):
    one_hot_list.append(vector)

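These lines build one_hot_list as 1460 references to the same zero vector, whereas I think you want a new vector each time. A straightforward fix is to copy it on every iteration: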

for i in range(1460):
    one_hot_list.append(vector[:])

But the more Pythonic way is to create the list with a comprehension. Perhaps something like this:

vector_size = len(different_elements)
one_hot_list = [[0] * vector_size for i in range(1460)]
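The comprehension works because [0] * vector_size is evaluated once per row. A tempting shortcut, multiplying the outer list, brings the shared-reference problem back. A small sketch with made-up sizes (vector_size = 3 and n_rows = 4 are assumptions for illustration):

```python
vector_size = 3
n_rows = 4

# Comprehension: the inner expression runs once per row, so rows are distinct.
independent = [[0] * vector_size for _ in range(n_rows)]

# Outer multiplication: one inner list, referenced n_rows times.
aliased = [[0] * vector_size] * n_rows

print(independent[0] is independent[1])
print(aliased[0] is aliased[1])

independent[0][0] = 1   # affects only the first row
aliased[0][0] = 1       # visible through every row
```

Multiplying the inner list [0] is safe because integers are immutable; only the repetition of the mutable inner list causes trouble.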

Answer 3

You can use set() to collect the unique items in a list:

different_elements = list(set(data[1:]))
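One caveat with this approach is that a set does not keep the order in which values first appeared, so the column order of the encoding can change between runs. The sketch below, using a made-up values list, shows two ways to get a stable order: sorting, or dict.fromkeys, which keeps first-seen order on Python 3.7 and later:

```python
values = ["Foo", "Bar", "Foo", "Baz", "Bar"]

# Order of a set is arbitrary and should not be relied on.
unordered = list(set(values))

# Canonical order: sort the distinct values.
sorted_unique = sorted(set(values))

# First-seen order: dict keys are unique and keep insertion order.
first_seen = list(dict.fromkeys(values))

print(sorted_unique)
print(first_seen)
```

All three require the items to be hashable, so they apply to a flat list of column values rather than to a list of rows that are themselves lists.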

Answer 4

I would suggest saving yourself the trouble of reimplementing this in plain Python. You can use pandas.get_dummies for this:

First, some test data (test.csv):

A
Foo
Bar
Baz

Then in Python:

import pandas as pd

df = pd.read_csv('test.csv')
# convert column 'A' to one-hot encoding
pd.get_dummies(df['A'])

You can retrieve the underlying numpy array with:

pd.get_dummies(df['A']).values

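This results in the following (the uint8 output is what pandas versions before 2.0 produce; from 2.0 on the default dtype is bool, and passing dtype=int to get_dummies gives integer 0/1 values):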

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]], dtype=uint8)
