Assign numpy matrix to pandas columns

Question

I have a dataframe with 48870 rows and computed embeddings of shape (48870, 768).

I want to assign these embeddings to a pandas column. When I try

test['original_text_embeddings'] = embeddings

I get an error: Wrong number of items passed 768, placement implies 1. I know that something like df.loc['original_text_embeddings'] = embeddings[0] would work, but I need to automate this process.

Answer 1

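Your embeddings have 768 columns, which would translate into 768 columns in the dataframe. You are trying to assign all of the embedding columns to a single dataframe column, which is not possible.

What you can do is build a new dataframe from the embeddings and concatenate the test df with the embedding df: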

embedding_df = pd.DataFrame(embeddings)

test = pd.concat([test, embedding_df], axis=1)

See the documentation on handling indexes and concatenating along different axes: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
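One detail worth sketching: pd.concat with axis=1 aligns rows by index label, so if test does not have the default 0..n-1 index, the rows can be mismatched and filled with NaN. The example below is a minimal sketch under assumptions: the small test frame, its index values, and the emb_ column prefix are hypothetical stand-ins, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's data: 5 rows instead of 48870,
# 4 embedding dimensions instead of 768, and a non-default index.
test = pd.DataFrame({"text": list("abcde")}, index=[10, 11, 12, 13, 14])
embeddings = np.arange(20, dtype=float).reshape(5, 4)

# Reuse test's index so concat aligns row for row, and give the new
# columns names (the "emb_" prefix is an arbitrary choice for this sketch).
embedding_df = pd.DataFrame(
    embeddings,
    index=test.index,
    columns=[f"emb_{i}" for i in range(embeddings.shape[1])],
)

result = pd.concat([test, embedding_df], axis=1)
print(result.shape)
```

Passing index=test.index is what keeps the row count unchanged here; without it, the embedding frame would get labels 0..4 and none of them would match test's labels.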

Answer 2

A dataframe column requires a 1d list/array:

In [84]: x = np.arange(12).reshape(3,4)
In [85]: pd.Series(x)
...
ValueError: Data must be 1-dimensional

Split the array into a list (of arrays):

In [86]: pd.Series(list(x))
Out[86]: 
0      [0, 1, 2, 3]
1      [4, 5, 6, 7]
2    [8, 9, 10, 11]
dtype: object
In [87]: _.to_numpy()
Out[87]: 
array([array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8,  9, 10, 11])],
      dtype=object)
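Applied to the question, the same idea might look like the sketch below. The small test frame and the 3x4 embeddings array are hypothetical stand-ins for the asker's 48870-row data; the column name is taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-ins for the question's dataframe and embeddings.
test = pd.DataFrame({"text": ["a", "b", "c"]})
embeddings = np.arange(12).reshape(3, 4)

# list(embeddings) yields one 1d array per row, so each cell of the
# new column holds a whole embedding vector (the column dtype is object).
test["original_text_embeddings"] = list(embeddings)

# When the 2d matrix is needed again, stack the per-row arrays back together.
restored = np.stack(test["original_text_embeddings"].tolist())
print(restored.shape)
```

The trade-off is that an object column stores Python references to separate arrays, so vectorized pandas operations do not apply to it directly; stacking back into a 2d array is the usual way to do numeric work on the embeddings.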
