Extract the values of a dataframe that correspond to a single element of another df

Question

I have 2 pandas dfs (df1 and df2) that look like this:

df1	col1	col2	col3	col4	col5
row1	Dog	Cat	Bird	Tree	Lion
row2	Cat	Dragon	Bird	Dog	Tree
row3	Cat	Dog	Bird	Tree	Hippo
row4	Cat	Tree	Bird	Ant	Fish
row5	Cat	Tree	Monkey	Dragon	Ant

df2	col1	col2	col3	col4	col5
row1	3.219843	1.3631996	1.0051135	0.89303696	0.4313375
row2	2.8661892	1.4396228	0.7863044	0.539315	0.48167187
row3	2.5679462	1.3657334	0.9470184	0.79186934	0.48637152
row4	3.631389	0.94815284	0.7561722	0.6743943	0.5441728
row5	2.4727197	1.5941181	1.4069512	1.064051	0.48297918

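The string elements of df1 correspond to the values of df2 at the same position. In both dataframes, an element (or value) is never repeated within the same row, but it can be repeated across different rows.

For example, Dog in row 1 = 3.219843, Bird in row 3 = 0.9470184, Bird in row 4 = 0.7561722, etc.

I would like to extract the values for every unique element of the first df into separate arrays. Like:

Dog = [3.219843, 0.539315, 1.3657334]

Cat = [1.3631996, 2.8661892, 2.5679462, 3.631389, 2.4727197]

etc...

Any ideas?

Many thanks!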

Answer 1

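Assuming the first column of df1 and df2 is the index of the respective df, we can extract the values for each unique animal in df1 by using the first df as a mask to pull all wanted values out of the second. The result is a new df with NaN in the irrelevant cells, which can be turned into a 1D array with .stack().values.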
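To make the masking step concrete, here is a minimal sketch on two hypothetical 2x2 frames (the names labels and values and their contents are made up for illustration and are not the data from the question). It shows the intermediate masked frame and then flattens it; the explicit dropna() is an addition so that the NaN cells are discarded no matter how stack() treats them.

```python
import pandas as pd

# Hypothetical 2x2 frames, for illustration only (not the question's data).
labels = pd.DataFrame({"col1": ["Dog", "Cat"], "col2": ["Cat", "Dog"]},
                      index=["row1", "row2"])
values = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]},
                      index=["row1", "row2"])

mask = labels.eq("Dog")   # boolean frame, True where the label is "Dog"
masked = values[mask]     # same shape as values, NaN where mask is False

# stack() flattens the frame row by row into a Series;
# dropna() keeps only the cells that matched the mask.
dog_values = masked.stack().dropna().to_numpy()

print(masked)
print(dog_values)
```

The same pattern, repeated for each unique label, is what the dict comprehension in this answer does.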

Building the dataframes

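First, create some test data. Please provide it in a form like this in future posts. That is what @mozway was saying in the comments, and it is greatly appreciated.

(There isn't always someone willing to do all the copy-and-pasting needed to set up the dataframes and run a test.)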

import pandas as pd
import numpy as np

index = ['row1', 'row2', 'row3', 'row4', 'row5']

data1 = {'col1': ['Dog', 'Cat', 'Cat', 'Cat', 'Cat'],
         'col2': ['Cat', 'Dragon', 'Dog', 'Tree', 'Tree'],
         'col3': ['Bird', 'Bird', 'Bird', 'Bird', 'Monkey'],
         'col4': ['Tree', 'Dog', 'Tree', 'Ant', 'Dragon'],
         'col5': ['Lion', 'Tree', 'Hippo', 'Fish', 'Ant']}

data2 = {'col1': [3.219843, 2.8661892, 2.5679462, 3.631389, 2.4727197],
         'col2': [1.3631996, 1.4396228, 1.3657334, 0.94815284, 1.5941181],
         'col3': [1.0051135, 0.7863044, 0.9470184, 0.7561722, 1.4069512],
         'col4': [0.89303696, 0.539315, 0.79186934, 0.6743943, 1.064051],
         'col5': [0.4313375, 0.48167187, 0.48637152, 0.5441728, 0.48297918]}

df1 = pd.DataFrame(data1, index=index)
df2 = pd.DataFrame(data2, index=index)

Extracting the data

Since you did not specify the desired data structure, here is the strategy outlined above as a dict comprehension:

{animal: df2[df1.eq(animal)].stack().values for animal in np.unique(df1)}

The result looks like this:

{'Ant': array([0.6743943 , 0.48297918]),
 'Bird': array([1.0051135, 0.7863044, 0.9470184, 0.7561722]),
 'Cat': array([1.3631996, 2.8661892, 2.5679462, 3.631389 , 2.4727197]),
 'Dog': array([3.219843 , 0.539315 , 1.3657334]),
 'Dragon': array([1.4396228, 1.064051 ]),
 'Fish': array([0.5441728]),
 'Hippo': array([0.48637152]),
 'Lion': array([0.4313375]),
 'Monkey': array([1.4069512]),
 'Tree': array([0.89303696, 0.48167187, 0.79186934, 0.94815284, 1.5941181 ])}
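One caveat, stated as an assumption rather than a guarantee: the comprehension above relies on stack() discarding the NaN cells, and that behaviour can differ between pandas versions. A variant that avoids stack() altogether is to index the underlying NumPy arrays with a boolean mask. The sketch below rebuilds the same test data so it runs on its own; the names labels, values and result are introduced here for illustration.

```python
import numpy as np
import pandas as pd

index = ['row1', 'row2', 'row3', 'row4', 'row5']

df1 = pd.DataFrame({'col1': ['Dog', 'Cat', 'Cat', 'Cat', 'Cat'],
                    'col2': ['Cat', 'Dragon', 'Dog', 'Tree', 'Tree'],
                    'col3': ['Bird', 'Bird', 'Bird', 'Bird', 'Monkey'],
                    'col4': ['Tree', 'Dog', 'Tree', 'Ant', 'Dragon'],
                    'col5': ['Lion', 'Tree', 'Hippo', 'Fish', 'Ant']},
                   index=index)

df2 = pd.DataFrame({'col1': [3.219843, 2.8661892, 2.5679462, 3.631389, 2.4727197],
                    'col2': [1.3631996, 1.4396228, 1.3657334, 0.94815284, 1.5941181],
                    'col3': [1.0051135, 0.7863044, 0.9470184, 0.7561722, 1.4069512],
                    'col4': [0.89303696, 0.539315, 0.79186934, 0.6743943, 1.064051],
                    'col5': [0.4313375, 0.48167187, 0.48637152, 0.5441728, 0.48297918]},
                   index=index)

labels = df1.to_numpy(dtype=object)  # 2D array of animal names
values = df2.to_numpy()              # 2D float array with the same shape

# A 2D boolean mask selects the matching cells in row-major order,
# which is the same order that stacking the frame row by row gives.
result = {animal: values[labels == animal] for animal in np.unique(labels)}

print(result['Dog'])
```

This only works when both frames have the same shape and the same row and column order, which is the case for the data in the question.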

Answer 2

Assuming the input provided by @fsimonjetz, you can stack both dataframes, then GroupBy.agg as list:

df2.stack().groupby(df1.stack()).agg(list).to_dict()

Alternatively, using an intermediate DataFrame:

(pd
 .concat([df1.stack(),df2.stack()], axis=1)
 .groupby(0)[1].agg(list)
 .to_dict()
)

Output:

{'Ant': [0.6743943, 0.48297918],
 'Bird': [1.0051135, 0.7863044, 0.9470184, 0.7561722],
 'Cat': [1.3631996, 2.8661892, 2.5679462, 3.631389, 2.4727197],
 'Dog': [3.219843, 0.539315, 1.3657334],
 'Dragon': [1.4396228, 1.064051],
 'Fish': [0.5441728],
 'Hippo': [0.48637152],
 'Lion': [0.4313375],
 'Monkey': [1.4069512],
 'Tree': [0.89303696, 0.48167187, 0.79186934, 0.94815284, 1.5941181]}
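To see why grouping one stacked Series by the other works, and to get NumPy arrays (as the question asked) instead of lists, here is a small sketch on hypothetical 2x2 frames. The names labels, values, keys, vals and arrays are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frames, for illustration only (not the question's data).
labels = pd.DataFrame({"col1": ["Dog", "Cat"], "col2": ["Cat", "Dog"]},
                      index=["row1", "row2"])
values = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]},
                      index=["row1", "row2"])

keys = labels.stack()   # Series of labels indexed by (row, column)
vals = values.stack()   # Series of numbers with the same (row, column) index

# groupby with a Series aligns on the shared index, so each number is
# grouped under the label found at the same (row, column) position.
grouped = vals.groupby(keys).agg(list)

# Convert each list to a NumPy array.
arrays = {name: np.asarray(items) for name, items in grouped.items()}

print(keys)
print(arrays)
```

Within each group the values keep the row-by-row order of the stacked Series.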
