Concatenate two csv files based on column content in Pandas

Question

I have two large CSV files with sample data as shown below:

df1 = 
Index    Fruit   Vegetable    
    0    Mango   Spinach
    1    Berry   Carrot
    2    Banana  Cabbage

df2 = 
Index   Unit        Price
   0    Mango       30
   1    Artichoke   45
   2    Banana      12
   3    Berry       10
   4    Cabbage     25
   5    Rice        40
   6    Spinach     34
   7    Carrot      08
   8    Lentil      12
   9    Pot         32

I want to create the following dataframe:

df3 = 
Index    Fruit   Price      Vegetable    Price   
    0    Mango   30         Spinach      34
    1    Berry   10         Carrot       08   
    2    Banana  12         Cabbage      25

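I want to compare the prices of the units in each row of df1. If the two prices are within $5 of each other, I want to output those rows to a separate dataframe, like this: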

df4 = 
Index    Fruit   Price      Vegetable    Price   
    0    Mango   30         Spinach      34
    1    Berry   10         Carrot       08

What is a general way to achieve this? Thanks in advance.

Answer 1

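You can use replace to build a dataframe of prices from df2, then join it to the original data.

Note that duplicate column names are discouraged, which is why the price columns get a _Price suffix here: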

# print to see what it does
item_prices = dict(zip(df2.Unit, df2.Price))

out = df1.join(df1.replace(item_prices).add_suffix('_Price')).sort_index(axis=1)

Output:

        Fruit  Fruit_Price Vegetable  Vegetable_Price
Index                                                
0       Mango           30   Spinach               34
1       Berry           10    Carrot                8
2      Banana           12   Cabbage               25

For the second part of the question, use boolean indexing:

out[abs(out['Fruit_Price'] - out['Vegetable_Price']) < 5]

Or query:

out.query('abs(Fruit_Price-Vegetable_Price)<5')

Output:

       Fruit  Fruit_Price Vegetable  Vegetable_Price
Index                                               
0      Mango           30   Spinach               34
1      Berry           10    Carrot                8
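
For reference, here is a self-contained sketch of the same lookup-and-filter idea. It swaps replace for Series.map, so a name that is missing from df2 shows up as NaN instead of being left as text in the price column. The sample frames are rebuilt from the question, and the names price_by_unit, prices and close are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question (Index is used as the index name)
df1 = pd.DataFrame(
    {"Fruit": ["Mango", "Berry", "Banana"],
     "Vegetable": ["Spinach", "Carrot", "Cabbage"]}
).rename_axis("Index")

df2 = pd.DataFrame(
    {"Unit": ["Mango", "Artichoke", "Banana", "Berry", "Cabbage",
              "Rice", "Spinach", "Carrot", "Lentil", "Pot"],
     "Price": [30, 45, 12, 10, 25, 40, 34, 8, 12, 32]}
)

# Series that maps each unit name to its price
price_by_unit = df2.set_index("Unit")["Price"]

# Look up the price of every cell in df1; names absent from df2 become NaN
prices = df1.apply(lambda col: col.map(price_by_unit)).add_suffix("_Price")

out = df1.join(prices).sort_index(axis=1)

# Keep the rows whose two prices differ by less than 5
close = out[(out["Fruit_Price"] - out["Vegetable_Price"]).abs() < 5]
print(close)
```

This assumes every Unit in df2 is unique; with duplicate units the lookup would be ambiguous and map would raise an error.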

Answer 2

You can use a double merge:

fruit = df1[['Fruit']].merge(df2.rename(columns={'Unit': 'Fruit'}), on='Fruit')
veggie = df1[['Vegetable']].merge(df2.rename(columns={'Unit': 'Vegetable'}), on='Vegetable')

df3 = pd.concat([fruit, veggie], axis=1)
print(df3)

# Output:
    Fruit  Price Vegetable  Price
0   Mango     30   Spinach     34
1   Berry     10    Carrot      8
2  Banana     12   Cabbage     25

Then (this needs import numpy as np):

df4 = df3[np.abs(np.subtract(*df3['Price'].values.T)) <= 5]
print(df4)

# Output:
   Fruit  Price Vegetable  Price
0  Mango     30   Spinach     34
1  Berry     10    Carrot      8
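
One caveat with the double merge: the default inner join drops any row whose name is missing from df2, so the two merged frames can end up with different lengths and no longer line up when concatenated. A minimal sketch that avoids this with how='left' and also gives the price columns distinct names is below. The helper with_price is hypothetical, and df2 is assumed to hold only the Unit and Price columns with unique units.

```python
import pandas as pd

# Sample data rebuilt from the question
df1 = pd.DataFrame(
    {"Fruit": ["Mango", "Berry", "Banana"],
     "Vegetable": ["Spinach", "Carrot", "Cabbage"]}
)
df2 = pd.DataFrame(
    {"Unit": ["Mango", "Artichoke", "Banana", "Berry", "Cabbage",
              "Rice", "Spinach", "Carrot", "Lentil", "Pot"],
     "Price": [30, 45, 12, 10, 25, 40, 34, 8, 12, 32]}
)

def with_price(names: pd.Series) -> pd.DataFrame:
    """Attach the df2 price to one df1 column (illustrative helper)."""
    lookup = df2.rename(
        columns={"Unit": names.name, "Price": f"{names.name}_Price"}
    )
    # A left merge keeps every row of df1, in its original order
    return names.to_frame().merge(lookup, on=names.name, how="left")

df3 = pd.concat(
    [with_price(df1["Fruit"]), with_price(df1["Vegetable"])], axis=1
)

# Rows whose two prices are within 5 of each other
df4 = df3[(df3["Fruit_Price"] - df3["Vegetable_Price"]).abs() <= 5]
print(df4)
```

With distinct column names the filter can use plain column arithmetic instead of unpacking the duplicated Price columns.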

Answer 3

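A generic alternative (which can handle any number of categories) is to reshape before the merge (with melt) and after it (with pivot). This has the advantage of creating a MultiIndex, which is convenient and explicitly identifies the price of each category: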

out = (df1.melt(id_vars='Index', value_name='Unit')
          .merge(df2.drop(columns='Index'), on='Unit')
          .pivot(index='Index', columns='variable', values=['Unit', 'Price'])
       )

Output:

            Unit           Price          
variable   Fruit Vegetable Fruit Vegetable
Index                                     
0          Mango   Spinach    30        34
1          Berry    Carrot    10         8
2         Banana   Cabbage    12        25

Subsetting the rows with a difference ≤ 5:

out[out['Price'].diff(axis=1).abs().le(5).any(axis=1)]

Output:

           Unit           Price          
variable  Fruit Vegetable Fruit Vegetable
Index                                    
0         Mango   Spinach    30        34
1         Berry    Carrot    10         8
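
A self-contained sketch of the melt / merge / pivot route follows, assuming Index is an ordinary column of df1 and that df2 holds only Unit and Price. The filter here is a variation, not the expression above: it keeps a row when the spread between its highest and lowest price is at most 5, which is one possible way to read "within 5" once there are more than two categories. The names long_form, wide and spread are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question, with Index as a regular column
df1 = pd.DataFrame(
    {"Index": [0, 1, 2],
     "Fruit": ["Mango", "Berry", "Banana"],
     "Vegetable": ["Spinach", "Carrot", "Cabbage"]}
)
df2 = pd.DataFrame(
    {"Unit": ["Mango", "Artichoke", "Banana", "Berry", "Cabbage",
              "Rice", "Spinach", "Carrot", "Lentil", "Pot"],
     "Price": [30, 45, 12, 10, 25, 40, 34, 8, 12, 32]}
)

# Long format: one row per (Index, Category, Unit), then attach the price
long_form = (
    df1.melt(id_vars="Index", var_name="Category", value_name="Unit")
       .merge(df2, on="Unit", how="left")
)

# Back to wide format with a (value, Category) MultiIndex on the columns
wide = long_form.pivot(index="Index", columns="Category",
                       values=["Unit", "Price"])

# Pivoting mixed columns can leave Price as object dtype, so cast first
prices = wide["Price"].astype(float)
spread = prices.max(axis=1) - prices.min(axis=1)

subset = wide[spread <= 5]
print(subset)
```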
