Count values in a pandas DataFrame and use those values to create a sub-DataFrame

Question

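I have a pandas DataFrame. I want to count the values in one column to find out which ones are repeated. Then I want to extract only the repeated values and use them to create a sub-DataFrame.

Let's take an example. Suppose this is my DataFrame: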

df =

    type        color       name
0   fruit       red         apple
1   fruit       yellow      banana
2   meat        brown       steak
3   fruit       green       apple
4   fruit       orange      orange
5   veg         orange      carrot
6   fruit       yellow      apple
7   meat        brown       steak
8   veg         orange      carrot

I want to know whether the 'name' column contains repeated values. To check, I used this line:

df['name'].value_counts().loc[lambda x : x>1]

This is what I get:

apple   3
steak   2
carrot  2

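Next, I want to create a sub-DataFrame by filtering the 'name' column on 'apple', 'steak', and 'carrot', so that I can see the values in the other columns that go with them. Presumably this can be done with an appropriate function.

The desired output is: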

sub_df =

    type        color       name
0   fruit       red         apple
1   fruit       green       apple
2   fruit       yellow      apple
3   meat        steak       brown
4   meat        steak       brown
5   veg         orange      carrot
6   veg         orange      carrot

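I have tried different kinds of code without success. I think the problem lies in how I use value_counts(): it gives me a pandas Series containing the number of occurrences, but I cannot access the values that were counted.

Any suggestions?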

Answer 1

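Next time, please provide better test data (data that can be copied and pasted).

I think your desired output is wrong, because the color column contains a steak value.

I tried the following, which should do what you asked. I assume you understand your own code; I only added this line: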

df[df["name"].isin(y.index.tolist())]

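It keeps the rows whose name value appears among the index values of the Series (isin). The index of the value_counts() result holds the counted values themselves, which is why y.index is used. If you want a completely new DataFrame with its own index, append .reset_index() to the line above (pass drop=True if the old index should not be kept as a column).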
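As a minimal sketch of where the counted values live, the snippet below builds a small Series (the variable names `names`, `counts`, and `repeated` are illustrative and not from the original post) and reads the repeated values from the index of the `value_counts()` result:

```python
import pandas as pd

# Illustrative data: the 'name' column from the question as a standalone Series
names = pd.Series(
    ["apple", "banana", "steak", "apple", "orange",
     "carrot", "apple", "steak", "carrot"]
)

counts = names.value_counts()      # index = distinct values, data = occurrences
repeated = counts[counts > 1]      # keep only values seen more than once

print(repeated.index.tolist())     # the repeated values themselves
print(repeated.to_dict())          # value -> number of occurrences
```

The list from `repeated.index` is what gets passed to `isin` when filtering the DataFrame.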

import pandas as pd

df = pd.DataFrame([
    ["fruit", "red", "apple"],
    ["fruit", "yellow", "banana"],
    ["meat", "brown", "steak"],
    ["fruit", "green", "apple"],
    ["fruit", "orange", "orange"],
    ["veg", "orange", "carrot"],
    ["fruit", "yellow", "apple"],
    ["meat", "brown", "steak"],
    ["veg", "orange", "carrot"]
],
    columns=["type", "color", "name"])

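# Show the full DataFrame
print(df)

# Count each name and keep only the names that occur more than once
y = df['name'].value_counts().loc[lambda x: x > 1]

print(y)

# Keep the rows whose name is one of the repeated values (the index of y)
df_2 = df[df["name"].isin(y.index.tolist())]

print(df_2)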

Output:

    type   color    name
0  fruit     red   apple
1  fruit  yellow  banana
2   meat   brown   steak
3  fruit   green   apple
4  fruit  orange  orange
5    veg  orange  carrot
6  fruit  yellow   apple
7   meat   brown   steak
8    veg  orange  carrot
apple     3
steak     2
carrot    2
Name: name, dtype: int64
    type   color    name
0  fruit     red   apple
2   meat   brown   steak
3  fruit   green   apple
5    veg  orange  carrot
6  fruit  yellow   apple
7   meat   brown   steak
8    veg  orange  carrot
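The filtered result above keeps the original row labels (0, 2, 3, ...). A sketch of the `.reset_index()` step mentioned earlier follows, assuming you also want the rows grouped by name; the sort and `drop=True` are additions here, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["fruit", "red", "apple"],
        ["fruit", "yellow", "banana"],
        ["meat", "brown", "steak"],
        ["fruit", "green", "apple"],
        ["fruit", "orange", "orange"],
        ["veg", "orange", "carrot"],
        ["fruit", "yellow", "apple"],
        ["meat", "brown", "steak"],
        ["veg", "orange", "carrot"],
    ],
    columns=["type", "color", "name"],
)

counts = df["name"].value_counts()
repeated_names = counts[counts > 1].index   # names that occur more than once

sub_df = (
    df[df["name"].isin(repeated_names)]
    .sort_values("name", kind="stable")     # group equal names, keep row order within each
    .reset_index(drop=True)                 # fresh 0..n-1 index, old labels discarded
)

print(sub_df)
```

Note that the sort is alphabetical (apple, carrot, steak), which differs from the count-based order shown in the question.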

Answer 2

You don't need to do this in two steps. Here is how to get the final result with groupby and filter:

df.groupby('name').filter(lambda g: g['type'].count() > 1).sort_values('name')

Output:


    type    color   name
0   fruit   red     apple
3   fruit   green   apple
6   fruit   yellow  apple
5   veg     orange  carrot
8   veg     orange  carrot
2   meat    brown   steak
7   meat    brown   steak
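Another one-step option, not taken from either answer, is `Series.duplicated` with `keep=False`, which marks every row whose name occurs more than once. A minimal sketch, assuming the same data as in the question:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["fruit", "red", "apple"],
        ["fruit", "yellow", "banana"],
        ["meat", "brown", "steak"],
        ["fruit", "green", "apple"],
        ["fruit", "orange", "orange"],
        ["veg", "orange", "carrot"],
        ["fruit", "yellow", "apple"],
        ["meat", "brown", "steak"],
        ["veg", "orange", "carrot"],
    ],
    columns=["type", "color", "name"],
)

# keep=False flags all occurrences of a repeated name, not just the later ones
mask = df["name"].duplicated(keep=False)

# Stable sort so rows with the same name stay in their original relative order
sub_df = df[mask].sort_values("name", kind="stable")

print(sub_df)
```

This avoids calling a Python lambda once per group, which may matter on large frames, though the groupby version is arguably easier to extend to other group-level conditions.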

python

distinct-values

pandas