Is there a way to keep only the rows of a DataFrame where one column contains a substring of another column in the same DataFrame?

Question

I have a dataset:

                                  id                                   key            value
24                Apple Inc_Desktops    revenue_rgs_category_-_pc_monitors              nan
2                 Apple Inc_Desktops  revenue_rgs_category_-_mobile_phones 142381000000.000
46                Apple Inc_Desktops     revenue_rgs_category_-_smart_tech  24482000000.000
13                Apple Inc_Desktops    revenue_rgs_category_-_desktop_pcs  12870000000.000
35                Apple Inc_Desktops        revenue_rgs_category_-_tablets  21280000000.000
1                  Apple Inc_Laptops  revenue_rgs_category_-_mobile_phones 142381000000.000
45                 Apple Inc_Laptops     revenue_rgs_category_-_smart_tech  24482000000.000
23                 Apple Inc_Laptops    revenue_rgs_category_-_pc_monitors              nan
34                 Apple Inc_Laptops        revenue_rgs_category_-_tablets  21280000000.000
12                 Apple Inc_Laptops    revenue_rgs_category_-_desktop_pcs  12870000000.000
25            Apple Inc_MobilePhones    revenue_rgs_category_-_pc_monitors              nan
14            Apple Inc_MobilePhones    revenue_rgs_category_-_desktop_pcs  12870000000.000
36            Apple Inc_MobilePhones        revenue_rgs_category_-_tablets  21280000000.000
47            Apple Inc_MobilePhones     revenue_rgs_category_-_smart_tech  24482000000.000
3             Apple Inc_MobilePhones  revenue_rgs_category_-_mobile_phones 142381000000.000

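I only want to keep the rows where the key column contains a substring from the id column. For example, as shown below, I only want to keep the rows with index 13 and 3, because for those rows the 'key' column contains part of the id column. For the row with index 3, for instance, 'Mobile' is contained in the key column.

So my desired output is: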

                                  id                                   key            value
13                Apple Inc_Desktops    revenue_rgs_category_-_desktop_pcs  12870000000.000
3             Apple Inc_MobilePhones  revenue_rgs_category_-_mobile_phones 142381000000.000

I tried to create a new column indicating whether the 'key' column contains a substring of the 'id' column, but had no luck:

comp_rev_long['check'] = comp_rev_long['key'].str.contains('|'.join(comp_rev_long['id']),case=False)

Any ideas on how to do this efficiently? Thanks in advance.

Answer 1

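One way to solve this is to check, row by row, whether the string in one column occurs in the key column. In the example below I build a df (since you did not provide one) in which one column contains a string that appears in another column: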

import pandas as pd

a1, b2, c3 = 'ANDGFEEHsdsdSHSHS', 'FKDsdsdKSDKSDKS', 'DSLDJSLffsfsKDdSLDJS'
s1, s2, s3 = 1, 3, 1
e1, e2, e3 = 3, 6, 6
df = pd.DataFrame({'key': [a1, b2, c3], 'start': [s1, s2, s3], 'end': [e1, e2, e3]})

df = df[['key', 'start', 'end']]
# fn was not defined in the original post; assumed here to slice key[start:end]
fn = lambda x: x.key[x.start:x.end]
df['sliced'] = df.apply(fn, axis=1)
aa1, bb2, cc3 = 'ANDGFEEHsdsdSHSHS', 'FKDsdsdKSDKSDKS', 'DSLDJSLffsfsKDdSLDJS'
ss1, ss2, ss3 = 2, 2, 2
ee1, ee2, ee3 = 1, 1, 1
df2 = pd.DataFrame({'key': [aa1, bb2, cc3], 'start': [ss1, ss2, ss3], 'end': [ee1, ee2, ee3]})
df2 = df2[['key', 'start', 'end']]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
dff = pd.concat([df, df2])

You then apply a row-wise check to determine whether the string in one column is present in key:

df['Check'] = df.apply(lambda x: x.sliced in x.key, axis=1)

The Check column is True for every row whose sliced string occurs in key.

Answer 2

Here is some code to help you get started:

import numpy as np
import pandas as pd

np.random.seed(1)
# I create a simple DataFrame
df = pd.DataFrame({"id": np.random.choice(["apple", "banana", "cherry"], 15),
                   "key":  np.random.choice(["apple pie", "banana pie", "cherry pie"], 15),
                   "value":    np.random.randint(0,20, 15)})

df looks like this:

        id         key  value
0   banana  cherry pie     13
1    apple  banana pie      9
2    apple  cherry pie      9
3   banana   apple pie      7
4   banana   apple pie      1
5    apple  cherry pie      0
6    apple   apple pie     17
7   banana  banana pie      8
8    apple  cherry pie     13
9   banana  cherry pie     19
10   apple   apple pie     15
11  cherry  banana pie     10
12  banana  banana pie      8
13  cherry  cherry pie      7
14   apple   apple pie      3

Here is a simple way to select only the rows that satisfy a given condition.

# create a function that checks if a row satisfies your condition
check_condition = lambda row: row["id"] in row["key"]

# create a new column that determines whether you keep the row
# by applying the check_condition function row wise (-> axis=1)
df["keep_row"] =  df.apply(check_condition, axis=1)

# finally select and keep only the desired rows 
df = df[df["keep_row"]]

Now df looks like this:

        id         key  value  keep_row
6    apple   apple pie     17      True
7   banana  banana pie      8      True
10   apple   apple pie     15      True
12  banana  banana pie      8      True
13  cherry  cherry pie      7      True
14   apple   apple pie      3      True
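The same row-wise test can also be written as a list comprehension over the two columns, which avoids the extra keep_row column. This is only a minimal sketch, and the small frame below uses hand-picked illustrative values rather than the random data above:

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not the seeded data above)
df = pd.DataFrame({
    "id": ["banana", "apple", "cherry", "apple"],
    "key": ["cherry pie", "apple pie", "cherry pie", "banana pie"],
    "value": [13, 17, 7, 9],
})

# Build a boolean mask: True where the id string occurs inside the key string
mask = [i in k for i, k in zip(df["id"], df["key"])]

# Boolean indexing keeps only the matching rows, with no helper column
filtered = df.loc[mask]
print(filtered)
```

For plain string columns this tends to be faster than df.apply with axis=1, because it skips building a Series for every row.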

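The remaining question is how to check whether a substring of one string is contained in another. There are several ways to approach this:

Replace the values so that the check becomes trivial, e.g. row["id"] in row["key"].
If you only need to know whether it is mobile or pc, create a new 'device' column.
Write custom code for it, although that is a bit cumbersome.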
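The first two options could look roughly like the sketch below. It assumes a hand-written mapping (the hypothetical device_tokens dict) from the suffix of id to the token that appears in key. That mapping is an illustration built from the sample rows in the question, not something derived automatically.

```python
import pandas as pd

# A few rows shaped like the data in the question
df = pd.DataFrame({
    "id": ["Apple Inc_Desktops", "Apple Inc_Desktops",
           "Apple Inc_MobilePhones", "Apple Inc_MobilePhones"],
    "key": ["revenue_rgs_category_-_desktop_pcs",
            "revenue_rgs_category_-_tablets",
            "revenue_rgs_category_-_mobile_phones",
            "revenue_rgs_category_-_smart_tech"],
}, index=[13, 35, 3, 47])

# Hypothetical mapping from the id suffix to the token used in key
device_tokens = {"Desktops": "desktop", "Laptops": "laptop",
                 "MobilePhones": "mobile_phones"}

# Take the part after the last underscore and translate it via the mapping.
# Suffixes missing from the mapping would become NaN and need handling first.
df["device"] = df["id"].str.rsplit("_", n=1).str[-1].map(device_tokens)

# With the normalized column, the check is a plain substring test
keep = [d in k for d, k in zip(df["device"], df["key"])]
result = df.loc[keep]
print(result)
```

The trade-off is that the mapping has to be maintained by hand, but the matching logic itself stays trivial.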

Judging from your data, this check_condition might work, but of course I cannot be sure.

def check_condition(row):
    for i in row["id"].lower().split('_'):
        if i in row["key"].lower():
            return True
        elif i[:-1] in row["key"].lower(): # account for the final 's'
            return True
    return False

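Two notes:

This is not a lambda function, but in this case it is used in the same way, so you can replace the lambda check_condition function with this one.
Also note that in the 'id' and 'key' columns some words end in 's' and some do not, so that needs to be accounted for as well.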
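One case the function above may miss is a CamelCase suffix such as 'MobilePhones', which appears in key as 'mobile_phones'. The sketch below is one possible variant, not a definitive solution: it assumes the id suffix is CamelCase, splits it into words, drops a single trailing 's' from each word, and requires every word to occur in key. The helper name id_tokens is made up for this illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "id": ["Apple Inc_Desktops", "Apple Inc_Desktops", "Apple Inc_Laptops",
           "Apple Inc_MobilePhones", "Apple Inc_MobilePhones"],
    "key": ["revenue_rgs_category_-_desktop_pcs",
            "revenue_rgs_category_-_pc_monitors",
            "revenue_rgs_category_-_desktop_pcs",
            "revenue_rgs_category_-_mobile_phones",
            "revenue_rgs_category_-_smart_tech"],
}, index=[13, 24, 12, 3, 47])

def id_tokens(id_value):
    """Split the id suffix into lowercase words without a trailing 's'."""
    suffix = id_value.split("_")[-1]            # e.g. "MobilePhones"
    words = re.findall(r"[A-Z][a-z]*", suffix)  # e.g. ["Mobile", "Phones"]
    return [w[:-1].lower() if w.endswith("s") else w.lower() for w in words]

def check_condition(row):
    key = row["key"].lower()
    tokens = id_tokens(row["id"])
    # Require at least one token, and every token must occur in key
    return bool(tokens) and all(t in key for t in tokens)

result = df[df.apply(check_condition, axis=1)]
print(result)
```

On these sample rows only the rows with index 13 and 3 are kept, matching the output requested in the question. Other naming patterns in the real data would need their own rules.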
