How to add a new column to a Python pandas DataFrame by searching for keyword values given in a list?

Question

I want to add a new column to a DataFrame based on keywords identified in an existing column:

This is the current data (DataFrame name = df):

    Topic                   Count
0   This is Python          39
1   This is SQL             6
2   This is Paython Pandas  98
3   import tkinter          81
4   Learning Python         94
5   SQL Working             85
6   Pandas and Work         67
7   This is Pandas          30
8   Computer                20
9   Mobile Work             55
10  Smart Mobile            69

The output I want is as follows:

    Topic                   Count       Groups
0   This is Python          39          Python_Group
1   This is SQL             6           SQL_Group
2   This is Paython Pandas  98          Python_Group
3   import tkinter          81          Python_Group
4   Learning Python         94          Python_Group
5   SQL Working             85          SQL_Group
6   Pandas and Work         67          Python_Group
7   This is Pandas          30          Python_Group
8   Computer                20          Devices_Group
9   Mobile Work             55          Devices_Group
10  Smart Mobile            69          Devices_Group

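How the Groups column values are determined

The groups are created from the keywords below, matched against the Topic column. If a specific word is found in Topic, the corresponding group name needs to be assigned.

Keyword lists to match against the Topic column: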

Python_Group = ['Python','Pandas','tkinter']
SQL_Group = ['SQL', 'Select']
Devices_Group = ['Computer','Mobile']

I tried the code below:

df['Groups'] = [
    'Python Group' if "Python" in x 
    else 'Python Group' if "Pandas" in x
    else 'Python Group' if "tkinter" in x
    else 'SQL Group' if "SQL" in x
    else 'Devices Group' if "Computer" in x
    else 'Devices Group' if "Mobile" in x
    else '000' 
    for x in df['Topic']]
print(df)

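The code above gives me the desired output, but I want something shorter and faster: the DataFrame mentioned above has more than 2 million records, and it is impractical for me to write 1k+ lines of code to define the groups.

Is there a way to use the keyword lists directly against the Topic column? Or a custom function that could help with this iteration?

Code 2: another attempt, made after consulting Stack Overflow experts: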

d = pd.read_excel('Map.xlsx').to_dict('list')
keyword_groups = {x:k for k, v in d.items() for x in v}
pat = '({})'.format('|'.join(keyword_groups))   #This line is giving an error
df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
               .map(keyword_groups)
               .fillna('000'))

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-131-543675c0b403> in <module>
      3 
      4 keyword_groups = {x:k for k, v in d.items() for x in v}
----> 5 pat = '({})'.format('|'.join(keyword_groups))
      6 pat

TypeError: sequence item 5: expected str instance, float found

Thanks for your help.

Answer 1

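You can do this with np.select. np.select takes three arguments: a list of conditions, a list of results, and a default value to use when none of the conditions match.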

import numpy as np

Python_Group = ['Python','Pandas','tkinter']
SQL_Group = ['SQL', 'Select']
Devices_Group = ['Computer','Mobile']

conditions = [
    df['Topic'].str.contains('|'.join(Python_Group))
    ,df['Topic'].str.contains('|'.join(SQL_Group))
    ,df['Topic'].str.contains('|'.join(Devices_Group))
]

results = [
    "Python_Group"
    ,"SQL_Group"
    ,"Devices_Group"
]

df['Groups'] = np.select(conditions, results, '000')
#output:
    Topic                   Count   Groups
0   This is Python          39      Python_Group
1   This is SQL             6       SQL_Group
2   This is Paython Pandas  98      Python_Group
3   import tkinter          81      Python_Group
4   Learning Python         94      Python_Group
5   SQL Working             85      SQL_Group
6   Pandas and Work         67      Python_Group
7   This is Pandas          30      Python_Group
8   Computer                20      Devices_Group
9   Mobile Work             55      Devices_Group
10  Smart Mobile            69      Devices_Group
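
With many groups, writing one condition per group by hand becomes tedious. The same idea can be sketched with the groups held in a dict and the conditions built in a loop. This is only an illustration under some assumptions: the `groups` dict, the `re.escape` and `na=False` safeguards, and the extra "Unrelated text" row are not part of the original answer.

```python
import re

import numpy as np
import pandas as pd

# A few rows from the question, plus one hypothetical non-matching row.
df = pd.DataFrame({
    "Topic": ["This is Python", "This is SQL", "import tkinter",
              "Computer", "Smart Mobile", "Unrelated text"],
    "Count": [39, 6, 81, 20, 69, 1],
})

# Group name -> keywords, as listed in the question.
groups = {
    "Python_Group": ["Python", "Pandas", "tkinter"],
    "SQL_Group": ["SQL", "Select"],
    "Devices_Group": ["Computer", "Mobile"],
}

# One boolean mask per group. re.escape guards against keywords that
# contain regex metacharacters; na=False treats missing topics as
# non-matches instead of propagating NaN into the mask.
conditions = [
    df["Topic"].str.contains("|".join(map(re.escape, keywords)), na=False)
    for keywords in groups.values()
]

# For each row np.select takes the first condition that is True,
# so the order of `groups` decides which group wins on overlap.
df["Groups"] = np.select(conditions, list(groups), default="000")
print(df)
```

Because the first matching condition wins, a topic containing keywords from two groups is assigned to whichever group appears first in the dict.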

Answer 2

One approach is to maintain your groups and keywords in a dict:

d = {'Python_Group': ['Python','Pandas','tkinter'],
     'SQL_Group': ['SQL', 'Select'],
     'Devices_Group': ['Computer','Mobile']}

From here, you can easily invert it into a "keyword: group" dict.

keyword_groups = {x:k for k, v in d.items() for x in v}

# {'Python': 'Python_Group',
#  'Pandas': 'Python_Group',
#  'tkinter': 'Python_Group',
#  'SQL': 'SQL_Group',
#  'Select': 'SQL_Group',
#  'Computer': 'Devices_Group',
#  'Mobile': 'Devices_Group'}

You can then use Series.str.extract to find these keywords using regex and map them to the correct group. Use fillna to catch any rows that match no keyword.

pat = '({})'.format('|'.join(keyword_groups))

df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
               .map(keyword_groups)
               .fillna('000'))

[out]

                     Topic  Count          Groups
0           This is Python     39    Python_Group
1              This is SQL      6       SQL_Group
2   This is Paython Pandas     98    Python_Group
3           import tkinter     81    Python_Group
4          Learning Python     94    Python_Group
5              SQL Working     85       SQL_Group
6          Pandas and Work     67    Python_Group
7           This is Pandas     30    Python_Group
8                 Computer     20   Devices_Group
9              Mobile Work     55   Devices_Group
10            Smart Mobile     69   Devices_Group
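
The TypeError in the question ("expected str instance, float found") is consistent with the keyword mapping being read from a spreadsheet whose columns have different lengths: pandas pads the shorter columns with NaN, which is a float, and `'|'.join` cannot join it. A minimal sketch of one way around this, assuming that is the cause, is to skip missing values while inverting the dict. The dict `d` below is a hypothetical stand-in for the contents of Map.xlsx, and `re.escape` is an extra safeguard not in the original answer.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel('Map.xlsx').to_dict('list'):
# shorter columns are padded with NaN, which is a float.
d = {
    "Python_Group": ["Python", "Pandas", "tkinter"],
    "SQL_Group": ["SQL", "Select", np.nan],
    "Devices_Group": ["Computer", "Mobile", np.nan],
}

# Skip the NaN padding while inverting to "keyword: group".
keyword_groups = {x: k for k, v in d.items() for x in v if pd.notna(x)}

# Escape each keyword so regex metacharacters are matched literally.
pat = "({})".format("|".join(re.escape(x) for x in keyword_groups))

df = pd.DataFrame({
    "Topic": ["This is SQL", "Pandas and Work", "Mobile Work", "Unrelated text"],
})
df["Groups"] = (df["Topic"].str.extract(pat, expand=False)
                .map(keyword_groups)
                .fillna("000"))
print(df)
```

If the spreadsheet also holds numeric keywords, they would need converting to strings before joining; that case is not handled here.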

list-comprehension

python-3.x

pandas