How to add a new column to a Python pandas DataFrame by searching for keyword values given in a list?
I want to add a new column to a DataFrame based on keywords identified in an existing column.
This is the current data (DataFrame name = df):
Topic Count
0 This is Python 39
1 This is SQL 6
2 This is Paython Pandas 98
3 import tkinter 81
4 Learning Python 94
5 SQL Working 85
6 Pandas and Work 67
7 This is Pandas 30
8 Computer 20
9 Mobile Work 55
10 Smart Mobile 69
The output I want is as follows:
Topic Count Groups
0 This is Python 39 Python_Group
1 This is SQL 6 SQL_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 SQL Working 85 SQL_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
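How the Groups column values are determined:
The groups are derived from keywords in the Topic column. If a specific word is found in Topic, that row should be assigned the corresponding group name.
Keyword lists for the Topic column: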
Python_Group = ['Python','Pandas','tkinter']
SQL_Group = ['SQL', 'Select']
Devices_Group = ['Computer','Mobile']
I tried the code below:
df['Groups'] = [
'Python Group' if "Python" in x
else 'Python Group' if "Pandas" in x
else 'Python Group' if "tkinter" in x
else 'SQL Group' if "SQL" in x
else 'Devices Group' if "Computer" in x
else 'Devices Group' if "Mobile" in x
else '000'
for x in df['Topic']]
print(df)
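The code above gives me the output I want, but I would like something shorter and faster: the DataFrame has 2MM+ records, and writing 1k+ lines of code to define the groups is not practical.
Is there a way to make use of the keyword lists for the Topic column?
Or is there a custom function that could help with this iteration?
Code 2: another attempt, written after consulting Stack Overflow experts: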
d = pd.read_excel('Map.xlsx').to_dict('list')
keyword_groups = {x:k for k, v in d.items() for x in v}
pat = '({})'.format('|'.join(keyword_groups)) #This line is giving an error
df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
.map(keyword_groups)
.fillna('000'))
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-543675c0b403> in <module>
3
4 keyword_groups = {x:k for k, v in d.items() for x in v}
----> 5 pat = '({})'.format('|'.join(keyword_groups))
6 pat
TypeError: sequence item 5: expected str instance, float found
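Thanks for your help.
You can do this with np.select. np.select takes three arguments: a list of conditions, a list of results, and a default value used when none of the conditions match.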
import numpy as np

Python_Group = ['Python','Pandas','tkinter']
SQL_Group = ['SQL', 'Select']
Devices_Group = ['Computer','Mobile']
conditions = [
df['Topic'].str.contains('|'.join(Python_Group))
,df['Topic'].str.contains('|'.join(SQL_Group))
,df['Topic'].str.contains('|'.join(Devices_Group))
]
results = [
"Python_Group"
,"SQL_Group"
,"Devices_Group"
]
df['Groups'] = np.select(conditions, results, '000')
#output:
Topic Count Groups
0 This is Python 39 Python_Group
1 This is SQL 6 SQL_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 SQL Working 85 SQL_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
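With many groups, the conditions and results lists can be generated from a single mapping rather than written out by hand. The following is a minimal sketch, assuming the groups are kept in a dict called `groups` (a hypothetical name, not from the original answer) and that keywords should be matched literally, which is why `re.escape` is applied. The sample DataFrame is an abbreviated version of the question's data plus one non-matching row.

```python
import re

import numpy as np
import pandas as pd

# Abbreviated sample data modeled on the question, plus one non-matching row.
df = pd.DataFrame({
    "Topic": ["This is Python", "This is SQL", "import tkinter",
              "Computer", "Mobile Work", "Unrelated text"],
    "Count": [39, 6, 81, 20, 55, 1],
})

# Hypothetical mapping of group name -> keywords.
groups = {
    "Python_Group": ["Python", "Pandas", "tkinter"],
    "SQL_Group": ["SQL", "Select"],
    "Devices_Group": ["Computer", "Mobile"],
}

# One boolean condition per group; re.escape keeps each keyword literal.
conditions = [
    df["Topic"].str.contains("|".join(map(re.escape, keywords)), na=False)
    for keywords in groups.values()
]

# np.select uses the first condition that is True, else the default.
df["Groups"] = np.select(conditions, list(groups), default="000")
print(df)
```

Because np.select takes the first matching condition, the order of the entries in `groups` decides which group wins when a topic contains keywords from more than one group.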
One approach is to maintain your groups and keywords in a dict:
d = {'Python_Group': ['Python','Pandas','tkinter'],
'SQL_Group': ['SQL', 'Select'],
'Devices_Group': ['Computer','Mobile']}
From here, you can easily invert it into a "keyword: group" dict.
keyword_groups = {x:k for k, v in d.items() for x in v}
# {'Python': 'Python_Group',
# 'Pandas': 'Python_Group',
# 'tkinter': 'Python_Group',
# 'SQL': 'SQL_Group',
# 'Select': 'SQL_Group',
# 'Computer': 'Devices_Group',
# 'Mobile': 'Devices_Group'}
You can then use Series.str.extract to find these keywords using regex and map them to the correct group. Use fillna to catch any rows that match no group.
pat = '({})'.format('|'.join(keyword_groups))
df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
.map(keyword_groups)
.fillna('000'))
[out]
Topic Count Groups
0 This is Python 39 Python_Group
1 This is SQL 6 SQL_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 SQL Working 85 SQL_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
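The TypeError in the question ("expected str instance, float found") is consistent with the keyword columns in Map.xlsx having different lengths: pandas pads the shorter columns with NaN, which is a float, and '|'.join then fails on it. The sketch below shows one possible way around this, under the assumption that the sheet has one column per group. The layout of Map.xlsx is a guess, and the `mapping` DataFrame is a hypothetical stand-in for the result of pd.read_excel. It skips missing values while inverting the dict and escapes the keywords before building the pattern.

```python
import re

import numpy as np
import pandas as pd

# Stand-in for pd.read_excel('Map.xlsx'); assumed layout: one column per
# group, with shorter columns padded by NaN.
mapping = pd.DataFrame({
    "Python_Group": ["Python", "Pandas", "tkinter"],
    "SQL_Group": ["SQL", "Select", np.nan],
    "Devices_Group": ["Computer", "Mobile", np.nan],
})

d = mapping.to_dict("list")

# Invert to keyword -> group, dropping the NaN padding.
keyword_groups = {
    str(x): k for k, v in d.items() for x in v if pd.notna(x)
}

# Escape keywords so regex metacharacters are matched literally.
pat = "({})".format("|".join(map(re.escape, keyword_groups)))

df = pd.DataFrame({"Topic": ["This is Python", "SQL Working",
                             "Smart Mobile", "Nothing here"]})
df["Groups"] = (df["Topic"].str.extract(pat, expand=False)
                           .map(keyword_groups)
                           .fillna("000"))
print(df)
```

Note that str.extract returns only the first match in each string, so a topic containing keywords from two groups is assigned according to whichever keyword appears first in the text.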