Create new boolean fields based on specific terms appearing in a tokenized pandas dataframe

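Looping over a list of search terms, I need to create one boolean field per term, based on whether that term appears in a tokenized pandas Series.

List of search terms: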

terms = ['innovative', 'data', 'rf']

DataFrame:

import pandas as pd

df = pd.DataFrame(data={'job_description': [['innovative', 'data', 'science'],
                                            ['scientist', 'have', 'a', 'masters'],
                                            ['database', 'rf', 'innovative'],
                                            ['sciencedata', 'data', 'performance']]})

Desired output for df:

                    job_description  innovative   data     rf
0       [innovative, data, science]        True   True  False
1     [scientist, have, a, masters]       False  False  False
2        [database, rf, innovative]        True  False   True
3  [sciencedata, data, performance]       False   True  False

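Criteria:

  1. Only exact matches should be flagged (e.g., the flag for 'rf' should be True for the token 'rf' but False for 'performance')
  2. Each search term should get its own field, joined to the original df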
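
To make criterion 1 concrete, here is a small sketch in plain Python of the difference between a substring test and an exact token test. The variable names are only for illustration:

```python
tokens = ['sciencedata', 'data', 'performance']

# Substring test: 'rf' is found inside 'performance', which is NOT wanted.
substring_hit = any('rf' in tok for tok in tokens)

# Exact test: list membership compares whole tokens only.
exact_hit = 'rf' in tokens

print(substring_hit, exact_hit)
```

Any solution should behave like `exact_hit` here, not like `substring_hit`.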

What I've tried: this failed because it creates a boolean for every token in the Series:

df['innovative'] = df['job_description'].explode().str.contains(r'innovative').groupby(level=-1).agg(list)

Failed:

df['innovative'] = df['job_description'].str.contains('innovative').astype(int, errors='ignore')

Failed:

df.loc[df['job_description'].str.contains(terms)] = 1

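Failed: I tried to implement what is documented here (), but could not adapt it to correctly create the new fields or flags.

Thanks for any help you can provide!

It seems you could use a nested list comprehension to evaluate whether each term is present in each row, and assign the resulting lists to columns in df: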

df[terms] = [[any(w==term for w in lst) for term in terms] for lst in df['job_description']]

Output:

                    job_description  innovative   data     rf
0       [innovative, data, science]        True   True  False
1     [scientist, have, a, masters]       False  False  False
2        [database, rf, innovative]        True  False   True
3  [sciencedata, data, performance]       False   True  False
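
If it helps, below is a self-contained variant of the same idea, as a rough sketch rather than a definitive implementation. It assumes every token is a hashable string, and it converts each row to a set once so that `term in token_set` is an exact-match lookup. The helper name `flag_terms` and its `column` parameter are made up for this example:

```python
import pandas as pd

def flag_terms(df, terms, column='job_description'):
    """Return a copy of df with one boolean column per term (exact token match)."""
    rows = []
    for tokens in df[column]:
        token_set = set(tokens)  # build the set once per row
        rows.append([term in token_set for term in terms])
    flags = pd.DataFrame(rows, columns=terms, index=df.index)
    return df.join(flags)

terms = ['innovative', 'data', 'rf']
df = pd.DataFrame(data={'job_description': [['innovative', 'data', 'science'],
                                            ['scientist', 'have', 'a', 'masters'],
                                            ['database', 'rf', 'innovative'],
                                            ['sciencedata', 'data', 'performance']]})

result = flag_terms(df, terms)
print(result)
```

Unlike the one-liner, this leaves the original df untouched and returns a new frame, which may or may not be what you want.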

This should be fast:

e = df['job_description'].explode()
new_df = df.join(pd.concat([e.eq(t).rename(t) for t in terms], axis=1).groupby(level=0).any())

Output:

>>> new_df
                    job_description  innovative   data     rf
0       [innovative, data, science]        True   True  False
1     [scientist, have, a, masters]       False  False  False
2        [database, rf, innovative]        True  False   True
3  [sciencedata, data, performance]       False   True  False
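
One caveat worth noting: the `groupby(level=0)` step collapses the exploded tokens by index label, so it relies on the index labels of df being unique. A possible workaround, sketched under that assumption only, is to work on a temporary 0..n-1 index and restore the original labels afterwards. The helper name `flag_terms_exploded` is hypothetical:

```python
import pandas as pd

def flag_terms_exploded(df, terms, column='job_description'):
    """Explode-based flags that also work when df.index has duplicate labels."""
    tmp = df.reset_index(drop=True)      # temporary unique positional index
    e = tmp[column].explode()            # one row per token, label repeated
    flags = (pd.concat([e.eq(t).rename(t) for t in terms], axis=1)
               .groupby(level=0).any())  # back to one row per original row
    out = tmp.join(flags)
    out.index = df.index                 # restore the original labels
    return out

terms = ['innovative', 'data', 'rf']
dup_df = pd.DataFrame(
    data={'job_description': [['innovative', 'data', 'science'],
                              ['scientist', 'have', 'a', 'masters'],
                              ['database', 'rf', 'innovative'],
                              ['sciencedata', 'data', 'performance']]},
    index=['a', 'a', 'b', 'b'])          # deliberately non-unique index

flagged = flag_terms_exploded(dup_df, terms)
print(flagged)
```

With the default RangeIndex from the question, this extra step is not needed.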