Create new boolean fields based on specific terms appearing in a tokenized pandas dataframe
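Looping over a list of search terms, I need to create one boolean field per term, indicating whether that term appears in a tokenized pandas Series.
List of search terms: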
terms = ['innovative', 'data', 'rf']
Dataframe:
import pandas as pd
df = pd.DataFrame(data={'job_description': [['innovative', 'data', 'science'],
                                            ['scientist', 'have', 'a', 'masters'],
                                            ['database', 'rf', 'innovative'],
                                            ['sciencedata', 'data', 'performance']]})
Desired output for df:
                    job_description  innovative   data     rf
0       [innovative, data, science]        True   True  False
1     [scientist, have, a, masters]       False  False  False
2        [database, rf, innovative]        True  False   True
3  [sciencedata, data, performance]       False   True  False
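Criteria:
- Only exact matches should be flagged (e.g. the flag for 'rf' should return True for the token 'rf' but False for 'performance')
- Each search term should get its own field, joined to the original df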
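To make the exact-match criterion concrete, here is a minimal sketch in plain Python that contrasts membership in a token list with substring search on a single string. The variable names are illustrative only and are not part of the original question.

```python
tokens = ['sciencedata', 'data', 'performance']

# Membership in a list compares whole tokens, so only exact matches count.
exact_rf = 'rf' in tokens        # no token equals 'rf'
exact_data = 'data' in tokens    # the token 'data' is present

# Substring search on a single string also matches inside longer words,
# which is the behaviour the criteria rule out.
substring_rf = 'rf' in 'performance'
```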
What I've tried:
Fails, because it creates a boolean for every token in the Series:
df['innovative'] = df['job_description'].explode().str.contains(r'innovative').groupby(level=-1).agg(list)
Fails:
df['innovative'] = df['job_description'].str.contains('innovative').astype(int, errors='ignore')
Fails:
df.loc[df['job_description'].str.contains(terms)] = 1
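Fails: I tried to implement what is documented here (), but could not adapt it to create the new fields or flags correctly.
Thanks for any help you can provide!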
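It seems you could use a nested list comprehension to evaluate whether each term exists in each row, then assign the resulting nested list to columns in df: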
df[terms] = [[any(w==term for w in lst) for term in terms] for lst in df['job_description']]
Output:
                    job_description  innovative   data     rf
0       [innovative, data, science]        True   True  False
1     [scientist, have, a, masters]       False  False  False
2        [database, rf, innovative]        True  False   True
3  [sciencedata, data, performance]       False   True  False
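If the term list grows, one possible variation on the list comprehension is to turn each token list into a set once, so every term lookup is a hash check rather than a scan. The sketch below assumes that idea; the helper name `add_term_flags` is hypothetical and not from the original answer.

```python
import pandas as pd

def add_term_flags(frame, column, terms):
    """Return a copy of frame with one boolean column per term (exact token match)."""
    out = frame.copy()
    # Build each row's token set once and reuse it for every term.
    token_sets = [set(tokens) for tokens in out[column]]
    for term in terms:
        out[term] = [term in token_set for token_set in token_sets]
    return out

terms = ['innovative', 'data', 'rf']
df = pd.DataFrame(data={'job_description': [['innovative', 'data', 'science'],
                                            ['scientist', 'have', 'a', 'masters'],
                                            ['database', 'rf', 'innovative'],
                                            ['sciencedata', 'data', 'performance']]})

flagged = add_term_flags(df, 'job_description', terms)
```

Because the helper works on a copy, the original df is left unchanged and the flags can be regenerated with a different term list.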
This should be fast:
e = df['job_description'].explode()
new_df = df.copy()  # new_df must exist before columns can be assigned to it
new_df[terms] = pd.concat([e.eq(t).rename(t) for t in terms], axis=1).groupby(level=0).any()
Output:
>>> new_df
                    job_description  innovative   data     rf
0       [innovative, data, science]        True   True  False
1     [scientist, have, a, masters]       False  False  False
2        [database, rf, innovative]        True  False   True
3  [sciencedata, data, performance]       False   True  False
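As a sanity check on the explode approach, the sketch below compares it against plain membership tests on a small frame. The row with an empty token list is a hypothetical addition, included to show that such a row ends up with False in every column.

```python
import pandas as pd

terms = ['innovative', 'data', 'rf']
df = pd.DataFrame({'job_description': [['innovative', 'data', 'science'],
                                       [],  # hypothetical row with no tokens
                                       ['database', 'rf', 'innovative']]})

# explode() yields one row per token and repeats the original index label;
# an empty list becomes a single NaN row, which never equals a term.
e = df['job_description'].explode()
flags = pd.concat([e.eq(t).rename(t) for t in terms], axis=1).groupby(level=0).any()

# Reference result from plain membership tests, for comparison.
expected = [[t in tokens for t in terms] for tokens in df['job_description']]
same = flags.values.tolist() == expected
```

Grouping on `level=0` collapses the per-token rows back to one row per original index label, which is what lets the result line up with df when it is assigned or joined.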