Use a dictionary to find/replace exact terms in a tokenized pandas series

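Using a dictionary, I need to find and replace terms in a pandas series under the following conditions:

  1. The dictionary values replace any occurrence of the dictionary keys in the pandas series (e.g., given 'mastersphd': 'masters phd', the result would be 'masters phd' wherever 'mastersphd' appears)
  2. Record integrity is preserved (i.e., I can't use a bag-of-words approach, because each unique record must stay intact)
  3. Only exact matches should be replaced (e.g., if the key:value pair is 'rf': 'random forest', the replacement should not turn 'performance' into 'perandom forestormance'; regex=True is evidently what causes this)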

Data: term_fixes is the dictionary, and df['job_description'] is the tokenized series of interest.

import pandas as pd

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}


df = pd.DataFrame(data={'job_description': [['knowledge', 'of', 'algorithm', 'like', 'rf'],
                                            ['must', 'have', 'a', 'mastersphd'],
                                            ['trustworthy', 'and', 'possess', 'curiosity'],
                                            ['we', 'realise', 'performance', 'is', 'key']]})

**Note: I have also tried (unsuccessfully) an untokenized data structure, but I prefer the tokenized one because I have more NLP to do.

df = pd.DataFrame(data={'job_description': ['knowledge of algorithm like rf',
                                            'must have a mastersphd',
                                            'must be trustworthy and possess curiosity',
                                            'we realise performance is critical']})

**Desired outcome (note that the 'rf' in 'performance' is not replaced by 'random forest'): df['job_description']

0    ['knowledge' 'of' 'algorithm' 'like' 'random' 'forest']
1                        ['must' 'have' 'a' 'masters' 'phd']
2          ['must' 'be' 'ethical' 'and' 'possess' 'curious']
3             ['we' 'realize' 'performance' 'is' 'critical']

I have tried many approaches.

Failed: df['job_description'].replace(list(term_fixes.keys()), list(term_fixes.values()), regex=False, inplace=True)

Failed: df['job_description'].replace(dict(zip(list(term_fixes.keys()), list(term_fixes.values()))), regex=False, inplace=True)

Failed: df['job_description'] = df['job_description'].str.replace(term_fixes, regex=False)

Failed: df['job_description'] = df['job_description'].str.replace(str(term_fixes.keys()), str(term_fixes.values()), regex=True)

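The closest I got was:

df['job_description'] = df['job_description'].replace(term_fixes, regex=True)

However, regex=True replaces any substring match (as in the 'rf' and 'performance' example above). Unfortunately, changing the flag to regex=False replaces nothing at all. I looked through the documentation for another parameter I could use, but had no luck. Note that this uses the untokenized structure.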
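The two behaviours can be reproduced on a small scale. This is only a minimal sketch: the series s and the one-entry dictionary are cut-down stand-ins for the real data, and the names substring and whole_cell are purely illustrative.

```python
import pandas as pd

term_fixes = {'rf': 'random forest'}
s = pd.Series(['knowledge of algorithm like rf',
               'we realise performance is critical'])

# regex=True treats each key as a pattern, so 'rf' also matches inside 'performance'
substring = s.replace(term_fixes, regex=True)

# regex=False compares each whole cell value against the keys; no cell equals 'rf',
# so nothing is replaced
whole_cell = s.replace(term_fixes, regex=False)

print(substring.tolist())
print(whole_cell.tolist())
```

So neither flag on its own gives a whole-word replacement inside a longer string.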

Any help would be greatly appreciated. Thank you!

You can use something like the following for the untokenized data.

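for k in term_fixes:
    # Match k only at the start/end of the string or when delimited by spaces.
    # regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching.
    df['job_description'] = df['job_description'].str.replace(
        r'(^|(?<= )){}((?= )|$)'.format(k), term_fixes[k], regex=True)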

print(df)
                             job_description
0  knowledge of algorithm like random forest
1                    must have a masters phd
2        must be ethical and possess curious
3         we realize performance is critical
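A possible variation on the loop above is to build one combined pattern and replace everything in a single pass. This is only a sketch under some assumptions: it swaps the space-based lookarounds for \b word boundaries (which also treat punctuation as a delimiter, so results can differ on punctuated text), it escapes the keys with re.escape in case they contain regex metacharacters, and the names keys, pattern and fixed are illustrative.

```python
import re
import pandas as pd

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}

df = pd.DataFrame(data={'job_description': ['knowledge of algorithm like rf',
                                            'must have a mastersphd',
                                            'must be trustworthy and possess curiosity',
                                            'we realise performance is critical']})

# Longest keys first, so a shorter key cannot pre-empt a longer one in the alternation
keys = sorted(term_fixes, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')

# The callable looks up the replacement for whichever key matched
fixed = df['job_description'].str.replace(
    pattern, lambda m: term_fixes[m.group(0)], regex=True)

print(fixed.tolist())
```

With a large dictionary this avoids rescanning the column once per key, at the cost of a slightly less readable pattern.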

Using the "tokenized" version of your df:
df['job_description'] = df['job_description'].explode().replace(term_fixes).groupby(level=-1).agg(list)

# explode to get single terms per "cell"
# replace to replace the terms in "term_fixes"
# groupby to reverse the previous explode and return to a column of lists

                                   job_description
0  [knowledge, of, algorithm, like, random forest]
1                     [must, have, a, masters phd]
2                 [ethical, and, possess, curious]
3              [we, realize, performance, is, key]
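To see why this chain only touches whole tokens, it can help to run the three steps separately. A minimal sketch on a cut-down, hypothetical two-row frame (the names exploded, replaced and regrouped are just for illustration):

```python
import pandas as pd

term_fixes = {'rf': 'random forest', 'realise': 'realize'}
df = pd.DataFrame({'job_description': [['knowledge', 'of', 'rf'],
                                       ['we', 'realise', 'performance']]})

# One token per row; the original row label is repeated for each token
exploded = df['job_description'].explode()

# Without regex=True, replace compares whole values, so 'performance' is left alone
replaced = exploded.replace(term_fixes)

# Collect the tokens back into one list per original row label
regrouped = replaced.groupby(level=0).agg(list)

print(exploded.index.tolist())
print(regrouped.tolist())
```

This relies on the index being unique; with duplicate index labels the final groupby would merge tokens from different rows.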

If you also need the new terms split on whitespace, you can add another intermediate step, .str.split().explode(), before the final groupby:
df['job_description'] = df['job_description'].explode().replace(term_fixes).str.split().explode().groupby(level=-1).agg(list)

                                    job_description
0  [knowledge, of, algorithm, like, random, forest] # random forest is now split
1                     [must, have, a, masters, phd] # masters phd is now split
2                  [ethical, and, possess, curious]
3               [we, realize, performance, is, key]
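If you would rather not reshape the series, the same result can probably be had with a plain per-row lookup. This is a sketch of an alternative, not the method above: fix_tokens is a hypothetical helper, and it assumes every row holds a list of strings.

```python
import pandas as pd

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}

df = pd.DataFrame(data={'job_description': [['knowledge', 'of', 'algorithm', 'like', 'rf'],
                                            ['must', 'have', 'a', 'mastersphd'],
                                            ['trustworthy', 'and', 'possess', 'curiosity'],
                                            ['we', 'realise', 'performance', 'is', 'key']]})


def fix_tokens(tokens, fixes):
    """Replace each token that is exactly a key of `fixes`, splitting the result on whitespace."""
    out = []
    for tok in tokens:
        # dict.get is an exact lookup, so 'performance' never matches the key 'rf'
        out.extend(fixes.get(tok, tok).split())
    return out


df['job_description'] = df['job_description'].map(lambda tokens: fix_tokens(tokens, term_fixes))

print(df['job_description'].tolist())
```

Because each row is handled on its own, this does not depend on the index and leaves empty token lists as empty lists.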