Use a dictionary to find/replace exact terms in a tokenized pandas series
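Using a dictionary, I need to find and replace terms in a pandas series, subject to the following conditions:
- Each dictionary key found in the series is replaced by its dictionary value (e.g., given 'mastersphd': 'masters phd', 'masters phd' should appear wherever 'mastersphd' occurred).
- Record integrity is preserved (i.e., a bag-of-words approach won't work, because each unique record must stay intact).
- Only exact matches are replaced (e.g., if the key:value pair is 'rf': 'random forest', the replacement must not turn 'performance' into 'perandom forestormance'; regex=True is evidently what causes this).
Data: term_fixes is the dictionary, and df['job_description'] is the tokenized series of interest.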
import pandas as pd

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}
df = pd.DataFrame(data={'job_description': [['knowledge', 'of', 'algorithm', 'like', 'rf'],
                                            ['must', 'have', 'a', 'mastersphd'],
                                            ['trustworthy', 'and', 'possess', 'curiosity'],
                                            ['we', 'realise', 'performance', 'is', 'key']]})
Note: I also tried (unsuccessfully) an untokenized data structure, but I prefer the tokenized one because I have more NLP work to do.
df = pd.DataFrame(data={'job_description': ['knowledge of algorithm like rf',
                                            'must have a mastersphd',
                                            'must be trustworthy and possess curiosity',
                                            'we realise performance is critical']})
Desired outcome (note that the 'rf' inside 'performance' is not replaced by 'random forest'):
df['job_description']
0 ['knowledge' 'of' 'algorithm' 'like' 'random' 'forest']
1 ['must' 'have' 'a' 'masters' 'phd']
2 ['must' 'be' 'ethical' 'and' 'possess' 'curious']
3 ['we' 'realize' 'performance' 'is' 'critical']
I have tried many approaches.
Failed: df['job_description'].replace(list(term_fixes.keys()), list(term_fixes.values()), regex=False, inplace=True)
Failed: df['job_description'].replace(dict(zip(list(term_fixes.keys()), list(term_fixes.values()))), regex=False, inplace=True)
Failed: df['job_description'] = df['job_description'].str.replace(term_fixes, regex=False)
Failed: df['job_description'] = df['job_description'].str.replace(str(term_fixes.keys()), str(term_fixes.values()), regex=True)
The closest I got was:
df['job_description'] = df_jobs['job_description'].replace(term_fixes, regex=True)
However, regex=True replaces any match, including substrings (as in the 'rf' and 'performance' example above). Unfortunately, changing the flag to regex=False replaces nothing at all. I looked through the documentation for another parameter I could use, but had no luck. Note that this uses the untokenized structure.
Any help would be appreciated. Thanks!
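To make the behavior described in the question concrete, here is a minimal sketch on a three-row series (the series `s` and the one-entry dictionary are made up for illustration). It shows that `Series.replace` with `regex=True` substitutes a key wherever it occurs inside a string, while `regex=False` only replaces cells whose entire value equals a key, which is why nothing changes on full sentences.

```python
import pandas as pd

term_fixes = {'rf': 'random forest'}

# Illustrative data: two full sentences and one cell that is exactly 'rf'.
s = pd.Series(['knowledge of algorithm like rf',
               'we realise performance is critical',
               'rf'])

# regex=True: each key is a pattern, replaced anywhere inside the string,
# so the 'rf' inside 'performance' is rewritten as well.
as_regex = s.replace(term_fixes, regex=True)

# regex=False: a cell is replaced only if its whole value equals a key,
# so the two sentences are left untouched.
as_literal = s.replace(term_fixes, regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

So the fix has to either anchor the pattern to term boundaries (for untokenized strings) or get each token into its own cell (for tokenized lists).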
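You can use something like the following for the untokenized data.
for k in term_fixes:
    # Match k only when it is delimited by spaces or the start/end of the string.
    # regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching.
    df['job_description'] = df['job_description'].str.replace(
        r'(^|(?<= )){}((?= )|$)'.format(k), term_fixes[k], regex=True)
print(df)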
job_description
0 knowledge of algorithm like random forest
1 must have a masters phd
2 must be ethical and possess curious
3 we realize performance is critical
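As a variation on the loop above, the following sketch does the same job in a single pass by building one alternation pattern from all keys. It is not the answerer's method: the use of `\b` word boundaries (instead of the space-based lookarounds), `re.escape`, and the longest-first key ordering are my own assumptions about what is wanted. Note that `\b` also treats punctuation as a boundary, so 'rf-based' would become 'random forest-based'.

```python
import re
import pandas as pd

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}

df = pd.DataFrame(data={'job_description': ['knowledge of algorithm like rf',
                                            'must have a mastersphd',
                                            'must be trustworthy and possess curiosity',
                                            'we realise performance is critical']})

# Longest keys first, so a longer key is tried before any shorter key it contains.
keys = sorted(term_fixes, key=len, reverse=True)

# re.escape guards against keys containing regex metacharacters;
# \b ... \b restricts matches to whole words.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')

# The callable receives each match object and looks up its replacement.
df['job_description'] = df['job_description'].str.replace(
    pattern, lambda m: term_fixes[m.group(0)], regex=True)

print(df)
```

One pass over the column may matter when the dictionary is large, since the loop version rescans every row once per key.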
Using the "tokenized" version of your df:
df['job_description'] = df['job_description'].explode().replace(term_fixes).groupby(level=-1).agg(list)
# explode to get single terms per "cell"
# replace to replace the terms in "term_fixes"
# groupby to reverse the previous explode and return to a column of lists
job_description
0 [knowledge, of, algorithm, like, random forest]
1 [must, have, a, masters phd]
2 [ethical, and, possess, curious]
3 [we, realize, performance, is, key]
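If the chained one-liner is hard to follow, this sketch breaks it into its three steps on a smaller, made-up two-row frame so the intermediate results can be inspected. The variable names `exploded`, `replaced` and `regrouped` are just for illustration, and `level=0` is used here, which for a single-level index refers to the same level as `level=-1`.

```python
import pandas as pd

term_fixes = {'rf': 'random forest'}
df = pd.DataFrame(data={'job_description': [['like', 'rf'],
                                            ['performance', 'is']]})

# Step 1: one token per row; the original row label is repeated in the index.
exploded = df['job_description'].explode()
print(exploded.index.tolist())   # row labels, repeated once per token

# Step 2: each cell now holds exactly one token, so the default (non-regex)
# replace matches whole tokens only and leaves 'performance' alone.
replaced = exploded.replace(term_fixes)

# Step 3: collect the tokens back into one list per original row label.
regrouped = replaced.groupby(level=0).agg(list)
print(regrouped.tolist())
```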
If you need the new terms to be split on whitespace as well, you can add another intermediate step, .str.split().explode(), before the final groupby:
df['job_description'] = df['job_description'].explode().replace(term_fixes).str.split().explode().groupby(level=-1).agg(list)
job_description
0 [knowledge, of, algorithm, like, random, forest] # random forest is now split
1 [must, have, a, masters, phd] # masters phd is now split
2 [ethical, and, possess, curious]
3 [we, realize, performance, is, key]
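For comparison, here is a plain-Python sketch that produces the same split result without explode/groupby, by mapping each token through the dictionary row by row. The helper name `fix_tokens` is hypothetical. One difference worth knowing about: an empty token list stays an empty list here, whereas `explode()` turns an empty list into NaN.

```python
import pandas as pd

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}

df = pd.DataFrame(data={'job_description': [['knowledge', 'of', 'algorithm', 'like', 'rf'],
                                            ['must', 'have', 'a', 'mastersphd'],
                                            ['trustworthy', 'and', 'possess', 'curiosity'],
                                            ['we', 'realise', 'performance', 'is', 'key']]})


def fix_tokens(tokens):
    """Replace exact-match tokens and split multi-word replacements."""
    fixed = []
    for tok in tokens:
        # dict.get is an exact lookup, so 'performance' never matches 'rf';
        # split() breaks 'random forest' into two separate tokens.
        fixed.extend(term_fixes.get(tok, tok).split())
    return fixed


df['job_description'] = df['job_description'].apply(fix_tokens)
print(df)
```

Whether this or the explode/groupby chain is faster will depend on the data, so it is worth timing both on a real sample.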