Name Entities Replacement - Pandas Dataframe with text column - Preprocessing

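I have a dataframe with a column of sentences (text). I want to perform named entity replacement: I have a list whose elements hold stock information: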

stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]

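I want to find the 'symbol' and 'company' values in the sentences of my dataframe, so that each symbol is replaced with '<TKR>' and each company with '<CMPY>'. The function must be applied to all rows.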

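I am looking for a function that takes a dataframe with tokenized text and returns the processed text. It is important to match the whole company name, not just one element of the name. As for the symbols, I know this is a bit harder, because it is easy to find 'V' (the Visa symbol) in ordinary text, but I am here to hear some good workarounds.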

Let's start with an example:

print(dataframe['text'])

Output:

0  [GS is the main company of Dow Jones]
1  [Once again Visa surprises all]*
2  [Johnson & Johnson's vaccine is the best one]

I would like a new column with the following result:

0  [<TKR> is the main company of Dow Jones]
1  [Once again <CMPY> surprises all]*
2  [<CMPY>'s vaccine is the best one] 

Row 1 --> tricky, because the company's real name is 'Visa Inc.' and not just Visa... I really don't know how to handle it.

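I don't know whether it is better to work with tokenized sentences: in that case I would also need to tokenize the "company" values, e.g. Goldman Sachs.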

You can use

import pandas as pd
import re

stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]

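def process_term(term):
    # Build a "starts with" pattern: the first word is required, and each
    # following word can only match if all the words before it matched.
    t = [re.escape(x) for x in term.split()]  # escape every word, including the first
    first = t[0]
    if len(t) > 1:
        # Raw strings avoid the invalid "\s" escape warning in recent Python versions
        first = first + r"(?:\s+{}{}".format(r"(?:\s+".join(t[1:]), ")?" * (len(t)-1))
    return first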

dataframe = pd.DataFrame({'text':['GS is the main company of Dow Jones', 'Once again Visa surprises all', "Johnson & Johnson's vaccine is the best one"]})
rx_symbol = r"\b(?:{})\b".format("|".join([x["symbol"] for x in stocks]))
rx_company = r"\b(?:{})(?!\w)".format("|".join(sorted([process_term(x["company"]) for x in stocks], key=len, reverse=True)))
dataframe['new_text'] = dataframe['text'].str.replace(rx_symbol, r'<TKR>', regex=True)
dataframe['new_text'] = dataframe['new_text'].str.replace(rx_company, r'<CMPY>', regex=True)
print(dataframe)
# =>                                           text                                new_text
# => 0          GS is the main company of Dow Jones  <TKR> is the main company of Dow Jones
# => 1                Once again Visa surprises all         Once again <CMPY> surprises all
# => 2  Johnson & Johnson's vaccine is the best one        <CMPY>'s vaccine is the best one
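Since the question asks for a function, here is a minimal sketch of one way to wrap the same idea in a reusable callable. It is a variation, not the code above: it combines both patterns into a single regex with named groups so each row is scanned once. The names `prefix_pattern`, `build_replacer` and `replace_entities` are made up for this illustration, and the `stocks` list is trimmed to the two keys the code needs.

```python
import re

# Trimmed copy of the `stocks` list (only the keys used here).
stocks = [
    {"symbol": "GS", "company": "Goldman Sachs"},
    {"symbol": "V", "company": "Visa Inc."},
    {"symbol": "JNJ", "company": "Johnson & Johnson"},
]

def prefix_pattern(name):
    # First word is required; each later word is optional and can only
    # match if the words before it matched ("A", "A B", "A B C", not "A C").
    words = [re.escape(w) for w in name.split()]
    pattern = words[-1]
    for word in reversed(words[:-1]):
        pattern = r"{}(?:\s+{})?".format(word, pattern)
    return pattern

def build_replacer(stocks):
    # Longer alternatives go first so they are tried before shorter ones.
    companies = sorted((prefix_pattern(s["company"]) for s in stocks), key=len, reverse=True)
    symbols = sorted((re.escape(s["symbol"]) for s in stocks), key=len, reverse=True)
    # Company alternatives are tried before symbols, so "Visa" is not read as "V".
    rx = re.compile(
        r"\b(?:(?P<cmpy>{})(?!\w)|(?P<tkr>{})\b)".format("|".join(companies), "|".join(symbols))
    )

    def replace_entities(text):
        # Pick the placeholder according to which named group matched.
        return rx.sub(lambda m: "<CMPY>" if m.group("cmpy") else "<TKR>", text)

    return replace_entities

replace_entities = build_replacer(stocks)
print(replace_entities("Johnson & Johnson's vaccine is the best one"))
# => <CMPY>'s vaccine is the best one
```

With pandas this could be applied row by row, e.g. dataframe['new_text'] = dataframe['text'].apply(replace_entities). A single pass also means the second replacement never sees the placeholders inserted by the first.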

In short:

  • Create two regexes from the symbol and company data and run two replace operations
  • The symbol regex is simple: it looks like \b(?:GS|JPM|TRV|V|AMGN|JNJ)\b and matches any of the alternatives in the group as a whole word
  • The company regex follows the approach described in "Regular expression to match A, AB, ABC, but not AC" ("starts with"). It looks like \b(?:The(?:\s+Travelers(?:\s+Companies)?)?|Johnson(?:\s+\&(?:\s+Johnson)?)?|JPMorgan(?:\s+Chase)?|Goldman(?:\s+Sachs)?|Visa(?:\s+Inc\.)?|Amgen)(?!\w): each company name is re.escaped, and each subsequent word only matches (optionally) if the previous word of the term is matched. Note that the right-hand word boundary is set with a (?!\w) lookahead, because \b would prevent a match if the term ends with a non-word character.
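To make the last point concrete, here is a small sketch comparing the two right-hand boundaries on a name that ends with a dot. The sample sentence is made up for the illustration; only the standard `re` module is used.

```python
import re

text = "Visa Inc. beat expectations"

# With a trailing \b the full name cannot match: there is no word boundary
# between "." and the following space, so the regex backtracks to "Visa".
with_b = re.compile(r"\bVisa(?:\s+Inc\.)?\b")

# With (?!\w) the match may end right after the ".", as long as the next
# character is not a word character.
with_lookahead = re.compile(r"\bVisa(?:\s+Inc\.)?(?!\w)")

print(with_b.sub("<CMPY>", text))          # => <CMPY> Inc. beat expectations
print(with_lookahead.sub("<CMPY>", text))  # => <CMPY> beat expectations
```

The left-hand side can stay as \b because every company name in the list starts with a word character.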