Named Entity Replacement - Pandas DataFrame with text column - Preprocessing
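I have a dataframe with a column of sentences (text).
I want to perform named entity replacement: I have a list whose elements hold stock information: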
stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]
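I want to find the 'symbol' and 'company' values in the sentences of my dataframe, so that each symbol is replaced with '<TKR>' and each company with '<CMPY>'. The function must be applied to all rows.
I am looking for a function that receives the dataframe with tokenized text and returns the processed text.
It is important to match the whole company name, not just one element of it. As for the symbols, I know it is a bit harder, because 'V' (Visa's symbol) is easy to find in ordinary text, but I am here to hear some good workarounds.
Let's start with an example: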
print(dataframe['text'])
Output:
0 [GS is the main company of Dow Jones]
1 [Once again Visa surprises all]*
2 [Johnson & Johnson's vaccine is the best one]
I want a new column with the following result:
0 [<TKR> is the main company of Dow Jones]
1 [Once again <CMPY> surprises all]*
2 [<CMPY>'s vaccine is the best one]
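Row 1 --> tricky, because the company's real name is 'Visa Inc.' and not just Visa... I really don't know how to handle it.
I don't know whether it is better to work with tokenized sentences: in that case I would also need to tokenize the 'company' values, for example Goldman Sachs.
You can use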
import pandas as pd
import re
stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]
def process_term(term):
    # The first word is required; each later word is optional, but only in order
    t = term.split()
    first = re.escape(t[0])
    if len(t) > 1:
        first = first + r"(?:\s+{}{}".format(r"(?:\s+".join(map(re.escape, t[1:])), ")?" * (len(t) - 1))
    return first
dataframe = pd.DataFrame({'text':['GS is the main company of Dow Jones', 'Once again Visa surprises all', "Johnson & Johnson's vaccine is the best one"]})
rx_symbol = r"\b(?:{})\b".format("|".join([x["symbol"] for x in stocks]))
rx_company = r"\b(?:{})(?!\w)".format("|".join(sorted([process_term(x["company"]) for x in stocks], key=len, reverse=True)))
dataframe['new_text'] = dataframe['text'].str.replace(rx_symbol, r'<TKR>', regex=True)
dataframe['new_text'] = dataframe['new_text'].str.replace(rx_company, r'<CMPY>', regex=True)
>>> dataframe
# => text # new_text
# => 0 GS is the main company of Dow Jones <TKR> is the main company of Dow Jones
# => 1 Once again Visa surprises all Once again <CMPY> surprises all
# => 2 Johnson & Johnson's vaccine is the best one <CMPY>'s vaccine is the best one
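Since the question asks for a function, the steps above could be packaged roughly as follows. This is only a sketch: the name replace_entities and its column parameter are hypothetical, the stocks list is shortened to three entries, and escaping the symbols with re.escape is an extra precaution that the snippet above does not take.

```python
import re
import pandas as pd

def process_term(term):
    # The first word is required; each later word is optional, but only in order
    words = term.split()
    pattern = re.escape(words[0])
    if len(words) > 1:
        pattern += r"(?:\s+{}{}".format(
            r"(?:\s+".join(map(re.escape, words[1:])), ")?" * (len(words) - 1)
        )
    return pattern

def replace_entities(df, stocks, column="text"):
    """Return a Series with tickers replaced by <TKR> and company names by <CMPY>."""
    # Hypothetical helper: builds both regexes and runs the two replacements
    rx_symbol = r"\b(?:{})\b".format("|".join(re.escape(s["symbol"]) for s in stocks))
    terms = sorted((process_term(s["company"]) for s in stocks), key=len, reverse=True)
    rx_company = r"\b(?:{})(?!\w)".format("|".join(terms))
    out = df[column].str.replace(rx_symbol, "<TKR>", regex=True)
    return out.str.replace(rx_company, "<CMPY>", regex=True)

# Shortened stock list, for illustration only
stocks = [
    {"symbol": "GS", "company": "Goldman Sachs"},
    {"symbol": "V", "company": "Visa Inc."},
    {"symbol": "JNJ", "company": "Johnson & Johnson"},
]
df = pd.DataFrame({"text": [
    "GS is the main company of Dow Jones",
    "Once again Visa surprises all",
    "Johnson & Johnson's vaccine is the best one",
]})
df["new_text"] = replace_entities(df, stocks)
print(df["new_text"].tolist())
```

Returning a Series rather than modifying the dataframe in place leaves it to the caller to decide which column receives the result.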
In short:
- Create two regexes from the symbol and company data and run two replace operations.
- The symbol regex is simple: it looks like \b(?:GS|JPM|TRV|V|AMGN|JNJ)\b and matches any of the alternatives in the group as a whole word.
- The company regex follows the suffix approach described in Regular expression to match A, AB, ABC, but not AC. ("starts with"). It looks like \b(?:The(?:\s+Travelers(?:\s+Companies)?)?|Johnson(?:\s+\&(?:\s+Johnson)?)?|JPMorgan(?:\s+Chase)?|Goldman(?:\s+Sachs)?|Visa(?:\s+Inc\.)?|Amgen)(?!\w): each company name is re.escaped, and each subsequent word only matches (optionally) if the previous word of the term is matched. See the regex demo.
- Note that the right-hand word boundary is set with a (?!\w) lookahead, because \b would prevent a match when the term ends with a non-word character.
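To make the last two points concrete, here is a small check using only the re module. The sample sentences are made up for illustration, and the patterns are written out by hand in the shape that process_term produces. It also shows a side effect to be aware of: because only the first word is required, a lone leading "The" matches as well.

```python
import re

# Pattern in the shape produced by process_term("The Travelers Companies")
prefix = r"The(?:\s+Travelers(?:\s+Companies)?)?"
rx = re.compile(r"\b(?:{})(?!\w)".format(prefix))

# The full name and a leading part of it both match
full = rx.search("The Travelers Companies reported earnings").group()
partial = rx.search("The Travelers reported earnings").group()
# Caveat: the first word on its own is also a valid match
first_only = rx.search("The market fell").group()

# After a term ending in "." and followed by a space there is no \b,
# so the \b version finds nothing while the lookahead version matches
with_b = re.search(r"\bVisa\s+Inc\.\b", "Visa Inc. rose today")
with_lookahead = re.search(r"\bVisa\s+Inc\.(?!\w)", "Visa Inc. rose today")

print(full, "|", partial, "|", first_only)
print(with_b, "|", with_lookahead.group())
```

If replacing a bare "The" is not acceptable for your data, one option might be to make the second word mandatory for names that start with a common word; whether that is worth it depends on the texts being processed.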