Printing of abbreviations and hyphenated words
I need to identify all abbreviations and hyphenated words in my sentences, and print them as they are found. My code does not seem to identify them correctly.
import re
sentence_stream2 = df1['Open End Text']
for sent in sentence_stream2:
    abbs_ = re.findall(r'(?:[A-Z]\.)+', sent)      # abbreviations
    hypns_ = re.findall(r'\w+(?:-\w+)*', sent)     # hyphenated words
    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)
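One of the sentences in my corpus is:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
The output for this is: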
new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]
['DevOps', 'with', 'APIs', 'event-driven', 'architecture', 'using', 'cloud', 'Data', 'Analytics', 'environment', 'Self-service', 'BI']
The expected output is:
new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
['APIs','BI']
['event-driven','Self-service']
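Your rule for abbreviations does not match: it only finds capital letters that are each followed by a period. You want to find words with more than one consecutive capital letter, and a rule you could use for that is: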
abbs_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', sent) #abbreviations
This matches APIs and BI.
t = "DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"
import re
abbs_ = re.findall(r'(?:[A-Z]\.)+', t) #abbreviations
cap_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', t) #abbreviations
hypns_= re.findall(r'\w+-\w+', t) #hyphenated words fixed
print("new sentence:")
print(t)
print(abbs_)
print(cap_)
print(hypns_)
Output:
new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[] # your abbreviation rule - does not find any capital letter followed by .
['APIs', 'BI'] # cap_ rule
['event-driven', 'Self-service'] # fixed hyphen rule
This will most likely not find every abbreviation, for example those in
t = "Prof. Dr. S. Quakernack"
so you may need to tweak it against more data, for example with http://www.regex101.com
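As a rough illustration of that kind of tweak, here is a minimal sketch for period-terminated abbreviations such as "Prof." or "S.". The pattern and the name DOTTED_ABBR are assumptions made for this example, not part of the answer above, and the heuristic will also pick up short capitalized words that end a sentence (for example "Yes."), so it needs checking against real data.

```python
import re

# Heuristic (an assumption for illustration): one capital letter, up to three
# lowercase letters, then a period, e.g. "Prof.", "Dr.", "S.".
DOTTED_ABBR = re.compile(r'\b[A-Z][a-z]{0,3}\.')

t = "Prof. Dr. S. Quakernack"
dotted = DOTTED_ABBR.findall(t)
print(dotted)  # ['Prof.', 'Dr.', 'S.']
```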
I would suggest:
abbs_ = re.findall(r'\b[A-Z]+s?\b', sent) #abbreviations
hypns_ = re.findall(r'\w+(?:-\w+)*', sent) #hyphenated words
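But consider a sentence such as "As you know, I got all As in my course".
Is "As" an abbreviation? If not, you need to discard a single capital letter, whether or not it is followed by an s, and only collect runs of at least two capitals, optionally followed by an s, as in APIs. So: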
abbs_ = re.findall(r'\b(?:[A-Z][A-Z]+s?)\b', sent) #abbreviations
The \b is needed to make sure you do not harvest something like ImNotAGirl because of the AG pair in the middle.
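The effect of the word boundaries can be checked with a short comparison. The sample string below is made up for illustration.

```python
import re

s = "ImNotAGirl uses APIs"

# Without \b, the capital pair inside "ImNotAGirl" is picked up as well.
without_boundary = re.findall(r'[A-Z][A-Z]+s?', s)
# With \b, only whole words made of capitals (plus an optional s) match.
with_boundary = re.findall(r'\b[A-Z][A-Z]+s?\b', s)

print(without_boundary)  # ['AG', 'APIs']
print(with_boundary)     # ['APIs']
```

Requiring two capitals also means that "As" in "I got all As" is no longer reported.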
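Then you have to get the hyphenated words: a word (\w+) followed by at least one hyphen-plus-word sequence. The repeated group must be non-capturing, otherwise re.findall returns only the captured group instead of the whole match:
hypns_ = re.findall(r'\b\w+(?:-\w+)+\b', sent) #hyphenated words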
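Putting both rules together on the sentence from the question gives the expected output. This is a minimal sketch: the names ABBR, HYPHENATED and extract are hypothetical helpers introduced here. It also shows why the repeated hyphen group should be non-capturing, since re.findall returns only the group's text when a pattern contains exactly one capturing group.

```python
import re

# Whole words of two or more capitals, optionally followed by "s".
ABBR = re.compile(r'\b[A-Z][A-Z]+s?\b')
# A word followed by one or more "-word" parts (non-capturing group).
HYPHENATED = re.compile(r'\b\w+(?:-\w+)+\b')

def extract(sent):
    """Return (abbreviations, hyphenated words) found in one sentence."""
    return ABBR.findall(sent), HYPHENATED.findall(sent)

sent = ("DevOps with APIs & event-driven architecture using cloud "
        "Data Analytics environment Self-service BI")
abbs, hypns = extract(sent)
print(abbs)   # ['APIs', 'BI']
print(hypns)  # ['event-driven', 'Self-service']

# With a capturing group, findall returns only the last captured "-word" part.
captured = re.findall(r'\b(?:\w+(-\w+)+)\b', sent)
print(captured)  # ['-driven', '-service']
```

To run this over the DataFrame column from the question, call extract(sent) inside the existing loop over df1['Open End Text'].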