PYTHON - How to extract sentences containing a citation mark from a text file
For example, I have 3 sentences, shown below, and one of them contains a citation mark (Warren and Pereira, 1982). The citation is always in parentheses, in this format: (~string~comma(,)~space~number~)
He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.
I am using a regex to extract only the middle sentence, but it keeps printing all 3 sentences.
The result should look like this:
The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).
text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."
You can split the text into a list of sentences and then pick the ones that end with ")".
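# Split on "." and drop the trailing empty piece after the final period.
sentences = text.split(".")[:-1]
for sentence in sentences:
    # endswith() is safe on empty strings, unlike sentence[-1].
    if sentence.endswith(")"):
        print(sentence.strip() + ".")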
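The split above only finds citations that close a sentence, and it also breaks on any period inside a sentence. As a rough sketch (the names CITATION, SENTENCE_END and sentences_with_citation are made up for illustration and do not come from the question), one could split on sentence-ending punctuation followed by whitespace, then keep the sentences that contain something shaped like the stated citation format, i.e. a string, a comma, a space and a number inside parentheses:

```python
import re

# Assumed citation shape from the question: "(some text, 1982)".
CITATION = re.compile(r"\([^()]+, \d+\)")
# Naive sentence boundary: whitespace preceded by ".", "!" or "?".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_with_citation(text):
    """Return the sentences of text that contain a parenthesised citation."""
    sentences = SENTENCE_END.split(text.strip())
    return [s for s in sentences if CITATION.search(s)]


text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 "
        "(Warren and Pereira, 1982). CHAT-80 was a state of the art natural "
        "language system that was impressive on its own merits.")

for sentence in sentences_with_citation(text):
    print(sentence)
```

This still treats abbreviations such as "e.g. " as sentence ends, so it is only a starting point for real text.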
Setup... two texts representing the cases of interest:
text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."
t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."
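First, to match the case where the citation is at the end of the sentence:
p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
To match when the citation is not at the end of the sentence:
p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"
Combine the two cases with the "|" regex operator:
import re
p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")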
Running:
>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]
>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]
In both cases you get the sentence containing the citation.
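Because the pattern has two capturing groups, findall returns a tuple per match with one slot left empty. A small sketch of collapsing those tuples into plain strings follows; the helper name citation_sentences is hypothetical, and the pattern is the same pair of alternatives written as raw strings with the character class spelled A-Za-z:

```python
import re

# The two alternatives from p_main, as raw strings with an A-Za-z class.
p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")


def citation_sentences(text):
    """Flatten findall's (group1, group2) tuples into plain strings."""
    # Exactly one group is non-empty per match, so "or" picks it.
    return [first or second for first, second in p_main.findall(text)]


text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 "
        "(Warren and Pereira, 1982). CHAT-80 was a state of the art natural "
        "language system that was impressive on its own merits.")
# Second case: the citation sits in the middle of its sentence.
t2 = text.replace("1982).", "1982) fgbhdr was a state of the art natural.")

print(citation_sentences(text))
print(citation_sentences(t2))
```

Note that both alternatives begin with a literal period and space, so a citation in the very first sentence of a text would not be matched by this pattern.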
The Python regular expression documentation and the accompanying regex HOWTO page are good resources.
Cheers