How to extract multiple citations separated by tags from a text using regular expression?
I have a manually typed file containing citations, each of the form:
< S sid ="2" ssid = "2">It differs from previous machine
learning-based NERs in that it uses information from the whole
document to classify each word, with just one classifier.< /S>< S sid
="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier,
which corrects the mistakes of a primary sentence- based classifier.<
/S>
Here is my current approach using Python's re module:
# Take everything between the first '>' and the last '<' in the line
citance = citance[citance.find(">")+1:citance.rfind("<")]
fd.write(citance+"\n")
I am trying to extract everything from the first angle bracket ('>') to the last angle bracket ('<'). However, this approach fails in the case of multiple citations, since the intermediate tags also end up in the output:
It differs from previous machine learning-based NERs in that it uses
information from the whole document to classify each word, with just
one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves
the gathering of information from the whole document often uses a
secondary classifier, which corrects the mistakes of a primary
sentence- based classifier.
My desired output:
It differs from previous machine learning-based NERs in that it uses
information from the whole document to classify each word, with just
one classifier. Previous work that involves
the gathering of information from the whole document often uses a
secondary classifier, which corrects the mistakes of a primary
sentence- based classifier.
How can I implement this correctly?
I would go with Python's regular expression module, re, by doing:
re.findall(r'\">(.*?)<', text_to_parse)
This will return anywhere from one to multiple citations as a list of strings, and if you then want a single unified text you can join them afterwards with " ".join(...).
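For example, on the sample line from the question (a minimal sketch; the variable name text_to_parse and the abbreviated string are mine):

import re

# Sample line from the question, shortened with '...' for brevity
text_to_parse = ('< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs ... with just one classifier.< /S>'
                 '< S sid ="3" ssid = "3">Previous work ... a primary sentence- based classifier.< /S>')

# Each citation sits between an attribute's closing '">' and the next '<';
# the non-greedy (.*?) keeps the match from running across tags
sentences = re.findall(r'\">(.*?)<', text_to_parse)
print(" ".join(sentences))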
Instead of the re module, take a look at the bs4 library.
It is an XML/HTML parser, so you can get everything located between tags.
For you it would look something like this:
from bs4 import BeautifulSoup
import re

xml_text = '< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.< /S>'
# Normalize the stray spaces after '<' so the parser recognizes the tags
xml_text = re.sub(r'<\s+', '<', xml_text)
text_soup = BeautifulSoup(xml_text, 'lxml')
# The lxml parser lowercases tag names, so search for 's', not 'S'
output = text_soup.find_all('s', attrs={'sid': '2'})
output will then hold the matching tag, whose text is:
It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.
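And to get the single joined text the question asks for, you can walk over all the sentence tags instead of filtering on one sid (a sketch under the same assumptions, using get_text(); the abbreviated string is mine):

from bs4 import BeautifulSoup
import re

xml_text = '< S sid ="2" ssid = "2">It differs ... one classifier.< /S>< S sid ="3" ssid = "3">Previous work ... sentence- based classifier.< /S>'
xml_text = re.sub(r'<\s+', '<', xml_text)  # fix the '< S' spacing
soup = BeautifulSoup(xml_text, 'lxml')
# get_text() returns just the text inside each tag
citance = " ".join(tag.get_text() for tag in soup.find_all('s'))
print(citance)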
Moreover, if you just want to remove the HTML tags:
import re

xml_text = '< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.< /S>'
# Non-greedy '<.*?>' matches each tag individually and deletes it
re.sub('<.*?>', '', xml_text)
will do the job.
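One caveat (my observation, not part of the answer): deleting the tags outright glues adjacent sentences together as "one classifier.Previous work". Replacing each tag with a space and then collapsing the leftover whitespace avoids that:

import re

xml_text = '< S sid ="2" ssid = "2">It differs ... one classifier.< /S>< S sid ="3" ssid = "3">Previous work ... sentence- based classifier.< /S>'
# Replace each tag with a space, then collapse runs of whitespace
clean = re.sub(r'\s+', ' ', re.sub('<.*?>', ' ', xml_text)).strip()
print(clean)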
I think this is what you are looking for.
import re

string = ">here is some text<>here is some more text<"
# Capture the shortest stretch between each '>' and the following '<'
matches = re.findall(">(.*?)<", string)
for match in matches:
    print(match)
You were probably not getting many results because your match was running from the first character of the string to the last, since those are '>' and '<', ignoring the brackets in between. The '.*?' idiom makes the match non-greedy, so it finds the maximum number of hits.
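To see the difference the non-greedy qualifier makes, compare the two forms on a tiny string (an illustration of the idiom, not code from the post):

import re

text = ">first<>second<"
print(re.findall(">(.*)<", text))   # greedy: ['first<>second']
print(re.findall(">(.*?)<", text))  # non-greedy: ['first', 'second']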