RegEx for capturing scientific citations

I'm trying to capture parenthesized text that contains at least one digit (think citations). This is my regex, and it works fine on regex101: https://regex101.com/r/oOHPvO/5

\((?=.*\d).+?\)

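So I want it to capture (Author 2000) and (2000) but not (Author).

I'm trying to use Python to capture all of these parenthesized citations, but in Python it also captures parenthesized text that contains no digits.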

import re

with open('text.txt') as f:
    text = f.read()

s = r"\((?=.*\d).*?\)"

citations = re.findall(s, text)

citations = list(set(citations))

for c in citations:
    print(c)

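Any idea what I'm doing wrong?

Probably the most reliable way to handle this expression is to add boundaries, since your expression is likely to grow. For example, we can define character classes for each part of the data we want to collect: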

(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).

DEMO

Test

import re

regex = r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))."

test_str = "some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)"

matches = re.finditer(regex, test_str, re.IGNORECASE)

for match_num, match in enumerate(matches, start=1):
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")

    # Groups are numbered from 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")

Demo

const regex = /(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))./mgi;
const str = `some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}


You can use

re.findall(r'\([^()\d]*\d[^()]*\)', s)

regex demo

Details

  • \( - a ( character
  • [^()\d]* - 0 or more characters other than (, ) and digits
  • \d - a digit
  • [^()]* - 0 or more characters other than ( and )
  • \) - a ) character.
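
To see why the negated character classes matter, here is a minimal sketch comparing the lookahead pattern from the question with the pattern above. The sample string is made up for illustration. The lookahead `(?=.*\d)` is satisfied by a digit anywhere later on the same line, so a digit-free group can still match when a real citation follows it.

```python
import re

# Hypothetical sample line: a digit-free group followed later by a real citation.
text = "First (Author) and then (Author 2000)."

# The lookahead only requires a digit somewhere later on the line,
# not necessarily before the closing parenthesis.
lookahead = re.compile(r"\((?=.*\d).*?\)")

# The negated classes cannot cross a parenthesis, so the digit
# must sit inside the same pair of parentheses.
bounded = re.compile(r"\([^()\d]*\d[^()]*\)")

print(lookahead.findall(text))  # ['(Author)', '(Author 2000)']
print(bounded.findall(text))    # ['(Author 2000)']
```

This would also explain why the original pattern can look correct in a tester where each sample sits on its own line, yet over-match on running text.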


Python demo:

import re
rx = re.compile(r"\([^()\d]*\d[^()]*\)")
s = "Some (Author) and (Author 2000)"
print(rx.findall(s)) # => ['(Author 2000)']
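
Applied to the workflow in the question, this might look like the sketch below. The helper name `unique_citations` and the sample text are assumptions for illustration, and an inline string stands in for the contents of `text.txt`. It uses `dict.fromkeys` rather than `set` so that duplicates are dropped while the order of first appearance is kept.

```python
import re

CITATION = re.compile(r"\([^()\d]*\d[^()]*\)")


def unique_citations(text):
    """Return each distinct citation once, in order of first appearance.

    Hypothetical helper, not part of the original answer.
    """
    return list(dict.fromkeys(CITATION.findall(text)))


# Stand-in for the file contents read in the question.
sample = "As shown (Author 2000) and later (Author) and again (Author 2000), see (2000)."

for citation in unique_citations(sample):
    print(citation)
# (Author 2000)
# (2000)
```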

To get the results without the parentheses, add a capturing group:

rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")
                    ^                ^

See this Python demo.
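
As a quick check of the capturing-group variant, here is a short sketch with a made-up input string. When the pattern contains exactly one group, `re.findall` returns the group contents rather than the whole match, while `finditer` gives access to both.

```python
import re

rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")
s = "Some (Author) and (Author 2000) and (2000)"  # illustrative input

# With a single capturing group, findall returns only the group contents.
inner = rx.findall(s)
print(inner)  # ['Author 2000', '2000']

# finditer exposes both the full match (group 0) and the captured text (group 1).
pairs = [(m.group(0), m.group(1)) for m in rx.finditer(s)]
print(pairs)  # [('(Author 2000)', 'Author 2000'), ('(2000)', '2000')]
```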