Python regex to get citations in a paper
I am adapting this code to extract citations from a text:
#!/usr/bin/env python3
#
import re
from sys import stdin
text = stdin.read()
author = "(?:[A-Z][A-Za-z'`-]+)"
etal = "(?:et al.?)"
additional = "(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = "(?:, p.? [0-9]+)?" # Always optional
year = r"(?:, *"+year_num+page_num+r"| *\("+year_num+page_num+r"\))"
regex = "(" + author + additional+"*" + year + ")"
matches = re.findall(regex, text)
matches = list( dict.fromkeys(matches) )
matches.sort()
#print(matches)
print ("\n".join(matches))
However, it recognizes some capitalized words as author names. For example, given the text:
Although James (2020) recognized blablabla, Smith et al. (2020) found mimimi.
Those inconsistent results are a sign of lalala (Green, 2010; Grimm, 1990).
Also James (2020) ...
the output will be
Also James (2020)
Although James (2020)
Green, 2010
Grimm, 1990
Smith et al. (2020)
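Is there a way to "blacklist" certain words in the code above without dropping the whole match? I want it to still recognize James's work, but with "Also" and "Although" removed from the citation.
Thanks in advance.
You can use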
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?" # Always optional
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'
matches = re.findall(regex, text)
See the Python demo and the resulting regex demo.
The main difference is regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}', where the \b(?!(?:Although|Also)\b) part fails the match if the word immediately to the right is Although or Also.
Also, note that I escaped the dots that are meant to match literal dots, and used f-strings to make the code more compact.
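A small sketch of why the escaping matters, using the et al fragment on its own. The sample string is invented for illustration. An unescaped dot matches any character, so the optional `.?` can swallow a comma that the rest of the pattern may need.

```python
import re

unescaped = re.compile(r"et al.?")   # '.' matches any character except newline
escaped = re.compile(r"et al\.?")    # '\.' matches only a literal dot

sample_fragment = "et al, 2020"      # hypothetical input without the dot

loose = unescaped.match(sample_fragment).group()
strict = escaped.match(sample_fragment).group()

print(repr(loose))   # the comma is consumed by '.?'
print(repr(strict))  # stops before the comma
```

Writing the patterns as raw strings (r"..." or fr"...") also keeps Python from treating sequences such as \. or \( as string escapes, which recent Python versions warn about in ordinary string literals.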
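Here is my answer; the previous answer did not work for some citations.
regexr.com/6er6n
I got the previous answer from another source, but it did not work for another type of citation text. My version fixes that: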
citationsRegex = r"\b(?!(?:Although|Also)\b)(?:[A-Z][A-Za-z'`-]+)(?:,? (?:(?:and |& )?(?:[A-Z][A-Za-z'`-]+)|(?:et al.?)))*(?:,? *(?:19|20)[0-9][0-9](?:, p\.? [0-9]+)?| *\((?:19|20)[0-9][0-9](?:, p\.? [0-9]+)?\))"
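Comparing this pattern with the previous answer, the visible change is in the year part: the comma before the year is now optional (`,? *` instead of `, *`). The sketch below illustrates that difference under stated assumptions: `build_regex` and its `comma_optional` flag are hypothetical helpers, the test sentences are invented, and the et al dot is escaped here, unlike in the one-line pattern above.

```python
import re

def build_regex(comma_optional):
    """Build the citation pattern; `comma_optional` is a hypothetical switch."""
    author = r"(?:[A-Z][A-Za-z'`-]+)"
    etal = r"(?:et al\.?)"
    additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
    year_num = r"(?:19|20)[0-9][0-9]"
    page_num = r"(?:, p\.? [0-9]+)?"
    comma = ",?" if comma_optional else ","
    year = fr"(?:{comma} *{year_num}{page_num}| *\({year_num}{page_num}\))"
    return fr"\b(?!(?:Although|Also)\b){author}{additional}*{year}"

strict = build_regex(comma_optional=False)   # comma required, as in the first answer
relaxed = build_regex(comma_optional=True)   # comma optional, as in this answer

sentence = "see Smith 2020 and Green, 2010."
strict_hits = re.findall(strict, sentence)
relaxed_hits = re.findall(relaxed, sentence)
print(strict_hits)
print(relaxed_hits)

# Trade-off: any capitalized word directly before a year now also matches.
false_positive = re.findall(relaxed, "In 2020 the trend reversed.")
print(false_positive)
```

The relaxed form picks up author-year citations written without a comma, at the cost of matching phrases such as a capitalized preposition followed by a year, so the blacklist may need more entries when it is used.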