Getting the title of a citation with regex

Question

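I'm not very familiar with regular expressions, but I'd like to extract a paper's title from a citation: the title sits between the year (e.g. 1991 in the first citation) and the next period in the sentence. I have put it in italics here.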

"1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera. J. Exp. Med.5:1159-63."

"2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1 complement receptor in pathology. Pol. Merkur Lekarski. 28:84–88."

"3WHO. Geneva: WHO; 2018. World Malaria Report 2018".

The citations are stored in the citation column of a data frame (df). Expected output:

The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera

The structure and role of CR1 complement receptor in pathology

I wrote a regex that looks like this:

df$citation = sub('[^"]*?\\)', "", df$citation)
df$citation = sub("\\..*", "", df$citation)

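Any suggestions on how to do this in a single line? Also, it would be nice to have a regex that drops the citation if it does not find a year in parentheses, as in the third citation. Is that possible?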

Answer 1

Given your set of requirements, you can use

sub("^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*", "\\1", df$citation, perl=TRUE)

See the regex demo.

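Details

^ - start of the string
.*? - any 0+ characters other than line break characters, as few as possible
\b(?:19|20)\d{2} - a word boundary, then 19 or 20, then any two digits
\) - a ) character
\s* - 0+ whitespace characters
([^.]+) - Group 1: one or more characters other than .
.* - any 0+ characters other than line break characters, as many as possible.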
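
The pattern can be tried on the three citations from the question. The sketch below is not part of the original answer: the sample data frame and the names pattern, has_year and titles are illustrative assumptions. Because sub() returns non-matching strings unchanged, it first uses grepl() to keep only the citations that contain a year in parentheses, which is one possible way to drop entries like the third citation.

```r
# Sample data mirroring the three citations from the question.
# "\u2013" is the en dash in the page range of the second citation.
df <- data.frame(
  citation = c(
    "1Moulds J.M., Nickells M.W., Moulds J.J., et al. (1991) The C3b/C4b receptor is recognized by the Knops, McCoy, Swain-langley, and York blood group antisera. J. Exp. Med.5:1159-63.",
    "2Rochowiak A., Niemir Z.I. (2010) The structure and role of CR1 complement receptor in pathology. Pol. Merkur Lekarski. 28:84\u201388.",
    "3WHO. Geneva: WHO; 2018. World Malaria Report 2018"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer; backslashes are doubled inside R strings.
pattern <- "^.*?\\b(?:19|20)\\d{2}\\)\\s*([^.]+).*"

# TRUE where a citation contains a four-digit year followed by ")".
has_year <- grepl(pattern, df$citation, perl = TRUE)

# Extract Group 1 (the title) only from the citations that matched.
titles <- sub(pattern, "\\1", df$citation[has_year], perl = TRUE)
print(titles)
```

With this data, has_year is TRUE for the first two citations and FALSE for the third, so titles holds only the two extracted titles.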
