How to extract a fixed number of characters before a string in R

Question

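I have a piece of text that contains a reference to a court case somewhere in the document, for example

x <- "2009 U.S. LEXIS"

I know it is always a four-digit year followed by a space, directly in front of the pattern "U.S. LEXIS". How should I extract the four-digit year?

Thanks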

Answer 1

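You can try:

x <- "2009 U.S. LEXIS"
# Backslashes must be doubled inside R string literals; dots are escaped to match literal periods
as.numeric(sub('.*?(\\d{4}) U\\.S\\. LEXIS', '\\1', x))
#[1] 2009

Or use stringr::str_extract:

as.numeric(stringr::str_extract(x, '\\d{4}(?= U\\.S\\. LEXIS)'))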
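
For readers who would rather avoid a package dependency, the same lookahead idea can be sketched in base R. This is only an illustration, assuming the citation always reads exactly " U.S. LEXIS" after the year. perl = TRUE is required because the default regex engine does not support lookaheads.

```r
x <- "2009 U.S. LEXIS"
# The lookahead (?= ...) asserts that the citation text follows the four
# digits without making it part of the match; dots are escaped so that
# they match literal periods only.
m <- regexpr("\\d{4}(?= U\\.S\\. LEXIS)", x, perl = TRUE)
# regmatches() pulls out the matched substring; convert it to a number.
year <- as.numeric(regmatches(x, m))
year
```

If the pattern is absent, regmatches() returns a zero-length vector, so the result is numeric(0) rather than NA.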

Answer 2

We can use parse_number from readr

library(readr)
parse_number(x)
#[1] 2009

Data

x <- "2009 U.S. LEXIS"

Answer 3

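The substr function solves this when the year occupies the first four characters of the string. Note that substr is part of base R, not stringr (the stringr counterpart is str_sub).

substr(x, 1, 4)

If you need a number, wrap the result in as.numeric:

as.numeric(substr(x, 1, 4))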
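
Fixed positions 1 to 4 only work when the year sits at the very start of the string. When the citation appears somewhere in the middle of a document, one option is to locate the marker first and then take the characters just before it. The following is a minimal base R sketch under that assumption. The helper name chars_before and its arguments marker and n are made up for illustration and do not come from any package.

```r
# Hypothetical helper: return the n characters immediately before the
# first occurrence of a literal marker, or NA if the marker is absent.
chars_before <- function(x, marker = " U.S. LEXIS", n = 4) {
  # fixed = TRUE treats the marker as plain text, so dots need no escaping;
  # regexpr() gives the start position of the first match, or -1 if none.
  pos <- as.integer(regexpr(marker, x, fixed = TRUE))
  out <- substr(x, pos - n, pos - 1)
  out[pos < 1] <- NA_character_
  out
}

doc <- "see 2009 U.S. LEXIS 1234"
year <- as.numeric(chars_before(doc))
year
```

Unlike the regex solutions, this does not check that the extracted characters are digits, so it is only suitable when the format is known to be reliable.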

Answer 4

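I don't think the data/vector you provided is enough for the experts here to solve your problem.

Update: try this (with stringr loaded and x defined as in the old answer below)

str_extract_all(x, "\\d{4}(?=\\sU\\.S\\.\\sLEXIS)")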

[[1]]
[1] "2009" "2015" "1990"

Or extract them as numbers:

lapply(str_extract_all(x, "\\d{4}(?=\\sU\\.S\\.\\sLEXIS)"), as.numeric)

[[1]]
[1] 2009 2015 1990

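Old answer: I am also new to regex, so my solution may not be a very clean approach. In general, your case amounts to searching for a nested group within a regex pattern. Still, you can try this approach: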

x <- "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"

> x
[1] "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"

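Now follow these steps:

library(stringr)
# First extract every full citation, then pull the four-digit year out of each one
lapply(str_extract_all(x, "(\\d{4})\\sU\\.S\\.\\sLEXIS"), str_extract, pattern = "(\\d{4})")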

[[1]]
[1] "2009" "2015" "1990"

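In general, "((\\d{4})\\sU\\.S\\.\\sLEXIS)" could be used as the regex pattern, but I am not sure how nested groups are handled in R, so I used lapply here. Basically, str_extract_all(x, "(\\d{4})\\sU\\.S\\.\\sLEXIS") returns all the citations, and the second str_extract pulls the year out of each. Give it a try.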
