Extract a numeric value of a specific length from a string in R using regex

Question

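This may look like a duplicate question, but the other answers did not help me. I am trying to extract any 8-digit number from a text. The number can be anywhere in the text. It can stand alone, or it can follow or be followed by other characters. Basically, I need to extract every occurrence of 8 consecutive numeric characters from a string in R, using regex only.

Here is what I tried, to no avail: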

> my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

> ## this doesn't work
    > gsub('(\\d{8})', '\\1', my_text)
    [1] "the number 5849 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't.both 12345678JE and RG10293847 should turn up as well."

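The output I want should extract the following numbers:

12345654 99119911 12345678 10293847

Also, I would appreciate it if the answer included a second regex that extracts only the first occurrence of an 8-digit number: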

12345654

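Edit: I have a very large table (about 200 million rows), and I need to run this on one column. What is the most efficient solution?

Edit: I realized that my test text was missing a case. The text also contains some numbers longer than 8 digits, but I only want to extract numbers that are exactly 8 digits long.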

Answer 1

We can use str_extract_all

stringr::str_extract_all(my_text, "\\d{8}")[[1]]
#[1] "12345654" "99119911" "12345678" "10293847"

Similarly, in base R we can use gregexpr and regmatches

regmatches(my_text, gregexpr("\\d{8}", my_text))[[1]]

To get the last 8-digit number, we can use

sub('.*(\\d{8}).*', '\\1', my_text)
#[1] "10293847"

And for the first one, we can use

sub('.*?(\\d{8}).*', '\\1', my_text)
#[1] "12345654"

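Edit

For the updated case, where we want to match exactly 8 digits (and no more), we can use str_match_all with a negative lookbehind and a negative lookahead. The plain \d{8} pattern above would otherwise also return the first 8 digits of a longer run such as 5555555555.

stringr::str_match_all(my_text, "(?<!\\d)\\d{8}(?!\\d)")[[1]][, 1]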
#[1] "12345654" "99119911" "12345678" "10293847"

Here we get the 8-digit numbers that are neither preceded nor followed by another digit.
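The same lookaround pattern can also be tried in base R. The following is a minimal sketch, not part of the original answer: it assumes `perl = TRUE` is passed, because R's default regex engine does not support lookbehind or lookahead. The names `exact8`, `all_hits` and `first_hit` are only illustrative.

```r
my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

# Exactly 8 digits: no digit immediately before, no digit immediately after.
exact8 <- "(?<!\\d)\\d{8}(?!\\d)"

# All occurrences; perl = TRUE enables the lookarounds.
all_hits <- regmatches(my_text, gregexpr(exact8, my_text, perl = TRUE))[[1]]
all_hits
#[1] "12345654" "99119911" "12345678" "10293847"

# First occurrence only: regexpr() stops at the first match.
first_hit <- regmatches(my_text, regexpr(exact8, my_text, perl = TRUE))
first_hit
#[1] "12345654"
```

Using regexpr instead of gregexpr also covers the request for only the first 8-digit number under the "exactly 8 digits" rule.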

A simple option is also to extract all the numbers from the string and keep only the 8-digit ones

v1 <- regmatches(my_text, gregexpr("\\d+", my_text))[[1]]
v1[nchar(v1) == 8]
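For the large-table case mentioned in the question, one possible approach is to run the pattern over the whole column in a single vectorized call rather than row by row. This is only a sketch under assumptions: `x` is a made-up stand-in for the real column, `first8` is a hypothetical result name, and no timing was done, so whether this is the fastest option for 200 million rows is not established here.

```r
# Hypothetical stand-in for one character column of the table.
x <- c("but12345654 and 99119911", "no digits here", "id 5555555555", "RG10293847")

exact8 <- "(?<!\\d)\\d{8}(?!\\d)"

# regexpr() is vectorized: one match position per element, -1 where nothing matches.
m <- regexpr(exact8, x, perl = TRUE)

# regmatches() returns only the matched elements, so fill a result
# vector of the full length and leave NA where there was no match.
first8 <- rep(NA_character_, length(x))
first8[m != -1] <- regmatches(x, m)
first8
#[1] "12345654" NA         NA         "10293847"
```

stringr::str_extract(x, exact8) should give the same shape of result, with NA for rows without a match.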

Answer 2

We can be more specific to avoid any edge cases

library(stringr)
str_extract_all(my_text, "(?<![0-9])[0-9]{8}(?![0-9])")[[1]]
#[1] "12345654" "99119911" "12345678" "10293847"

Check the difference

v1 <- "hello8888882343, 888884399, 88888888, 8888888888"
str_extract_all(v1, "\\d{8}")
#[[1]]
#[1] "88888823" "88888439" "88888888" "88888888"

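Here the plain pattern also pulls 8-digit substrings out of runs longer than 8 digits; according to the OP's post, those should be left out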

str_extract_all(v1,  "(?<![0-9])[0-9]{8}(?![0-9])")
#[[1]]
#[1] "88888888"
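As a quick cross-check, the lookaround pattern can be compared with the "extract every digit run, then filter by length" idea on the same test string. This is a base R sketch, assuming `perl = TRUE` for the lookarounds; the names `s`, `by_length` and `by_lookaround` are only illustrative.

```r
s <- "hello8888882343, 888884399, 88888888, 8888888888"

# Approach 1: take every run of digits, keep those of length 8.
runs <- regmatches(s, gregexpr("[0-9]+", s))[[1]]
by_length <- runs[nchar(runs) == 8]

# Approach 2: match 8 digits not adjacent to any other digit.
by_lookaround <- regmatches(
  s, gregexpr("(?<![0-9])[0-9]{8}(?![0-9])", s, perl = TRUE)
)[[1]]

identical(by_length, by_lookaround)
#[1] TRUE
```

Both approaches return only "88888888" here, since the other runs are 9 or 10 digits long.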
