Efficient extraction of number in middle of semi-irregular text string

Question

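I want to extract a single digit from the middle of text strings that vary only slightly. The number of characters before the desired digit is sometimes 4 and sometimes 5. Sometimes the desired digit is followed by "[letter].docx", and sometimes just by ".docx".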

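I have written a brute-force solution, but I would like to learn how to do this more elegantly, and I have two specific questions.

Two questions:

How can I write the regular expressions below more generally? Brute force works in my case because I only have ten variants, but I would like to see a general solution.
Why doesn't the array() option work? I am trying to implement what I understand to be described here. For some reason, in my case R returns an error after the third element of the replacement array.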

Data:

data$file
XX12_1a.docx
XX4_1b.docx
XX35_4.docx
XX9_3.docx
XX21_2.docx

Goal:

data$id
1
1
4
3
2

SSCCE：

require('tidyverse')

data <- data.frame(file = c('XX12_1a.docx',
               'XX4_1b.docx',
               'XX35_4.docx',
               'XX9_3.docx',
               'XX21_2.docx'))

# Brute force solution:
data$id <- str_replace(data$file, '.....1a.....', '1')
data$id <- str_replace(data$id, '.....1b.....', '1')
data$id <- str_replace(data$id, '.....2.....', '2')
data$id <- str_replace(data$id, '.....3.....', '3')
data$id <- str_replace(data$id, '.....4.....', '4')
data$id <- str_replace(data$id, '....1a.....', '1')
data$id <- str_replace(data$id, '....1b.....', '1')
data$id <- str_replace(data$id, '....2.....', '2')
data$id <- str_replace(data$id, '....3.....', '3')
data$id <- str_replace(data$id, '....4.....', '4')

# More concise attempt, does not run
data$id2 <- str_replace(data$file, 
            array('.....1a.....', 
                  '.....1b.....', 
                  '.....2.....', 
                  '.....3.....',
                  '.....4.....',
                  '....1a.....',
                  '....1b.....',
                  '....2.....',
                  '....3.....',
                  '....4.....'), 
            array('1', '1', '2', '3', '4', '1', '1', '2', '3', '4'))

Answer 1

You can use sub here:

data <- data.frame(file=c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", "XX9_3.docx", "XX21_2.docx"))
data$id <- sub("^.*_(\\d+).*$", "\\1", data$file)
data

          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2
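
The pattern can be read piece by piece. Here is a minimal base R sketch, assuming the same five file names as in the question; the vector name `files` and the `README.txt` string are only for illustration:

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx",
           "XX9_3.docx", "XX21_2.docx")

# ^.*_    : everything from the start up to the last underscore
# (\\d+)  : capture the run of digits that follows the underscore
# .*$     : whatever remains (optional letter plus ".docx")
id <- sub("^.*_(\\d+).*$", "\\1", files)

# sub() returns character; convert if a numeric id is wanted.
id_int <- as.integer(id)

# When the pattern does not match, sub() hands back the input unchanged.
no_match <- sub("^.*_(\\d+).*$", "\\1", "README.txt")
```

Because `sub` returns the input unchanged on a failed match, a file name without an underscore followed by digits would come back whole rather than as NA, so it may be worth checking the result if the real data are less regular than the sample.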

Answer 2

You can use extract:

library(tidyverse)
data <- data %>%
   extract(file, 'id', '_(\\d+)', remove = FALSE)
          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2

Answer 3

An option with trimws from base R

data$id <- trimws(data$file, whitespace = ".*_|\\D?\\..*")

-output

> data
          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2

data

data <- structure(list(file = c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", 
"XX9_3.docx", "XX21_2.docx")), class = "data.frame", row.names = c(NA, 
-5L))
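
The `whitespace` pattern strips two things: everything through the underscore on the left, and an optional non-digit plus the extension on the right. A rough two-step equivalent with `sub` may make that easier to see. This is a sketch of the idea, not a description of how `trimws` is implemented, and the names `files` and `step1` are illustrative:

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx",
           "XX9_3.docx", "XX21_2.docx")

# Step 1: drop the prefix up to and including the last underscore.
step1 <- sub("^.*_", "", files)

# Step 2: drop an optional non-digit, the dot, and the rest of the extension.
id <- sub("\\D?\\..*$", "", step1)
```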

Answer 4

Since the target digit, judging from your examples, is always preceded by _, you can use a lookbehind:

library(stringr)
str_extract(data$file, "(?<=_)\\d")
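
For comparison, the same lookbehind can be tried without stringr. This is a base R sketch that swaps in `regexpr` and `regmatches` with `perl = TRUE` (needed for lookbehind support); the vector name `files` is illustrative:

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx",
           "XX9_3.docx", "XX21_2.docx")

# (?<=_) asserts that an underscore sits just before the digit
# without including it in the match.
m <- regexpr("(?<=_)\\d", files, perl = TRUE)
id <- regmatches(files, m)
id
# "1" "1" "4" "3" "2"
```

One difference to keep in mind: `regmatches` drops elements that have no match, whereas `str_extract` returns NA for them, so the stringr version is safer when the result has to line up with the rows of a data frame.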

Answer 5

Here is a tidyverse solution:

library(tidyverse)
data %>% 
  separate(file, c("split1", "split2"), remove=FALSE) %>% 
  mutate(id = parse_number(split2), .keep="unused") %>% 
  select(-split1)

Output:

          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2
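
On the second question, one likely cause (a sketch, not a confirmed diagnosis of the asker's session) is that `array()` takes its arguments as `data`, `dim`, `dimnames`. A call with ten positional strings would therefore treat the second as `dim` and the third as `dimnames`, and have no parameter left for the rest. `c()` is the usual way to build a character vector. The short strings passed to `array()` below are placeholders:

```r
# Inspect the parameters that array() accepts.
array_args <- names(formals(array))
array_args
# "data" "dim" "dimnames"

# A fourth positional argument has nowhere to go, so the call fails.
# tryCatch() captures the message instead of stopping the script.
err <- tryCatch(
  array("a", "b", "c", "d"),
  error = function(e) conditionMessage(e)
)

# c() builds the intended character vector of patterns.
patterns <- c(".....1a.....", ".....1b.....", ".....2.....")
length(patterns)
# 3
```

Even with `c()`, `str_replace` is vectorised over string and pattern in parallel rather than trying every pattern against every string, so a single general pattern like the ones above is probably the more practical route.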
