Regex comma use in data cleaning with R
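Following up on an earlier question of mine, I was able to clean almost all of my data. Thank you, wonderful coders. However, while trying to understand how the "playground" works, I keep running into a problem with commas.
The dataset originally looks like this: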
Species Association Year
<fctr> <chr> <dbl>
1 RC SKS/BW NA
2 BW Sykes, rc NA
3 SKS Babo/bw NA
4 RC baboon, mangabey NA
5 Mang red colobus, bw, sykes NA
6 SKS babo/red duiker NA
11 BW r/c monkeys 12
21 RC b/w colobus 12
31 SKS b/w colobus/R/c monkeys 12
41 BW sykes/R/c monkeys 12
51 RC sykes/b/w colobus 12
61 BABO - 12
7 SKS - 12
8 RC - 12
9 SKS r/c monkeys 12
10 RC sykes monkeys 12
53 BW sykes,b/w colobus 12
57 BW r/c monkeys,bw 12
58 Mang sykes,R/c monkeys 12
Input:
dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS",
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC", "BW", "BW", "Mang"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey",
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus",
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus",
".", ".", ".", "r/c monkeys", "sykes monkeys", "sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys"), year = c(NA, NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1", "2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61", "7", "8", "9", "10", "53", "57", "58"), class = "data.frame")
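To clean it, I created a dictionary and then used a regex to capture every variation in the Association column. It works for all rows except the last three, because those are separated by "," rather than "/".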
dict <- read.table(header=TRUE, text='
from to
"BABO" BABO
"yellow baboon" BABO
"BW" BW
"bw colobus" BW
"Bw" BW
"bw" BW
"Bw colobus" BW
"B/W COLOBUS" BW
"RC" RC
"RED COLOBUS" RC
"rc monkeys" RC
"Red colobus" RC
"R/C MONKEYS" RC
"Rc monkeys" RC
"MANGABEY" MANG
"MANGA" MANG
"mangabeys" MANG
"SKS" SKS
"SYKES" SKS
"SYKES MONKEYS" SKS
"sykes" SKS
"SYKES MONKEY" SKS
"RED DUIKER" RD
"Red duiker" RD
"Red Duiker + V . Fresh dung" RD
')
regex <- '(?<=\\w{2})\\/|,\\s'  # backslashes must be doubled inside an R string
spf <- "%s"
data.frame(from=
sprintf(spf,
sort(unique(unlist(
strsplit(toupper(dat$Association), regex, perl=TRUE)))))) |>
print(row.names=FALSE)
res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
lapply(\(x) dict[match(x, dict$from), ]$to) |>
sapply(toString) |>
{\(.) replace(., . == ".", NA)}() |>
data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
setNames(c('association', 'site', 'species', 'year')) |>
subset(select=c(3, 1, 2, 4))
This gives me a final data frame:
Species Association Site Year
<fctr> <chr> <chr> <dbl>
1 RC SKS, BW Protected NA
2 BW SKS, RC Protected NA
3 SKS BABO, BW Protected NA
4 RC BABO, MANG Protected NA
5 MANG RC, BW, SKS Protected NA
6 SKS BABO, RD Protected NA
7 BW RC Protected 12
8 RC BW Protected 12
9 SKS BW, RC Protected 12
10 BW SKS, RC Protected 12
11 RC SKS, BW Protected 12
12 BABO NA Protected 12
13 SKS NA Protected 12
14 RC NA Protected 12
15 SKS RC Protected 12
16 RC SKS Protected 12
17 BW NA Protected 12
18 BW NA Protected 12
19 MANG NA Protected 12
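I would like the last three rows to show the correct associations (i.e. SKS, BW; RC, BW; SKS, RC), but everything I read about regex treats the comma as part of the expression syntax rather than as something to find in the string. Is there a way to include it so that I get the correct output? I am still very new to regex, and new to R as well. Any help is much appreciated.
The problem is with your dictionary. Using the tidyverse, you could do the following: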
library(tidyverse)
dict1 <- dict %>%
add_row(from = 'BABOON', to = 'BABO') %>%
add_row(from='.', to = NA) %>%
add_row(from = '/', to = ',') %>%
mutate(from = toupper(from)) %>%
distinct() %>%
arrange(desc(nchar(from)))
dat %>%
mutate(Association = str_replace_all(toupper(Association),
fixed(setNames(dict1$to, dict1$from))),
Site = 'Protected')
Species Association year Site
1 RC SKS,BW NA Protected
2 BW SKS, RC NA Protected
3 SKS BABO,BW NA Protected
4 RC BABO, MANG NA Protected
5 Mang RC, BW, SKS NA Protected
6 SKS BABO,RD NA Protected
11 BW RC 12 Protected
21 RC BW 12 Protected
31 SKS BW,RC 12 Protected
41 BW SKS,RC 12 Protected
51 RC SKS,BW 12 Protected
61 BABO <NA> 12 Protected
7 SKS <NA> 12 Protected
8 RC <NA> 12 Protected
9 SKS RC 12 Protected
10 RC SKS MONKEYS 12 Protected
53 BW SKS,BW 12 Protected
57 BW RC,BW 12 Protected
58 Mang SKS,RC 12 Protected
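If you would rather keep your original base R strsplit() approach, a small change to the regex may be enough. The last three rows use a comma with no space after it, so `,\s` never matches and the whole string fails the dictionary lookup. The sketch below makes the whitespace optional with `,\s*`. The names `assoc`, `lookup`, and `regex_fixed` are made up for illustration, and `lookup` is only a small subset of your dictionary:

```r
# Hypothetical mini-example: a few Association values from the question
assoc <- c("sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys",
           "Sykes, rc", "SKS/BW")

# Assumed subset of the dictionary, written as a named character vector
lookup <- c("SYKES" = "SKS", "SKS" = "SKS",
            "B/W COLOBUS" = "BW", "BW" = "BW",
            "R/C MONKEYS" = "RC", "RC" = "RC")

# Split on "/" when preceded by two word characters, or on a comma
# followed by zero or more whitespace characters (\\s* instead of \\s)
regex_fixed <- "(?<=\\w{2})/|,\\s*"

parts <- strsplit(toupper(assoc), regex_fixed, perl = TRUE)

# Translate each piece through the lookup and glue the codes back together
cleaned <- vapply(parts, function(x) toString(unname(lookup[x])), character(1))
print(cleaned)
# "SKS, BW" "RC, BW" "SKS, RC" "SKS, RC" "SKS, BW"
```

Any piece that has no uppercase entry in the dictionary still comes back as NA, so the dictionary needs to cover every variant either way.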