Conditional string concatenation in same column in R
I'm new to R and have a very large, irregular column in a data frame, like this:
x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))
section
BOOK I: Introduction
Page one: presentation
Page two: acknowledgments
MAGAZINE II: Considerations
Page one: characters
Page two: index
BOOK III: General Principles
BOOK III: General Principles
Page one: invitation
I need to concatenate this column so that it looks like this:
section
BOOK I: Introduction
BOOK I: Introduction / Page one: presentation
BOOK I: Introduction / Page two: acknowledgments
MAGAZINE II: Considerations
MAGAZINE II: Considerations / Page one: characters
MAGAZINE II: Considerations / Page two: index
BOOK III: General Principles
BOOK III: General Principles
BOOK III: General Principles / Page one: invitation
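Basically, the goal is to pick up the value of the upper string based on a condition (probably with a regular expression) and then concatenate it with the strings below it, but I really don't know how to do this.
Thanks in advance.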
You can do it like this:
unlist(lapply(split(x$section, cumsum(grepl('^[A-Z]{3}', x$section))),
function(y) {
if(length(y) == 1) return(y)
else c(y[1], paste(y[1], y[-1], sep = " / "))
}), use.names = FALSE)
#> [1] "BOOK I: Introduction"
#> [2] "BOOK I: Introduction / Page one: presentation"
#> [3] "BOOK I: Introduction / Page two: acknowledgments"
#> [4] "MAGAZINE II: Considerations"
#> [5] "MAGAZINE II: Considerations / Page one: characters"
#> [6] "MAGAZINE II: Considerations / Page two: index"
#> [7] "BOOK III: General Principles"
#> [8] "BOOK III: General Principles"
#> [9] "BOOK III: General Principles / Page one: invitation"
Here is one way:
x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))
x <- dplyr::mutate(x,
    isSection = stringr::str_starts(section, "Page", negate = TRUE),
    sectionNum = cumsum(isSection)
  ) |>
  dplyr::group_by(sectionNum) |>
  dplyr::mutate(newSection = dplyr::if_else(
    condition = isSection,
    true = section,
    false = paste(dplyr::first(section), section, sep = " / ")
  )) |>
  dplyr::ungroup()  # namespaced like the other calls, since dplyr is not attached
x
#> # A tibble: 9 × 4
#>   section                      isSection sectionNum newSection
#>   <chr>                        <lgl>          <int> <chr>
#> 1 BOOK I: Introduction         TRUE               1 BOOK I: Introduction
#> 2 Page one: presentation       FALSE              1 BOOK I: Introduction / Page…
#> 3 Page two: acknowledgments    FALSE              1 BOOK I: Introduction / Page…
#> 4 MAGAZINE II: Considerations  TRUE               2 MAGAZINE II: Considerations
#> 5 Page one: characters         FALSE              2 MAGAZINE II: Considerations…
#> 6 Page two: index              FALSE              2 MAGAZINE II: Considerations…
#> 7 BOOK III: General Principles TRUE               3 BOOK III: General Principles
#> 8 BOOK III: General Principles TRUE               4 BOOK III: General Principles
#> 9 Page one: invitation         FALSE              4 BOOK III: General Principle…
Created on 2022-03-25 by the reprex package (v2.0.1)
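Here, we first determine whether section is a section title or a page title, and store the result as TRUE or FALSE.
Then we use cumsum() (cumulative sum) to mark which pages belong to which section. When TRUE and FALSE values are summed, TRUE (here, a section) becomes 1 and increments the running total, while FALSE (here, a page) becomes 0 and leaves it unchanged, so every page within a given section receives the same value.
Finally, we create a new section variable, this time using group_by() and if_else() to set the value conditionally. If isSection is TRUE, we simply keep the existing value of section (the section title). If isSection is FALSE, we concatenate the first value of section in the group with the existing value of section, separated by " / ".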
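The cumsum() step can be checked in isolation. The following is a minimal base R sketch on a shortened copy of the example column; the names section, is_section, and group are illustrative only and do not appear in the answer's code.

```r
# Shortened copy of the example data (illustrative only)
section <- c("BOOK I: Introduction",
             "Page one: presentation",
             "Page two: acknowledgments",
             "MAGAZINE II: Considerations",
             "Page one: characters")

# TRUE for title rows, FALSE for page rows
is_section <- !startsWith(section, "Page")

# TRUE counts as 1 and FALSE as 0, so the running total only
# increases on a title row; each page inherits its title's number
group <- cumsum(is_section)

print(data.frame(section, is_section, group))
```

Rows that share a value of group belong to the same title, which is what makes the later group_by() step possible.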
Using data.table:
library(data.table)
library(magrittr)  # provides %>%, which data.table does not export
setDT(x)[!grepl("^Page.", section), header := section] %>%
  .[, header := zoo::na.locf(header)] %>%
  .[section != header, header := paste0(header, " / ", section)] %>%
  .[, .(section = header)] %>%
  .[]
1: BOOK I: Introduction
2: BOOK I: Introduction / Page one: presentation
3: BOOK I: Introduction / Page two: acknowledgments
4: MAGAZINE II: Considerations
5: MAGAZINE II: Considerations / Page one: characters
6: MAGAZINE II: Considerations / Page two: index
7: BOOK III: General Principles
8: BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
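The zoo::na.locf() call in this answer fills each NA in header with the most recent non-missing value. If zoo is not installed, the same fill can be sketched in base R. This is an assumed alternative, not part of the original answer, and it relies on the first element being non-missing (true here, because the column starts with a title). The names header, last_seen, and filled are illustrative.

```r
# A header column as it looks after the first step: titles on
# title rows, NA on page rows (illustrative data)
header <- c("BOOK I: Introduction", NA, NA,
            "MAGAZINE II: Considerations", NA)

# Index of the most recent non-NA element at each position:
# non-NA positions keep their index, NA positions become 0,
# and cummax() carries the last non-zero index forward
last_seen <- cummax(seq_along(header) * !is.na(header))

filled <- header[last_seen]
print(filled)
```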
A rolling join can do this. In data.table:
library(data.table)
setDT(x)  # convert x to a data.table by reference; := needs a data.table
# add a row-number column to join on
x[, row := .I]
# pick out just the title rows. It looks like these start with either "BOOK" or "MAGAZINE"
books_magazines <- x[ grepl("^BOOK|^MAGAZINE", section),
.(row, book_magazine = section) ]
# join the 2 tables, using a rolling join to add the title row to subsequent rows
both_cols <- books_magazines[ x, on = .(row), roll = TRUE ]
# concatenate the 2 columns together where necessary, leave it alone if it's the title row
result <- both_cols[ , .(
section_string = fifelse( book_magazine == section,
book_magazine,
sprintf("%s / %s", book_magazine, section) )
) ]
This gives:
> result$section_string
[1] "BOOK I: Introduction"
[2] "BOOK I: Introduction / Page one: presentation"
[3] "BOOK I: Introduction / Page two: acknowledgments"
[4] "MAGAZINE II: Considerations"
[5] "MAGAZINE II: Considerations / Page one: characters"
[6] "MAGAZINE II: Considerations / Page two: index"
[7] "BOOK III: General Principles"
[8] "BOOK III: General Principles"
[9] "BOOK III: General Principles / Page one: invitation"
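To see what the rolling join contributes on its own, it can help to print the joined table before the concatenation step. The sketch below repeats the join on a four-row toy table; it assumes data.table is installed, and the names toy, titles, and joined are illustrative rather than taken from the answer.

```r
library(data.table)

# Toy version of the example column (illustrative only)
toy <- data.table(section = c("BOOK I: Introduction",
                              "Page one: presentation",
                              "MAGAZINE II: Considerations",
                              "Page one: characters"))
toy[, row := .I]

# Title rows only, keyed by their row number
titles <- toy[grepl("^BOOK|^MAGAZINE", section), .(row, title = section)]

# For each row of toy, take the title with the same row number or,
# failing that, the nearest earlier one (roll = TRUE rolls forward)
joined <- titles[toy, on = .(row), roll = TRUE]
print(joined)
```

Each page row ends up next to the title that precedes it, so the final step only has to decide whether to paste the two columns together.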
A slightly simpler data.table approach:
library(data.table)
setDT(x)
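# anchor the pattern so a page whose text merely contains BOOK or MAGAZINE does not start a new group
x[, g := cumsum(grepl('^(BOOK|MAGAZINE)', section))]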
x[, section := ifelse(seq_along(section) == 1,
section, paste(section[1], section, sep = ' / ')), by = .(g)]
x[, g := NULL]
The output is:
> x
section
1: BOOK I: Introduction
2: BOOK I: Introduction / Page one: presentation
3: BOOK I: Introduction / Page two: acknowledgments
4: MAGAZINE II: Considerations
5: MAGAZINE II: Considerations / Page one: characters
6: MAGAZINE II: Considerations / Page two: index
7: BOOK III: General Principles
8: BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation