Conditional string concatenation in same column in R

Question

I'm new to R and have a data frame with a very large, irregular column that looks like this:

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))

section
BOOK I: Introduction
Page one: presentation
Page two: acknowledgments
MAGAZINE II: Considerations 
Page one: characters
Page two: index
BOOK III: General Principles
BOOK III: General Principles
Page one: invitation

I need to concatenate this column so that it looks like this:

section
BOOK I: Introduction 
BOOK I: Introduction / Page one: presentation
BOOK I: Introduction / Page two: acknowledgments
MAGAZINE II: Considerations
MAGAZINE II: Considerations / Page one: characters
MAGAZINE II: Considerations / Page two: index
BOOK III: General Principles
BOOK III: General Principles
BOOK III: General Principles / Page one: invitation

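Basically, the goal is to pick up the value of the string above based on a condition and then concatenate it with the strings below it, perhaps using a regular expression to identify the values, but I really don't know how to do this.

Thanks in advance.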

Answer 1

You can do it like this:

unlist(lapply(split(x$section, cumsum(grepl('^[A-Z]{3}', x$section))), 
              function(y) {
                  if(length(y) == 1) return(y)
                  else c(y[1], paste(y[1], y[-1], sep = " / "))
                }), use.names = FALSE)
#> [1] "BOOK I: Introduction"                               
#> [2] "BOOK I: Introduction / Page one: presentation"      
#> [3] "BOOK I: Introduction / Page two: acknowledgments"   
#> [4] "MAGAZINE II: Considerations"                        
#> [5] "MAGAZINE II: Considerations / Page one: characters" 
#> [6] "MAGAZINE II: Considerations / Page two: index"      
#> [7] "BOOK III: General Principles"                       
#> [8] "BOOK III: General Principles"                       
#> [9] "BOOK III: General Principles / Page one: invitation"
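
To see why this works, it can help to look at the intermediate pieces. The sketch below reuses the question's data as a plain character vector (the names `section`, `group_id` and `groups` are only illustrative) and shows the grouping step, plus the likely reason for the `length(y) == 1` special case.

```r
section <- c("BOOK I: Introduction", "Page one: presentation",
             "Page two: acknowledgments", "MAGAZINE II: Considerations",
             "Page one: characters", "Page two: index",
             "BOOK III: General Principles", "BOOK III: General Principles",
             "Page one: invitation")

# cumsum() over the title test gives every row the id of the title above it
group_id <- cumsum(grepl("^[A-Z]{3}", section))
group_id
#> [1] 1 1 1 2 2 2 3 4 4

# split() returns one character vector per group id
groups <- split(section, group_id)
groups[["3"]]
#> [1] "BOOK III: General Principles"

# For a title with no pages, y[-1] is empty, and paste() recycles a
# zero-length argument to "", which would leave a dangling separator
y <- groups[["3"]]
paste(y[1], y[-1], sep = " / ")
#> [1] "BOOK III: General Principles / "
```

Without the early `return(y)`, a title that has no pages under it would therefore come back twice, once as is and once with a trailing " / ".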

Answer 2

Here is one approach:

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))

x <- dplyr::mutate(x,
  isSection = stringr::str_starts(section, "Page", negate = TRUE),
  sectionNum = cumsum(isSection)
) |> 
  dplyr::group_by(sectionNum) |> 
  dplyr::mutate(newSection = dplyr::if_else(
    condition = isSection, 
    true = section, 
    false = paste(dplyr::first(section), section, sep = " / ")
  )) |>
  dplyr::ungroup()

x
#> # A tibble: 9 × 4
#>   section                      isSection sectionNum newSection                  
#>   <chr>                        <lgl>          <int> <chr>                       
#> 1 BOOK I: Introduction         TRUE               1 BOOK I: Introduction        
#> 2 Page one: presentation       FALSE              1 BOOK I: Introduction / Page…
#> 3 Page two: acknowledgments    FALSE              1 BOOK I: Introduction / Page…
#> 4 MAGAZINE II: Considerations  TRUE               2 MAGAZINE II: Considerations 
#> 5 Page one: characters         FALSE              2 MAGAZINE II: Considerations…
#> 6 Page two: index              FALSE              2 MAGAZINE II: Considerations…
#> 7 BOOK III: General Principles TRUE               3 BOOK III: General Principles
#> 8 BOOK III: General Principles TRUE               4 BOOK III: General Principles
#> 9 Page one: invitation         FALSE              4 BOOK III: General Principle…

Created on 2022-03-25 by the reprex package (v2.0.1)

Here, we first determine whether section is a section title or a page title, and store the result as TRUE or FALSE.

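Then we use cumsum() (cumulative sum) to mark which pages belong to which section. When TRUE and FALSE values are summed, TRUE (here, a section) becomes 1 and increments the cumulative sum, while FALSE (here, a page) becomes 0 and leaves it unchanged, so all pages in a given section receive the same value.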

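Finally, we create a new section variable, this time using group_by() and if_else() to set the value conditionally. If isSection is TRUE, we simply keep the existing value of section (the section title). If isSection is FALSE, we concatenate the first value of section in the group with the existing value of section, separated by " / ".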
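
The three steps described above can also be traced in base R. This is only a sketch of the same idea, not the dplyr code from this answer: `startsWith()`, `match()` and `ifelse()` are stand-ins for `stringr::str_starts()`, the grouped `dplyr::first()` and `dplyr::if_else()`, and the name `title` is introduced here purely for illustration.

```r
section <- c("BOOK I: Introduction", "Page one: presentation",
             "Page two: acknowledgments", "MAGAZINE II: Considerations",
             "Page one: characters", "Page two: index",
             "BOOK III: General Principles", "BOOK III: General Principles",
             "Page one: invitation")

# Step 1: TRUE for section titles, FALSE for page titles
isSection <- !startsWith(section, "Page")

# Step 2: TRUE counts as 1 and FALSE as 0, so the running total
# only increases on title rows
as.integer(isSection)
#> [1] 1 0 0 1 0 0 1 1 0
sectionNum <- cumsum(isSection)
sectionNum
#> [1] 1 1 1 2 2 2 3 4 4

# Step 3: match(sectionNum, sectionNum) is the index of the first row of
# each group, which is always the title row
title <- section[match(sectionNum, sectionNum)]
newSection <- ifelse(isSection, section, paste(title, section, sep = " / "))
newSection[9]
#> [1] "BOOK III: General Principles / Page one: invitation"
```

Note that the duplicated "BOOK III" title on rows 7 and 8 yields two separate groups (3 and 4), which is why row 8 is kept unchanged rather than being treated as a page.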

Answer 3

Using data.table:

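library(data.table)
library(magrittr)  # provides %>%, which data.table does not export

# copy the title rows (those not starting with "Page") into header,
# fill them downward, then prepend them to the page rows
setDT(x)[!grepl("^Page.", section), header := section] %>% 
  .[, header := zoo::na.locf(header)] %>% 
  .[section != header, header := paste0(header, " / ", section)] %>% 
  .[, .(section = header)] %>% 
  .[]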

1:                                BOOK I: Introduction
2:       BOOK I: Introduction / Page one: presentation
3:    BOOK I: Introduction / Page two: acknowledgments
4:                         MAGAZINE II: Considerations
5:  MAGAZINE II: Considerations / Page one: characters
6:       MAGAZINE II: Considerations / Page two: index
7:                        BOOK III: General Principles
8:                        BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
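
The key step in this answer is `zoo::na.locf()`, which carries the last non-missing value forward over the `NA`s. The base R sketch below imitates that behaviour on a toy vector. The `cummax()` indexing trick is an assumption made for illustration, not a description of what zoo does internally, and it presumes the first element is not `NA`.

```r
# toy header column: titles on rows 1 and 4, NA on the page rows
header <- c("BOOK I", NA, NA, "MAGAZINE II", NA)

# position of each non-missing value, 0 where the value is missing
pos <- seq_along(header) * !is.na(header)

# cummax() turns that into "index of the most recent non-missing value"
last_seen <- cummax(pos)
last_seen
#> [1] 1 1 1 4 4

# indexing with it fills every NA with the title above it
filled <- header[last_seen]
```

After this, rows 2 and 3 hold "BOOK I" and row 5 holds "MAGAZINE II", which is the state the answer's chain is in just before the `paste0()` step.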

Answer 4

A rolling join can achieve this. In data.table:


library( data.table )

# convert x to a data.table by reference; the := calls below require it
setDT( x )

# add a row column for joining by reference
x[ , row := .I ]

# pick out just the title rows. It looks like these start with either "BOOK" or "MAGAZINE"
books_magazines <- x[ grepl("^BOOK|^MAGAZINE", section),
                      .(row, book_magazine = section) ]

# join the 2 tables, using a rolling join to add the title row to subsequent rows
both_cols <- books_magazines[ x, on = .(row), roll = TRUE ]

# concatenate the 2 columns together where necessary, leave it alone if it's the title row
result <- both_cols[ , .(
    section_string = fifelse( book_magazine == section,
                              book_magazine,
                              sprintf("%s / %s", book_magazine, section) )
) ]

This gives:

> result$section_string

[1] "BOOK I: Introduction"                               
[2] "BOOK I: Introduction / Page one: presentation"      
[3] "BOOK I: Introduction / Page two: acknowledgments"   
[4] "MAGAZINE II: Considerations"                        
[5] "MAGAZINE II: Considerations / Page one: characters" 
[6] "MAGAZINE II: Considerations / Page two: index"      
[7] "BOOK III: General Principles"                       
[8] "BOOK III: General Principles"                       
[9] "BOOK III: General Principles / Page one: invitation"
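
For readers who have not used `roll = TRUE` before, here is a minimal sketch of the rolling join on its own. The two toy tables and their names (`titles`, `all_rows`, `rolled`) are hypothetical and exist only to isolate the join step.

```r
library(data.table)

# hypothetical toy tables: titles sit on rows 1 and 4 of a 6-row document
titles   <- data.table(row = c(1L, 4L), title = c("BOOK I", "MAGAZINE II"))
all_rows <- data.table(row = 1:6)

# for each row in all_rows, take the title with the same row number or,
# failing that, roll the most recent earlier title forward
rolled <- titles[all_rows, on = .(row), roll = TRUE]
rolled$title
```

Rows 1 to 3 should come back with "BOOK I" and rows 4 to 6 with "MAGAZINE II". Without `roll = TRUE`, only the exact matches on rows 1 and 4 would get a title and the rest would be `NA`.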

Answer 5

A slightly simpler data.table approach:

library(data.table)
setDT(x)

x[, g := cumsum(grepl('(BOOK|MAGAZINE)', section))]
x[, section := ifelse(seq_along(section) == 1,
    section, paste(section[1], section, sep = ' / ')), by = .(g)]
x[, g := NULL]

The output is:

> x
                                               section
1:                                BOOK I: Introduction
2:       BOOK I: Introduction / Page one: presentation
3:    BOOK I: Introduction / Page two: acknowledgments
4:                         MAGAZINE II: Considerations
5:  MAGAZINE II: Considerations / Page one: characters
6:       MAGAZINE II: Considerations / Page two: index
7:                        BOOK III: General Principles
8:                        BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation

string

r

stringi