Merge rows upwards where the value in a column is missing

I have read some data from a badly formatted PDF table, where cells sometimes span multiple pages. This left me with a data frame that looks similar to this:

company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)
text <- c("some_text", "text that should be in the above cell","some_text",  "some_text", "some_text","text that should be in the above cell")
more_text <- c("some_text", "text that should be in the above cell", "some_text",  "some_text", "some_text","text that should be in the above cell")
df <- data.frame(company_name, text, more_text)
company_name  text                                    more_text
company_a     some_text                               some_text
NA            text that should be in the above cell   text that should be in the above cell
company_a     some_text                               some_text
company_b     some_text                               some_text
company_b     some_text                               some_text
NA            text that should be in the above cell   text that should be in the above cell

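How can I merge the rows with missing values into the row above, leaving "company_name" as it is, so that the result looks more like this? It should also work for all the rows that start with NA: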
company_name  text                                                more_text
company_a     some_text + text that should be in the above cell   some_text + text that should be in the above cell
company_a     some_text                                           some_text
company_b     some_text                                           some_text
company_b     some_text + text that should be in the above cell   some_text + text that should be in the above cell

I tried the unheadr package, but I can't seem to figure out the right function to use.

Edit: redid the example for more clarity.

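We create a logical column ('ind') that flags the NA elements, then create 'grp' by converting 'ind' OR (|) the lead of that column into a numeric index with rleid. We use fill to replace the NA elements in 'company_name' with the previous non-NA value, then group by those columns and summarise across the other columns, pasting the elements together.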

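library(dplyr)
library(tidyr)
library(stringr)
library(data.table)

df %>% 
  # flag the rows whose company_name is missing
  mutate(ind = is.na(company_name), 
         # run-length id over "this row is NA, or the next row is NA"
         grp = rleid(ind | lead(ind))) %>%
  # carry the last non-NA company_name down onto the NA rows
  fill(company_name) %>% 
  group_by(company_name, grp) %>% 
  # collapse every text column within each group
  summarise(across(contains('text'), ~ str_c(.x, collapse = " + ")), .groups = 'drop') %>% 
  select(-grp)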
# A tibble: 4 x 3
#  company_name text                                              more_text                                        
#  <chr>        <chr>                                             <chr>                                            
#1 company_a    some_text + text that should be in the above cell some_text + text that should be in the above cell
#2 company_a    some_text                                         some_text                                        
#3 company_b    some_text                                         some_text                                        
#4 company_b    some_text + text that should be in the above cell some_text + text that should be in the above cell
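
To make the grouping logic easier to follow, the helper columns can be printed before the summarise step. This is only an illustrative sketch: the column name flag and the object name steps are introduced here for the example and are not part of the code above.

```r
library(dplyr)
library(data.table)  # only needed here for rleid()

# Same example data as in the question
company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)
text <- c("some_text", "text that should be in the above cell", "some_text",
          "some_text", "some_text", "text that should be in the above cell")
more_text <- text
df <- data.frame(company_name, text, more_text)

steps <- df %>%
  mutate(ind  = is.na(company_name),  # TRUE on continuation rows
         flag = ind | lead(ind),      # TRUE on a continuation row or the row right above one
         grp  = rleid(flag)) %>%      # run id: increases whenever flag changes
  select(company_name, ind, flag, grp)

print(steps)
# ind:  FALSE TRUE FALSE FALSE FALSE TRUE
# flag: TRUE  TRUE FALSE FALSE TRUE  TRUE
# grp:  1     1    2     2     3     3
```

Rows 3 and 4 share grp 2 even though they belong to different companies, which is why company_name is also part of the group_by call.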

Data

df <- data.frame(company_name, text, more_text)
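
A possible alternative, offered as a sketch and not as part of the original answer: if every non-NA company_name marks the start of a new logical row, a running count of non-NA values gives the group id directly, without data.table. This assumes the first row is never a continuation row; merged is a name chosen for illustration. Unlike the rleid grouping, it keeps two consecutive complete rows of the same company separate.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)
text <- c("some_text", "text that should be in the above cell", "some_text",
          "some_text", "some_text", "text that should be in the above cell")
more_text <- text
df <- data.frame(company_name, text, more_text)

merged <- df %>%
  # every non-NA name opens a new group; NA rows inherit the current group id
  mutate(grp = cumsum(!is.na(company_name))) %>%
  # carry the name down onto the continuation rows
  fill(company_name) %>%
  group_by(grp, company_name) %>%
  # paste the text columns together within each group
  summarise(across(contains("text"), ~ paste(.x, collapse = " + ")),
            .groups = "drop") %>%
  select(-grp)

print(merged)
```

Because grp increases from top to bottom, the summarised rows come out in their original order.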