Merge rows upwards, while missing values in cells of column

Question

I read some data from a badly formatted PDF table in which cells sometimes span multiple pages. This leaves me with a data frame that looks similar to this:

company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)
text <- c("some_text", "text that should be in the above cell","some_text",  "some_text", "some_text","text that should be in the above cell")
more_text <- c("some_text", "text that should be in the above cell", "some_text",  "some_text", "some_text","text that should be in the above cell")
df <- data.frame(company_name, text, more_text)

company_name	text	more_text
company_a	some_text	some_text
NA	text that should be in the above cell	text that should be in the above cell
company_a	some_text	some_text
company_b	some_text	some_text
company_b	some_text	some_text
NA	text that should be in the above cell	text that should be in the above cell

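How can I merge the rows that have a missing value where "company_name" should be into the row above, so that the result looks more like this? The solution should also work through every row that starts with NA: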

company_name	text	more_text
company_a	some_text + text that should be in the above cell	some_text + text that should be in the above cell
company_a	some_text	some_text
company_b	some_text	some_text
company_b	some_text + text that should be in the above cell	some_text + text that should be in the above cell

I tried the unheadr package, but I can't seem to figure out the right function to use.

Edit: redid the example for more clarity.

Answer 1

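We create a logical column from the NA elements ('ind'), then create 'grp' by converting 'ind' OR (|) the lead of that column to a numeric index with rleid. Next, fill replaces the NA elements in 'company_name' with the previous non-NA value. Finally, we group by 'company_name' and 'grp' and use summarise with across on the other columns to paste (str_c) the elements together.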
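To make the grouping step concrete, here is a minimal base-R sketch of how 'ind' and 'grp' come out for the example data. The lead and run-length-id steps are emulated by hand, as stand-ins for dplyr::lead and data.table::rleid, and the names lead_ind and flag are only illustrative, not part of the answer.

```r
company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)

# TRUE for the continuation rows (missing company_name)
ind <- is.na(company_name)

# hand-rolled lead(): each element sees its successor, the last one sees NA
lead_ind <- c(ind[-1], NA)

# TRUE if the row is a continuation row or is followed by one
# (TRUE | NA is TRUE in R, so the last row stays flagged)
flag <- ind | lead_ind

# hand-rolled run-length id: the counter increases whenever flag changes
grp <- cumsum(c(TRUE, flag[-1] != flag[-length(flag)]))

data.frame(company_name, ind, flag, grp)
# grp: 1 1 2 2 3 3
```

Groups 1 and 3 each hold a row together with its continuation row. Group 2 holds two unrelated rows, which are kept apart only because the pipeline also groups by the filled-in 'company_name'.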

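library(dplyr)
library(tidyr)       # fill()
library(stringr)     # str_c()
library(data.table)  # rleid()
df %>%
  # flag the rows whose company_name is missing
  mutate(ind = is.na(company_name),
         # run id over "is a continuation row, or is followed by one"
         grp = rleid(ind | lead(ind))) %>%
  # carry the last non-NA company_name down into the NA rows
  fill(company_name) %>%
  group_by(company_name, grp) %>%
  # collapse the text columns of each group into a single string
  summarise(across(contains('text'), ~ str_c(.x, collapse = " + ")), .groups = 'drop') %>%
  select(-grp)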
# A tibble: 4 x 3
#  company_name text                                              more_text                                        
#  <chr>        <chr>                                             <chr>                                            
#1 company_a    some_text + text that should be in the above cell some_text + text that should be in the above cell
#2 company_a    some_text                                         some_text                                        
#3 company_b    some_text                                         some_text                                        
#4 company_b    some_text + text that should be in the above cell some_text + text that should be in the above cell
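As a possible alternative (my own sketch, not the method used in this answer), the group id can be built with cumsum so that every non-NA 'company_name' starts a new record and any NA rows below it are attached to that record. It assumes the first row is never a continuation row. The name merged is only illustrative.

```r
library(dplyr)

txt <- c("some_text", "text that should be in the above cell", "some_text",
         "some_text", "some_text", "text that should be in the above cell")
df <- data.frame(
  company_name = c("company_a", NA, "company_a", "company_b", "company_b", NA),
  text = txt,
  more_text = txt
)

merged <- df %>%
  # each non-NA company_name opens a new group; NA rows inherit the group above
  mutate(grp = cumsum(!is.na(company_name))) %>%
  group_by(grp) %>%
  summarise(
    # the first row of a group is the one that carries the name
    company_name = company_name[1],
    across(contains("text"), ~ paste(.x, collapse = " + ")),
    .groups = "drop"
  ) %>%
  select(-grp)

merged
```

For the example data this yields the same four rows as above. The difference shows up in other layouts: as far as I can tell, the rleid-based grouping would put two back-to-back merged records of the same company (name, NA, name, NA) into a single group, whereas the cumsum id keeps them separate and also copes with several NA rows in a row.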

Data

# company_name, text and more_text are the vectors defined in the question
df <- data.frame(company_name, text, more_text)
