Merge rows upwards when cells in a column have missing values
I read some data from a badly formatted PDF table in which cells sometimes span multiple pages. This left me with a data frame that looks similar to this:
company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)
text <- c("some_text", "text that should be in the above cell","some_text", "some_text", "some_text","text that should be in the above cell")
more_text <- c("some_text", "text that should be in the above cell", "some_text", "some_text", "some_text","text that should be in the above cell")
df <- data.frame(company_name, text, more_text)
company_name | text | more_text |
---|---|---|
company_a | some_text | some_text |
NA | text that should be in the above cell | text that should be in the above cell |
company_a | some_text | some_text |
company_b | some_text | some_text |
company_b | some_text | some_text |
NA | text that should be in the above cell | text that should be in the above cell |
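How can I merge the rows that have a missing value where the "company_name" should be into the row above them, for every row that starts with NA, so that the result looks more like this: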
company_name | text | more_text |
---|---|---|
company_a | some_text + text that should be in the above cell | some_text + text that should be in the above cell |
company_a | some_text | some_text |
company_b | some_text | some_text |
company_b | some_text + text that should be in the above cell | some_text + text that should be in the above cell |
I tried the unheadr package, but I can't seem to figure out the right function to use.
Edit: redid the example for clarity.
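We create a logical column ('ind') that flags the NA elements of 'company_name'. We then create 'grp' by taking 'ind' OR (|) the lead of that column and converting the result to a numeric index with rleid, so that each row and the NA row that follows it share one index. Next, fill replaces the NA elements with the previous non-NA value in 'company_name'. Finally, we group by 'company_name' and 'grp' and use summarise with across on the other columns to paste the elements together (str_c with collapse).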
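library(dplyr)
library(tidyr)
library(stringr)
library(data.table)  # used here only for rleid()

df %>%
  # flag the spill-over rows, then give each row and its spill-over row one run id
  mutate(ind = is.na(company_name),
         grp = rleid(ind | lead(ind))) %>%
  # carry the last seen company name down into the NA rows
  fill(company_name) %>%
  group_by(company_name, grp) %>%
  # paste the pieces of every text column back together within each group
  summarise(across(contains('text'), ~ str_c(.x, collapse = " + ")), .groups = 'drop') %>%
  select(-grp)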
# A tibble: 4 x 3
# company_name text more_text
# <chr> <chr> <chr>
#1 company_a some_text + text that should be in the above cell some_text + text that should be in the above cell
#2 company_a some_text some_text
#3 company_b some_text some_text
#4 company_b some_text + text that should be in the above cell some_text + text that should be in the above cell
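To see why the ind | lead(ind) step yields the right groups, it can help to compute the intermediate vectors by hand. The following is a minimal base-R sketch, not the code from the answer: next_ind and flag are names made up for illustration, and the shifted vector and the cumsum() line are stand-ins for dplyr::lead() and data.table::rleid() that only behave the same when flag contains no NA, which holds for this example because its last row is an NA row.

```r
company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)

# TRUE for the spill-over rows that belong to the row above
ind <- is.na(company_name)

# the same vector shifted up by one position (stand-in for dplyr::lead(ind))
next_ind <- c(ind[-1], NA)

# TRUE for a spill-over row and for the row directly before one;
# note that TRUE | NA is TRUE in R, so the last element stays TRUE here
flag <- ind | next_ind

# start a new id whenever flag changes (stand-in for data.table::rleid(flag))
grp <- cumsum(c(TRUE, flag[-1] != flag[-length(flag)]))

print(data.frame(company_name, ind, flag, grp))
```

Rows 1 and 2 share one id and rows 5 and 6 share another. Rows 3 and 4 also share an id, but they are split apart later because company_name is part of the grouping as well.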
Data
df <- data.frame(company_name, text, more_text)
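A possible alternative for the grouping index, offered as a sketch under stated assumptions rather than as part of the answer above: if every non-NA company_name marks the start of a new logical row, then cumsum(!is.na(company_name)) numbers those rows directly. The helper collapse_by is hypothetical, and the sketch assumes the first row is never NA.

```r
# Rebuild the example data from the question.
df <- data.frame(
  company_name = c("company_a", NA, "company_a", "company_b", "company_b", NA),
  text = c("some_text", "text that should be in the above cell", "some_text",
           "some_text", "some_text", "text that should be in the above cell"),
  more_text = c("some_text", "text that should be in the above cell", "some_text",
                "some_text", "some_text", "text that should be in the above cell"),
  stringsAsFactors = FALSE
)

# Every non-NA name opens a new group; NA rows keep the id of the row above.
grp <- cumsum(!is.na(df$company_name))

# Hypothetical helper: paste the values of x together within each group of g.
collapse_by <- function(x, g) {
  unname(vapply(split(x, g), paste, character(1), collapse = " + "))
}

merged <- data.frame(
  company_name = df$company_name[!is.na(df$company_name)],
  text = collapse_by(df$text, grp),
  more_text = collapse_by(df$more_text, grp),
  stringsAsFactors = FALSE
)

print(merged)
```

On the example data this gives the same four rows as the desired output. One difference in behaviour: because a new group starts at every non-NA name, two consecutive "row plus spill-over row" pairs of the same company stay separate, whereas the run-length index combined with the filled company_name would put all four rows into one group.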