How to collapse and aggregate rows with character / long string data
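I am working with a large dataset that contains a lot of text data that needs to be merged. The cases/observations are supposed to be unique, but there are duplicates. The problem is that a duplicate case sometimes provides complementary new information. I would therefore like to collapse/merge cases based on conditions.
I have a very small example dataset here that illustrates the idea. Note that in the real data varText is usually longer than 1000 characters.
varID represents the target unique observation.
varCat is a categorical variable; sometimes it is NA and sometimes it supplements an observation (in the real data I have about 10 such variables).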
varID <- c('a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'c', 'd', 'e', 'a', 'z')
varText <- c('This is a long text', 'This is also a long text',
'This is short', 'This is another unique long text',
'Blabla1', 'Blabla2', 'Blabla3', 'Blabla4', 'Blabla5', 'Blabla6', 'Blabla7',
'Blabla8', 'This is also a long blabla', 'This case is perfectly fine')
varCat <- c('CatA', 'CatB', NA, 'CatC', 'CatA', NA, NA, 'CatC', 'CatA', 'CatB', NA, 'CatC', NA, 'CatF')
df <- data.frame(varID, varText, varCat, stringsAsFactors = FALSE)
Sample df:
varID varText varCat
1 a This is a long text CatA
2 b This is also a long text CatB
3 c This is short <NA>
4 d This is another unique long text CatC
5 e Blabla1 CatA
6 a Blabla2 <NA>
7 b Blabla3 <NA>
8 c Blabla4 CatC
9 d Blabla5 CatA
10 c Blabla6 CatB
11 d Blabla7 <NA>
12 e Blabla8 CatC
13 a This is also a long blabla <NA>
14 z This case is perfectly fine CatF
First I identify all duplicated cases:
library(dplyr)
df <- df %>% add_count(varID, name = 'dupe_varID')
Then I also want to compare the texts by length:
df$text_length <- stringr::str_length(df$varText)
Finally, I create a new data frame containing only the duplicated cases. I think I can use group_by from dplyr, but I don't know how to proceed from here.
# filter all duplicated cases into new df sort ???
df2 <- df %>% filter(dupe_varID > 1) %>% group_by(varID) %>% arrange(desc(text_length), varCat)
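I would like the following result:
- the longest varText should be kept
- NA values are replaced with non-NA values
- duplicates are removed
- if there is a conflict in varCat, the case with the longest text provides varCat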
1 a This is also a long blabla CatA
2 b This is also a long text CatB
3 c This is short CatC
4 d This is another unique long text CatC
5 e Blabla1 CatA
14 z This case is perfectly fine CatF
One option is to group by 'varID', fill the NA elements of varCat with the adjacent non-NA elements, and then slice the row with the maximum number of characters (nchar) in 'varText':
library(dplyr)
library(tidyr)
df %>%
group_by(varID) %>%
fill(varCat, .direction = 'downup') %>%
slice(which.max(nchar(varText)))
# A tibble: 6 x 3
# Groups: varID [6]
# varID varText varCat
# <chr> <chr> <chr>
#1 a This is also a long blabla CatA
#2 b This is also a long text CatB
#3 c This is short CatC
#4 d This is another unique long text CatC
#5 e Blabla1 CatA
#6 z This case is perfectly fine CatF
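The question mentions roughly 10 categorical columns, so the same idea may need to cover several columns at once. The following is only a minimal sketch under assumptions: the data frame df_multi and the column varCat2 are invented for illustration, and it assumes all categorical columns share the prefix "varCat" so that starts_with() can select them. It also swaps in slice_max() (dplyr 1.0.0 or later) for slice(which.max(...)), which makes the tie handling explicit.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data; varCat2 is an invented second categorical column
df_multi <- data.frame(
  varID   = c("a", "a", "b", "b"),
  varText = c("short", "a much longer text", "only text here", "x"),
  varCat  = c("CatA", NA, NA, "CatB"),
  varCat2 = c(NA, "K1", "K2", NA),
  stringsAsFactors = FALSE
)

collapsed <- df_multi %>%
  group_by(varID) %>%
  # fill NAs in every column whose name starts with "varCat",
  # first downwards, then upwards, within each varID group
  fill(starts_with("varCat"), .direction = "downup") %>%
  # keep exactly one row per group: the one with the longest varText;
  # with_ties = FALSE keeps only the first row if lengths are equal
  slice_max(nchar(varText), n = 1, with_ties = FALSE) %>%
  ungroup()

print(collapsed)
```

Because fill() only replaces NA values and never overwrites existing ones, the row with the longest text keeps its own category whenever it has one, which matches the conflict rule in the question. When that row's category is NA, it receives the nearest non-NA value in row order, not necessarily the value from the second-longest text.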