Pivot longer by integrating multiple columns

I want to pivot my data to long format while combining the information from multiple columns.

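Example data: suppose we observe 4 products (id 1:4) in an online shop, along with review comments (comment*) from different customers. One product (id = 2) has only one review comment, while another (id = 4) has 4. For each comment we also observe whether it quotes another user's comment (1 if so, 0 otherwise).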

data = data.frame(id = c(1,2,3,4), n_comments = c(2,1,3,4),
                   comment1 = c("consetetur sadipscing", "Lorem ipsum", "dolor sit ame", "nonumy eirmod "), comment1_quote = c(1,0,0,1),
                   comment2 = c("clita kasd gubergren", NA, "sanctus est", "consetetur sadipscing"), comment2_quote = c(0,NA,0,0),
                   comment3 = c(NA, NA, "invidunt ut labore", "ea rebum"), comment3_quote = c(NA,NA,1,0),
                   comment4 = c(NA, NA, NA, "dolores et ea rebum"), comment4_quote = c(NA,NA,NA,1))

data
  id n_comments              comment1 comment1_quote              comment2 comment2_quote           comment3 comment3_quote            comment4 comment4_quote
1  1          2 consetetur sadipscing              1  clita kasd gubergren              0               <NA>             NA                <NA>             NA
2  2          1           Lorem ipsum              0                  <NA>             NA               <NA>             NA                <NA>             NA
3  3          3         dolor sit ame              0           sanctus est              0 invidunt ut labore              1                <NA>             NA
4  4          4        nonumy eirmod               1 consetetur sadipscing              0           ea rebum              0 dolores et ea rebum              1

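Now we want to convert this data to long format by
  1. giving each comment of each product its own row
  2. adding the information whether the comment quotes another comment
  3. keeping the total number of comments per product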

Here is the target data:

# comments and quote flags are listed product by product so they line up with id
target_data = data.frame(id = c(1,1,2,3,3,3,4,4,4,4), n_comments = c(2,2,1,3,3,3,4,4,4,4),
                   comment = c("consetetur sadipscing", "clita kasd gubergren", "Lorem ipsum", "dolor sit ame", "sanctus est",
                   "invidunt ut labore", "nonumy eirmod ", "consetetur sadipscing", "ea rebum", "dolores et ea rebum"),
                   quote = c(1,0,0,0,0,1,1,0,0,1))

target_data
   id n_comments               comment quote
1   1          2 consetetur sadipscing     1
2   1          2  clita kasd gubergren     0
3   2          1           Lorem ipsum     0
4   3          3         dolor sit ame     0
5   3          3           sanctus est     0
6   3          3    invidunt ut labore     1
7   4          4        nonumy eirmod      1
8   4          4 consetetur sadipscing     0
9   4          4              ea rebum     0
10  4          4   dolores et ea rebum     1

Here is what I tried, but it does not work:

library(tidyr)  # provides pivot_longer and re-exports %>% and starts_with

trial_da = data %>%  tidyr::pivot_longer(cols = starts_with('comment'), values_to = "comment", values_drop_na = TRUE)
Error: Can't combine `comment1` <character> and `comment1_quote` <double>.
Run `rlang::last_error()` to see where the error occurred.

trial_da
Error: object 'trial_da' not found

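This happens because the "quote" columns also start with "comment", so character and numeric columns end up in the same selection. However, I am not sure how to work around this.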
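To make the name clash concrete, here is a minimal base R sketch. The vector `cols` is a hypothetical subset of the column names from the example data, and the two regular expressions are only one possible way to tell the two groups apart:

```r
# Hypothetical subset of the column names in the example data
cols <- c("id", "n_comments", "comment1", "comment1_quote",
          "comment2", "comment2_quote")

# Names beginning with "comment": text and quote columns are mixed together,
# which is the same set that starts_with('comment') selects here
starts <- cols[startsWith(cols, "comment")]

# Text columns only: "comment" followed by digits and nothing else
text_cols <- grep("^comment[0-9]+$", cols, value = TRUE)

# Quote-flag columns only: names ending in "_quote"
quote_cols <- grep("_quote$", cols, value = TRUE)

print(starts)
print(text_cols)
print(quote_cols)
```

Anchoring the pattern with `$` is what keeps `comment1_quote` out of the text columns.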

A data.table approach:

library(data.table)
ans <- setorder(
  melt(setDT(data), 
       id.vars = c("id", "n_comments"), 
       measure.vars = patterns(comment = "comment[0-9]+$", 
                               quote = ".*_quote"),
       na.rm = TRUE), id)
#    id n_comments variable               comment quote
# 1:  1          2        1 consetetur sadipscing     1
# 2:  1          2        2  clita kasd gubergren     0
# 3:  2          1        1           Lorem ipsum     0
# 4:  3          3        1         dolor sit ame     0
# 5:  3          3        2           sanctus est     0
# 6:  3          3        3    invidunt ut labore     1
# 7:  4          4        1        nonumy eirmod      1
# 8:  4          4        2 consetetur sadipscing     0
# 9:  4          4        3              ea rebum     0
#10:  4          4        4   dolores et ea rebum     1

If needed, you can remove the variable column with ans[, variable := NULL].

After renaming the columns slightly, you can use tidyr::pivot_longer:

# backslashes must be doubled inside an R string literal
names(data) <- sub('comment\\d+_|\\d+', '', names(data))

tidyr::pivot_longer(data, 
                    cols = -c(id, n_comments), 
                    names_to = '.value',
                    names_pattern = '(comment|quote)', 
                    values_drop_na = TRUE)

#      id n_comments comment                 quote
#   <dbl>      <dbl> <chr>                   <dbl>
# 1     1          2 "consetetur sadipscing"     1
# 2     1          2 "clita kasd gubergren"      0
# 3     2          1 "Lorem ipsum"               0
# 4     3          3 "dolor sit ame"             0
# 5     3          3 "sanctus est"               0
# 6     3          3 "invidunt ut labore"        1
# 7     4          4 "nonumy eirmod "            1
# 8     4          4 "consetetur sadipscing"     0
# 9     4          4 "ea rebum"                  0
#10     4          4 "dolores et ea rebum"       1
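If the comment index should be kept as a column, a variant of the same idea might look like the sketch below. It is not part of the answer above: the data frame `small`, the `_comment` suffix and the `number` column are assumptions made for illustration, and `rename_with()` and `matches()` come from dplyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-product subset with the same column layout as the example
small <- data.frame(id = c(1, 2), n_comments = c(2, 1),
                    comment1 = c("a", "c"), comment1_quote = c(1, 0),
                    comment2 = c("b", NA), comment2_quote = c(0, NA))

long <- small %>%
  # Give the text columns a suffix (comment1 -> comment1_comment) so that
  # every measured column has the form comment<number>_<kind>
  rename_with(~ paste0(.x, "_comment"), matches("^comment\\d+$")) %>%
  pivot_longer(cols = starts_with("comment"),
               # first capture group -> "number" column,
               # second capture group -> name of the output value column
               names_to = c("number", ".value"),
               names_pattern = "comment(\\d+)_(comment|quote)",
               values_drop_na = TRUE)

print(long)
```

Because both kinds of column now share one naming scheme, `.value` can split them into a character `comment` column and a numeric `quote` column without combining types. The `number` column comes back as character and can be converted with `as.integer()` if needed.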