Convert duplicated data from long to wide while preserving key-value pairs in R

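I need to create a dataset with unique records (based on the "id" variable) from a dataset that contains duplicates, where the number of duplicates per "id" varies widely. I created a dummy dataset below (with 5 key-value pairs) that captures the essence of the real dataset.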

library(dplyr)
library(tidyr) # separate()

df <- data.frame(
  id = c(1, 2, 3, 3, 2, 3),
  key = c(NA, "UJD02 JFF00", "UJD05 TPX10 DV071", "KFC10 DR036 UGC12 UEN05 XXA00", "UJD05", "DR036 UJD05 JFF00 TPX10"),
  value1 = c(23, NA, 45, 67, 11, 1),
  value2 = c(45, NA, 23, NA, 25, 78),
  value3 = c(89, NA, 103, 6700, 89, 50),
  value4 = c(786, 670, 987, 67, 12, 14),
  value5 = c(10, NA, 29, 15, 51, 3)
)

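The real dataset contains both unique and duplicated observations, as determined by the "id" variable, and an "id" can appear twice or more than twice. The "id" variable indicates which observations are to be converted from long to wide, so that I end up with a dataset of unique "id" records only, i.e. no duplicated "id". The "key" variable is a composite variable that can hold anything from a missing value (i.e. NA) up to 30 tab-separated values. There are exactly 30 "valueX" variables (i.e. value1 - value30). Within an observation (regardless of duplication status), each key is coupled with a value, e.g. key1-value1, key2-value2 ... key30-value30. The duplicates indicate that the data were captured for the same client at different time points, so the key-value pairs of each duplicate must not get mixed up during the long-to-wide conversion.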

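The first thing I did was split the composite "key" variable into 30 variables (key1 - key30), which gives a dataset like "df2" below (5 keys in the dummy data).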

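# fill = "right" pads rows that have fewer than 5 keys with NA instead of warning
df2 <- df %>% separate(key, into = c("key1", "key2", "key3", "key4", "key5"), fill = "right")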
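With 30 keys in the real data, the column names probably should not be typed by hand. The following is a minimal sketch of the same split with generated names, assuming space-separated keys as in the dummy data; `toy`, `n_keys`, and `key_cols` are hypothetical names used only for illustration.

```r
library(tidyr)

# Hypothetical toy data: one composite key column, space-separated
toy <- data.frame(
  id  = c(1, 2, 2),
  key = c(NA, "UJD02 JFF00", "UJD05"),
  stringsAsFactors = FALSE
)

n_keys   <- 5                               # assumption: would be 30 for the real data
key_cols <- paste0("key", seq_len(n_keys))  # "key1", "key2", ..., "key5"

# Split the composite key; short or missing keys are padded with NA on the right
toy_split <- separate(toy, key, into = key_cols, sep = " ", fill = "right")
```

If the real keys are tab-separated, as described above, `sep = "\t"` would be the corresponding choice.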

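After that, though, I am not sure how to de-duplicate on "id" without mixing up the key-value pairs. Perhaps I need to dynamically rename (by numbering) the key-value pairs to indicate which duplicate they came from? I'm not sure.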
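To make the numbering idea concrete, here is a rough sketch of how each record of an "id" could be assigned an occurrence number and the range of wide columns it would occupy. It is only an illustration under the assumption of 5 pairs per record; `toy`, `n_pairs`, `occurrence`, `first_slot`, and `last_slot` are hypothetical names.

```r
library(dplyr)

n_pairs <- 5  # assumption: pairs per observation (30 in the real data)

# Hypothetical toy data: only the id column matters for the numbering
toy <- data.frame(id = c(1, 2, 3, 3, 2, 3))

numbered <- toy %>%
  group_by(id) %>%
  mutate(
    occurrence = row_number(),                     # 1st, 2nd, 3rd record of this id
    first_slot = (occurrence - 1) * n_pairs + 1,   # first key/value index this record fills
    last_slot  = occurrence * n_pairs              # last key/value index this record fills
  ) %>%
  ungroup()
```

Under this scheme the third record of id 3 would fill key11-value11 through key15-value15.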

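So what I need help with is converting the long dataset (df2) into a wide dataset with one row per "id" (only 3 rows, id 1 - 3) in which the key-value pairs stay intact, i.e. it remains clear which key goes with which value. For example, in the dummy dataset id = 3 occurs 3 times, so I would end up with key1-value1 ... key15-value15.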

Any help is greatly appreciated!

Edited to provide example output

Below is the desired output:

library(wrapr)

resultX <- wrapr::build_frame(
  "id"  , "key1"       , "key2"       , "key3"       , "key4", "key5", "key6"       , "key7"       , "key8"       , "key9"       , "key10"      , "key11"      , "key12"      , "key13"      , "key14"      , "key15", "value1", "value2", "value3", "value4", "value5", "value6", "value7", "value8", "value9", "value10", "value11", "value12", "value13", "value14", "value15" |
    1   , NA_character_, NA_character_, NA_character_, NA    , NA    , NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA     , 23      , 45      , 89      , 786     , 10      , NA_real_, NA_real_, NA_real_, NA_real_, NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_  |
    2   , "UJD02"      , "JFF00"      , NA_character_, NA    , NA    , "UJD05"      , NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA     , NA_real_, NA_real_, NA_real_, 670     , NA_real_, 11      , 25      , 89      , 12      , 51       , NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_  |
    3   , "UJD05"      , "TPX10"      , "DV071"      , NA    , NA    , "KFC10"      , "DR036"      , "UGC12"      , "UEN05"      , "XXA00"      , "DR036"      , "UJD05"      , "JFF00"      , "TPX10"      , NA     , 45      , 23      , 103     , 987     , 29      , 67      , NA_real_, 6700    , 67      , 15       , 1        , 78       , 50       , 14       , 3         )

Two pivots:

library(dplyr)
library(tidyr) # pivot_*
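df2 %>%
  # keyN/valueN -> one row per pair, with columns "key", "value" and the pair number "iter"
  pivot_longer(-id, names_pattern = "(.*?)([0-9]+)", names_to = c(".value", "iter")) %>%
  group_by(id) %>%
  # renumber the pairs consecutively within each id (1-5, 6-10, 11-15, ...)
  mutate(iter = row_number()) %>%
  # id_cols is named explicitly; recent tidyr versions deprecate passing it by position
  pivot_wider(id_cols = id, names_from = "iter", values_from = c("key", "value"), names_sep = "") %>%
  ungroup()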
# # A tibble: 3 x 31
#      id key1  key2  key3  key4  key5  key6  key7  key8  key9  key10 key11 key12 key13 key14 key15 value1 value2 value3 value4 value5 value6 value7 value8 value9 value10 value11 value12 value13 value14 value15
#   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
# 1     1 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>      23     45     89    786     10     NA     NA     NA     NA      NA      NA      NA      NA      NA      NA
# 2     2 UJD02 JFF00 <NA>  <NA>  <NA>  UJD05 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>      NA     NA     NA    670     NA     11     25     89     12      51      NA      NA      NA      NA      NA
# 3     3 UJD05 TPX10 DV071 <NA>  <NA>  KFC10 DR036 UGC12 UEN05 XXA00 DR036 UJD05 JFF00 TPX10 <NA>      45     23    103    987     29     67     NA   6700     67      15       1      78      50      14       3
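Since the concern is that key-value pairs must not get mixed up, a round-trip check may be useful. The sketch below applies the same two pivots to a small hypothetical data frame (`toy`, with 2 pairs per row instead of 30), pivots the wide result back to long, and confirms that exactly the original (id, key, value) triples come back. All object names here are illustrative assumptions.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2 key-value pairs per row, id 2 is duplicated
toy <- data.frame(
  id     = c(1, 2, 2),
  key1   = c("A", "B", "D"),
  key2   = c(NA, "C", NA),
  value1 = c(10, 20, 40),
  value2 = c(NA, 30, NA),
  stringsAsFactors = FALSE
)

# First pivot: one row per key-value pair
long <- toy %>%
  pivot_longer(-id, names_pattern = "(.*?)([0-9]+)", names_to = c(".value", "iter"))

# Second pivot: renumber pairs within each id, then spread to one row per id
wide <- long %>%
  group_by(id) %>%
  mutate(iter = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = id, names_from = iter, values_from = c(key, value), names_sep = "")

# Round trip: wide -> long again, dropping pairs that are entirely NA
pairs_after <- wide %>%
  pivot_longer(-id, names_pattern = "(.*?)([0-9]+)", names_to = c(".value", "iter")) %>%
  filter(!is.na(key) | !is.na(value)) %>%
  select(id, key, value)

pairs_before <- long %>%
  filter(!is.na(key) | !is.na(value)) %>%
  select(id, key, value)

# No pair should exist on one side but not the other
lost  <- anti_join(pairs_before, pairs_after, by = c("id", "key", "value"))
extra <- anti_join(pairs_after, pairs_before, by = c("id", "key", "value"))
stopifnot(nrow(lost) == 0, nrow(extra) == 0)
```

The check compares pairs as a set, so it does not depend on the order in which the ids appear in the input.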