Convert duplicated data from long to wide while preserving key-value pairs in R
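I need to create a unique dataset (based on the "id" variable) from a dataset with duplicates, where the number of duplicates per id varies a lot.
I have created a dummy dataset below (with 5 key-value pairs) that captures the essence of the real dataset.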
library(dplyr)
library(tidyr)  # separate() used further down comes from tidyr

df <- data.frame(
  id = c(1, 2, 3, 3, 2, 3),
  key = c(NA, "UJD02 JFF00", "UJD05 TPX10 DV071", "KFC10 DR036 UGC12 UEN05 XXA00", "UJD05", "DR036 UJD05 JFF00 TPX10"),
  value1 = c(23, NA, 45, 67, 11, 1),
  value2 = c(45, NA, 23, NA, 25, 78),
  value3 = c(89, NA, 103, 6700, 89, 50),
  value4 = c(786, 670, 987, 67, 12, 14),
  value5 = c(10, NA, 29, 15, 51, 3)
)
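The real dataset contains both unique and duplicated observations, as determined by the "id" variable, and the number of duplicates ranges from two to many. The "id" variable indicates which observations should be converted from long to wide, so that I end up with a dataset of unique "id" records only, i.e. no duplicated ids.
The "key" variable is a composite variable that can hold anything from missing (i.e. NA) up to 30 tab-separated values. There are exactly 30 "valueX" variables (i.e. value1 - value30).
Within an observation (regardless of duplicate status), each key is coupled with a value, e.g. key1 with value1, key2 with value2 ... key30 with value30.
The duplicates indicate that data were captured for the same client at different time points, so the respective key-value pairs of each duplicate must not get mixed up during the long-to-wide conversion.
The first thing I did was split the composite "key" variable into 30 variables (key1 - key30), which gives a dataset similar to "df2".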
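# separate() is from tidyr; fill = "right" pads rows that have fewer keys with NA instead of warning
df2 <- df %>% separate(key, c("key1", "key2", "key3", "key4", "key5"), fill = "right")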
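For the real data, where the keys are tab-separated and there can be up to 30 of them, the same split could probably be written without typing every column name. The snippet below is only a sketch: `real_like`, `n_keys` and `key_cols` are made-up names, and the two-row data frame is a stand-in for the real dataset.

```r
library(tidyr)

# Hypothetical miniature of the real data: keys separated by tabs
real_like <- data.frame(
  id  = c(1, 2),
  key = c(NA, "UJD02\tJFF00"),
  stringsAsFactors = FALSE
)

n_keys   <- 30                              # assumed maximum number of keys per row
key_cols <- paste0("key", seq_len(n_keys))  # "key1", "key2", ..., "key30"

real_split <- separate(
  real_like, key,
  into = key_cols,
  sep  = "\t",      # split on tabs only
  fill = "right"    # rows with fewer than 30 keys are padded with NA
)
```

Setting `sep` explicitly avoids relying on the default, which splits on any run of non-alphanumeric characters and would also break up keys that happen to contain punctuation.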
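But after that I'm not sure how to deduplicate based on "id" while avoiding mixing up the key-value pairs. Maybe I need to dynamically rename (by numbering) the key-value pairs to indicate the duplicates? I'm not sure.
So where I need help is in how to convert the long dataset (df2) into a wide dataset with one row per "id" (only 3 rows, ids 1 - 3) with the key-value pairs kept intact, i.e. so that it is clear which key goes with which value. E.g. in the dummy dataset, id = 3 appears 3 times, so I would end up with key1-value1 ... key15-value15.
Any help is much appreciated!
Edited to provide sample output
Below is the desired output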
library(wrapr)
resultX <- wrapr::build_frame(
"id" , "key1" , "key2" , "key3" , "key4", "key5", "key6" , "key7" , "key8" , "key9" , "key10" , "key11" , "key12" , "key13" , "key14" , "key15", "value1", "value2", "value3", "value4", "value5", "value6", "value7", "value8", "value9", "value10", "value11", "value12", "value13", "value14", "value15" |
1 , NA_character_, NA_character_, NA_character_, NA , NA , NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA , 23 , 45 , 89 , 786 , 10 , NA_real_, NA_real_, NA_real_, NA_real_, NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_ |
2 , "UJD02" , "JFF00" , NA_character_, NA , NA , "UJD05" , NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA , NA_real_, NA_real_, NA_real_, 670 , NA_real_, 11 , 25 , 89 , 12 , 51 , NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_ |
3 , "UJD05" , "TPX10" , "DV071" , NA , NA , "KFC10" , "DR036" , "UGC12" , "UEN05" , "XXA00" , "DR036" , "UJD05" , "JFF00" , "TPX10" , NA , 45 , 23 , 103 , 987 , 29 , 67 , NA_real_, 6700 , 67 , 15 , 1 , 78 , 50 , 14 , 3 )
Two pivots:
library(dplyr)
library(tidyr) # pivot_*
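df2 %>%
  # first pivot: one row per key/value pair, with columns id, iter, key and value
  pivot_longer(-id, names_pattern = "(.*?)([0-9]+)", names_to = c(".value", "iter")) %>%
  group_by(id) %>%
  # renumber the pairs consecutively within each id
  mutate(iter = row_number()) %>%
  # second pivot: spread the renumbered pairs back out, one row per id
  pivot_wider(id_cols = id, names_from = "iter", values_from = c("key", "value"), names_sep = "") %>%
  ungroup()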
# # A tibble: 3 x 31
# id key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 value1 value2 value3 value4 value5 value6 value7 value8 value9 value10 value11 value12 value13 value14 value15
# <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 23 45 89 786 10 NA NA NA NA NA NA NA NA NA NA
# 2 2 UJD02 JFF00 <NA> <NA> <NA> UJD05 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA 670 NA 11 25 89 12 51 NA NA NA NA NA
# 3 3 UJD05 TPX10 DV071 <NA> <NA> KFC10 DR036 UGC12 UEN05 XXA00 DR036 UJD05 JFF00 TPX10 <NA> 45 23 103 987 29 67 NA 6700 67 15 1 78 50 14 3
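To see why the pairs stay together, it may help to look at the intermediate long table on a smaller case. The sketch below uses a made-up data frame `toy` (two key/value pairs, id 2 appearing twice); `toy_long` and `toy_wide` are illustrative names only.

```r
library(dplyr)
library(tidyr)

# Made-up example: two key/value pairs per row, id 2 is duplicated
toy <- data.frame(
  id     = c(1, 2, 2),
  key1   = c("A", "B", "D"),
  key2   = c(NA, "C", NA),
  value1 = c(10, 20, 40),
  value2 = c(NA, 30, NA),
  stringsAsFactors = FALSE
)

# First pivot: keyN and valueN land on the same row, so each pair
# travels together from here on. iter is then renumbered within id.
toy_long <- toy %>%
  pivot_longer(-id, names_pattern = "(.*?)([0-9]+)",
               names_to = c(".value", "iter")) %>%
  group_by(id) %>%
  mutate(iter = row_number()) %>%
  ungroup()

# Second pivot: the new iter becomes the suffix of both key and value
toy_wide <- toy_long %>%
  pivot_wider(id_cols = id, names_from = iter,
              values_from = c(key, value), names_sep = "")
```

Here id 2 ends up with pairs numbered 1 to 4, so its second row's `D`/`40` becomes `key3`/`value3`. Note that the numbering follows the row order within each id, so if the duplicates should appear in a particular order (for example by time point), the data would need to be sorted that way before the first pivot.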