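Data wrangling from wide to long format with multiple repeating columns of different types
A dataset describes multiple repeated measurements for multiple clusters, with each measurement-cluster pair contained in a single column. I would like to wrangle the data into a long(er) format, such that one column provides information on the cluster, but each measurement remains in its own column.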
library(stringr)  # assumption: `fruit` is the character vector shipped with stringr
# Current format
df_wider <- data.frame(
id = 1:5,
fruit_1 = sample(fruit, size = 5),
date_1 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_1 = sample(1:100, 5),
fruit_2 = sample(fruit, size = 5),
date_2 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_2 = sample(1:100, 5),
fruit_3 = sample(fruit, size = 5),
date_3 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_3 = sample(1:100, 5)
)
# Desired format
df_longer <- data.frame(
id = rep(1:5, each = 3),
cluster = rep(1:3, 5),
fruit = sample(fruit, size = 15),
date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 15),
number = sample(1:100, 15)
)
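The real dataset contains up to 25 clusters, with more than 100 measurements per cluster. I tried iterating over each measurement with tidyr::gather() and tidyr::pivot_longer(), but the intermediate data frames grew exponentially in size. Because the values have different classes, doing this in a single tidyr::pivot_longer() step does not seem possible. I can't think of a way to vectorise this at scale.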
You can do it like this:
library(tidyr)
library(dplyr)
df_wider %>% pivot_longer(-id,
names_pattern = "(.*)_(\\d+)",
names_to = c(".value", "cluster"))
# A tibble: 15 x 5
id cluster fruit date number
<int> <chr> <fct> <date> <int>
1 1 1 olive 2020-04-21 50
2 1 2 elderberry 2020-02-23 59
3 1 3 cherimoya 2020-03-07 9
4 2 1 jujube 2020-03-22 88
5 2 2 mandarine 2020-03-06 45
6 2 3 grape 2020-04-23 78
7 3 1 nut 2020-01-26 53
8 3 2 cantaloupe 2020-01-27 70
9 3 3 durian 2020-02-15 39
10 4 1 chili pepper 2020-03-17 60
11 4 2 raisin 2020-04-14 20
12 4 3 cloudberry 2020-03-11 4
13 5 1 honeydew 2020-01-04 81
14 5 2 lime 2020-03-23 53
15 5 3 ugli fruit 2020-01-13 26
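With up to 25 clusters, the suffix pattern has to accept more than one digit, and by default the `cluster` column comes back as character. The following is a minimal sketch of one way to handle both, assuming tidyr 1.1.0 or later (for the `names_transform` argument). The data frame `toy` and its values are hypothetical and only stand in for the real data.

```r
library(tidyr)

# Hypothetical toy data: two ids, clusters 1 and 10, three column types
toy <- data.frame(
  id = 1:2,
  fruit_1 = c("apple", "pear"),
  date_1 = as.Date(c("2020-01-01", "2020-01-02")),
  number_1 = c(10L, 20L),
  fruit_10 = c("kiwi", "lime"),
  date_10 = as.Date(c("2020-02-01", "2020-02-02")),
  number_10 = c(30L, 40L),
  stringsAsFactors = FALSE
)

long <- pivot_longer(
  toy,
  cols = -id,
  names_pattern = "(.*)_(\\d+)",                # \\d+ also matches suffixes such as 10
  names_to = c(".value", "cluster"),            # .value keeps fruit/date/number as columns
  names_transform = list(cluster = as.integer)  # store cluster as integer, not character
)

str(long)
```

Because `.value` sends each measurement to its own output column, the fruit, date and number values are never combined into one vector, so their differing classes should not be a problem.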
We can use melt from data.table:
library(data.table)
melt(setDT(df_wider), measure = patterns('^fruit', '^date', '^number' ),
value.name = c('fruit', 'date', 'number'), variable.name = 'cluster')
# id cluster fruit date number
# 1: 1 1 date 2020-04-16 17
# 2: 2 1 quince 2020-01-27 7
# 3: 3 1 coconut 2020-04-19 33
# 4: 4 1 pomegranate 2020-02-27 55
# 5: 5 1 persimmon 2020-02-20 62
# 6: 1 2 kiwi fruit 2020-01-14 100
# 7: 2 2 cranberry 2020-03-15 97
# 8: 3 2 cucumber 2020-03-16 5
# 9: 4 2 persimmon 2020-03-06 81
#10: 5 2 date 2020-04-17 30
#11: 1 3 apricot 2020-04-13 86
#12: 2 3 banana 2020-04-17 42
#13: 3 3 bilberry 2020-02-23 88
#14: 4 3 blackcurrant 2020-02-25 10
#15: 5 3 raisin 2020-02-09 87
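Two details of the `melt()` approach may be worth checking on real data: `setDT()` converts `df_wider` in place, and with `patterns()` the `cluster` column is a factor that records the position of each matched column within its group, not the suffix in the column name. A minimal sketch, using a hypothetical `toy` table with only two of the measurement types:

```r
library(data.table)

# Hypothetical toy data, built as a data.table so no in-place conversion is needed
toy <- data.table(
  id = 1:2,
  fruit_1 = c("apple", "pear"),
  number_1 = c(10L, 20L),
  fruit_2 = c("kiwi", "lime"),
  number_2 = c(30L, 40L)
)

long_dt <- melt(
  toy,
  measure.vars = patterns("^fruit", "^number"),  # one pattern per output value column
  value.name = c("fruit", "number"),
  variable.name = "cluster"
)

# cluster is a factor of column positions (1, 2, ...); convert it to integer
long_dt[, cluster := as.integer(cluster)]

# Order by id, then cluster, to match the layout of the desired format
setorder(long_dt, id, cluster)
print(long_dt)
```

The positions only agree with the name suffixes if the columns of every group appear in the same cluster order, which holds for the example data in the question.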