Pivot data frame to longer twice in R then back to original shape
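Say I want to know which of four basketball players is the best. I set up a small tournament in which two players play one-on-one, and I record a set of stats for each match: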
library(dplyr)  # %>%, filter, arrange, group_by, mutate
library(tidyr)  # pivot_longer, pivot_wider

#rm(list=ls())
set.seed(1234)
# some made up scores from my tournament
df <- data.frame(
  player1 = c("a", "a", "b", "c", "d", "d"),
  player2 = c("b", "c", "d", "b", "a", "c"),
  date = c("2021-01-01", "2021-01-02", "2021-01-04", "2021-01-05", "2021-01-06", "2021-01-08"),
  p1_dunks = sample(c(4:11), 6, replace = TRUE),
  p2_dunks = sample(c(3:12), 6, replace = TRUE),
  p1_blocks = sample(c(8:10), 6, replace = TRUE),
  p2_blocks = sample(c(10:12), 6, replace = TRUE),
  p1_threepointers = sample(c(2:7), 6, replace = TRUE),
  p2_threepointers = sample(c(1:5), 6, replace = TRUE)
)
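To work out how well a player is doing at any point in the tournament, I can pivot the data to long format twice and replace the count for each stat with its cumulative sum: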
# cast to long and get cumulative stats per player
melted_df <- df %>%
  pivot_longer(cols = starts_with(c("p1", "p2")), names_to = "stat", values_to = "number") %>%
  pivot_longer(cols = starts_with("player"), names_to = "player", values_to = "name") %>%
  # keep only rows where the stat prefix matches the player slot
  filter(
    (player == "player1" & grepl("^p1", stat)) |
      (player == "player2" & grepl("^p2", stat))
  ) %>%
  arrange(date) %>%
  group_by(player, stat) %>%
  mutate(number = cumsum(number))
I can then query this very easily:
melted_df %>%
  filter(date < "2021-01-05") %>%
  filter(!duplicated(name)) %>%
  filter(grepl("dunks$", stat))
But for my use case I need to cast this long-format data back to its original form (player1, player2, then the stats for each of player1 and player2). I can try:
# try to cast back to original format
back_to_wider_df <- melted_df %>%
  pivot_wider(names_from = "player", values_from = "name") %>%
  pivot_wider(names_from = "stat", values_from = "number")
But this instead gives a data frame in which each match is 'offset' across two rows, each half full of NA values:
> head(back_to_wider_df)
# A tibble: 6 × 9
  date       player1 player2 p1_dunks p1_blocks p1_threepointers p2_dunks p2_blocks p2_threepointers
  <chr>      <chr>   <chr>      <int>     <int>            <int>    <int>     <int>            <int>
1 2021-01-01 a       NA             7         9                6       NA        NA               NA
2 2021-01-01 NA      b             NA        NA               NA       11        11                4
3 2021-01-02 a       NA            18        18                9       NA        NA               NA
4 2021-01-02 NA      c             NA        NA               NA       18        22                8
5 2021-01-04 b       NA            23        27               15       NA        NA               NA
6 2021-01-04 NA      d             NA        NA               NA       26        32               11
Is there a simple way to fix this back to the original shape, so that the first three rows look like the following?
> df
  date       player1 player2 p1_dunks p1_blocks p1_threepointers p2_dunks p2_blocks p2_threepointers
1 2021-01-01 a       b              7         9                6       11        11                4
2 2021-01-02 a       c             18        18                9       18        22                8
3 2021-01-04 b       d             23        27               15       26        32               11
Thanks,
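One way is to use the lead function to shift the player2 columns up one row, so that the two halves of each match line up, and then drop the rows that still contain NA: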
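library(dplyr)

# df here is the half-filled wide data frame printed in the question
# (the six rows of back_to_wider_df), not the original input
df %>%
  mutate(across(c(player2, p2_dunks, p2_blocks, p2_threepointers), lead)) %>%
  na.omit()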
  date       player1 player2 p1_dunks p1_blocks p1_threepointers p2_dunks p2_blocks p2_threepointers
1 2021-01-01 a       b              7         9                6       11        11                4
3 2021-01-02 a       c             18        18                9       18        22                8
5 2021-01-04 b       d             23        27               15       26        32               11
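
The lead approach relies on the two half-filled rows of each match being adjacent and in player1-then-player2 order. A possible alternative that does not depend on row order is to collapse the rows of each match into one by taking the first non-NA value per column. The sketch below assumes at most one match per date, so that date alone identifies a match. half_filled is a made-up stand-in for the first four rows of the half-filled frame (dunks only), and first_non_na is a hypothetical helper written for this example, not a dplyr function.

```r
library(dplyr)

# Made-up stand-in for the half-filled wide data frame (dunks only)
half_filled <- data.frame(
  date     = c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"),
  player1  = c("a", NA, "a", NA),
  player2  = c(NA, "b", NA, "c"),
  p1_dunks = c(7L, NA, 18L, NA),
  p2_dunks = c(NA, 11L, NA, 18L)
)

# Hypothetical helper: first non-missing value of a vector (NA if none)
first_non_na <- function(x) x[!is.na(x)][1]

# Assumption: one match per date, so grouping by date groups by match
collapsed <- half_filled %>%
  group_by(date) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

print(collapsed)
```

If several matches could share a date, a separate match identifier would have to be carried through the pivots and used as the grouping key instead.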