Pivot a dataframe without aggregation
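The goal is to pivot a data frame (representing a one-to-many relationship: one computer to many monitors) into a wider representation.
The data frame (abridged) could be: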
library(tidyverse)
df <- tibble::tribble(
~CPU_ID, ~ID, ~CONFIGITEM_NUMBER, ~NAME, ~AllocationDate, ~Model, ~Vendor,
182434, 195251, 101142000825, "COMP000572", "2014-04-10", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
182434, 405022, 1142027261, "COMP030500", "2020-12-02", "V173A", "ACER",
182436, 183607, 101142000008, "COMP000008", "2014-04-18", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
182437, 228469, 1142006861, "COMP020117", "2018-03-05", "S22C45KBW", "Samsung",
182437, 341806, 1142019822, "COMP050244", "2019-01-09", "L1940T", "HP",
182438, 205930, 101142001009, "COMP050002", "2019-05-20", "S22C45KBW", "Samsung",
182439, 240546, 1142008622, "COMP050131", "2016-09-16", "SAMSUNG SYNCMASTER 943", "SAMSUNG",
182462, 184114, 101142000515, "COMP000515", "2019-08-27", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
182463, 184113, 101142000514, "COMP000514", "2019-08-28", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
182464, 184106, 101142000507, "COMP000507", "2019-08-27", "HP ELITE DISPLAY E-231", "Hewlett-Packard"
)
I can pivot it correctly with:
df %>%
  group_by(CPU_ID) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  rename_with(~ paste0("monitor1_", .), .cols = !CPU_ID) %>%
  left_join(
    df %>%
      group_by(CPU_ID) %>%
      filter(row_number() == 2) %>%
      ungroup() %>%
      rename_with(~ paste0("monitor2_", .), .cols = !CPU_ID),
    by = "CPU_ID"
  )
#> # A tibble: 8 x 13
#> CPU_ID monitor1_ID monitor1_CONFIG~ monitor1_NAME monitor1_Alloca~ monitor1_Model monitor1_Vendor
#> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 182434 195251 101142000825 COMP000572 2014-04-10 HP ELITE DISP~ Hewlett-Packard
#> 2 182436 183607 101142000008 COMP000008 2014-04-18 HP ELITE DISP~ Hewlett-Packard
#> 3 182437 228469 1142006861 COMP020117 2018-03-05 S22C45KBW Samsung
#> 4 182438 205930 101142001009 COMP050002 2019-05-20 S22C45KBW Samsung
#> 5 182439 240546 1142008622 COMP050131 2016-09-16 SAMSUNG SYNCM~ SAMSUNG
#> 6 182462 184114 101142000515 COMP000515 2019-08-27 HP ELITE DISP~ Hewlett-Packard
#> 7 182463 184113 101142000514 COMP000514 2019-08-28 HP ELITE DISP~ Hewlett-Packard
#> 8 182464 184106 101142000507 COMP000507 2019-08-27 HP ELITE DISP~ Hewlett-Packard
#> # ... with 6 more variables: monitor2_ID <dbl>, monitor2_CONFIGITEM_NUMBER <dbl>,
#> # monitor2_NAME <chr>, monitor2_AllocationDate <chr>, monitor2_Model <chr>, monitor2_Vendor <chr>
But in the real data frame some computers have more than two monitors, so this approach would need one left_join per additional monitor.
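For reference, the repeated joins could be generated rather than written by hand. The following is only a sketch of that idea, assuming purrr is available; `n_max`, `pieces` and `wide` are illustrative names, and the data is a three-row excerpt of the table above.

```r
library(dplyr)
library(purrr)

# Three-row excerpt of the question's data (two columns dropped for brevity)
df <- tibble::tribble(
  ~CPU_ID, ~ID,    ~NAME,
  182434,  195251, "COMP000572",
  182434,  405022, "COMP030500",
  182436,  183607, "COMP000008"
)

# Largest number of monitors attached to any one computer
n_max <- max(count(df, CPU_ID)$n)

# One data frame per monitor position, columns prefixed with "monitor<i>_"
pieces <- map(seq_len(n_max), function(i) {
  df %>%
    group_by(CPU_ID) %>%
    filter(row_number() == i) %>%
    ungroup() %>%
    rename_with(~ paste0("monitor", i, "_", .x), .cols = !CPU_ID)
})

# Chain the left joins; the first piece holds every CPU_ID
wide <- reduce(pieces, left_join, by = "CPU_ID")
```

This keeps the original column types and the per-monitor column order, but it still scans the data once per monitor position.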
I tried to write an alternative, such as:
df %>%
  group_by(CPU_ID) %>%
  mutate(monitor_n = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = CPU_ID,
    names_from = monitor_n,
    values_from = !CPU_ID
  ) %>%
  # values_from = !CPU_ID also spreads monitor_n itself, so drop those columns
  select(-starts_with("monitor_n")) %>%
  # turn "<variable>_<n>" into "monitor<n>_<variable>"
  rename_with(function(colname)
    str_replace(colname, "^(.*)_(\\d)$", "monitor\\2_\\1"),
    .cols = !CPU_ID)
#> # A tibble: 8 x 13
#> CPU_ID monitor1_ID monitor2_ID monitor1_CONFIG~ monitor2_CONFIG~ monitor1_NAME monitor2_NAME
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 182434 195251 405022 101142000825 1142027261 COMP000572 COMP030500
#> 2 182436 183607 NA 101142000008 NA COMP000008 <NA>
#> 3 182437 228469 341806 1142006861 1142019822 COMP020117 COMP050244
#> 4 182438 205930 NA 101142001009 NA COMP050002 <NA>
#> 5 182439 240546 NA 1142008622 NA COMP050131 <NA>
#> 6 182462 184114 NA 101142000515 NA COMP000515 <NA>
#> 7 182463 184113 NA 101142000514 NA COMP000514 <NA>
#> 8 182464 184106 NA 101142000507 NA COMP000507 <NA>
#> # ... with 6 more variables: monitor1_AllocationDate <chr>, monitor2_AllocationDate <chr>,
#> # monitor1_Model <chr>, monitor2_Model <chr>, monitor1_Vendor <chr>, monitor2_Vendor <chr>
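But I need to keep the columns in the same order as in the original data frame.
Could you recommend a simpler (more concise) alternative?
Maybe something like this?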
df %>%
  group_by(CPU_ID) %>%
  mutate(rowno = row_number()) %>%
  ungroup() %>%
  gather(var, val, -CPU_ID, -rowno) %>%
  mutate(newcolname = paste0("monitor", rowno, "_", var)) %>%
  select(-c(var, rowno)) %>%
  pivot_wider(names_from = newcolname, values_from = val)
# A tibble: 8 x 13
CPU_ID monitor1_ID monitor2_ID monitor1_CONFIG~ monitor2_CONFIG~ monitor1_NAME monitor2_NAME monitor1_Alloca~ monitor2_Alloca~ monitor1_Model
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 182434 195251 405022 101142000825 1142027261 COMP000572 COMP030500 2014-04-10 2020-12-02 HP ELITE DISP~
2 182436 183607 NA 101142000008 NA COMP000008 NA 2014-04-18 NA HP ELITE DISP~
3 182437 228469 341806 1142006861 1142019822 COMP020117 COMP050244 2018-03-05 2019-01-09 S22C45KBW
4 182438 205930 NA 101142001009 NA COMP050002 NA 2019-05-20 NA S22C45KBW
5 182439 240546 NA 1142008622 NA COMP050131 NA 2016-09-16 NA SAMSUNG SYNCM~
6 182462 184114 NA 101142000515 NA COMP000515 NA 2019-08-27 NA HP ELITE DISP~
7 182463 184113 NA 101142000514 NA COMP000514 NA 2019-08-28 NA HP ELITE DISP~
8 182464 184106 NA 101142000507 NA COMP000507 NA 2019-08-27 NA HP ELITE DISP~
# ... with 3 more variables: monitor2_Model <chr>, monitor1_Vendor <chr>, monitor2_Vendor <chr>
You can also use pivot_longer, but it changes the order of the columns (this can be corrected if needed):
df %>%
  group_by(CPU_ID) %>%
  mutate(rowno = row_number()) %>%
  ungroup() %>%
  pivot_longer(-c(CPU_ID, rowno), names_to = "var", values_to = "val",
               values_transform = list(val = as.character)) %>%
  mutate(newcolname = paste0("monitor", rowno, "_", var)) %>%
  select(-c(var, rowno)) %>%
  pivot_wider(names_from = newcolname, values_from = val)
# A tibble: 8 x 13
CPU_ID monitor1_ID monitor1_CONFIG~ monitor1_NAME monitor1_Alloca~ monitor1_Model monitor1_Vendor monitor2_ID monitor2_CONFIG~ monitor2_NAME
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 182434 195251 101142000825 COMP000572 2014-04-10 HP ELITE DISP~ Hewlett-Packard 405022 1142027261 COMP030500
2 182436 183607 101142000008 COMP000008 2014-04-18 HP ELITE DISP~ Hewlett-Packard NA NA NA
3 182437 228469 1142006861 COMP020117 2018-03-05 S22C45KBW Samsung 341806 1142019822 COMP050244
4 182438 205930 101142001009 COMP050002 2019-05-20 S22C45KBW Samsung NA NA NA
5 182439 240546 1142008622 COMP050131 2016-09-16 SAMSUNG SYNCM~ SAMSUNG NA NA NA
6 182462 184114 101142000515 COMP000515 2019-08-27 HP ELITE DISP~ Hewlett-Packard NA NA NA
7 182463 184113 101142000514 COMP000514 2019-08-28 HP ELITE DISP~ Hewlett-Packard NA NA NA
8 182464 184106 101142000507 COMP000507 2019-08-27 HP ELITE DISP~ Hewlett-Packard NA NA NA
# ... with 3 more variables: monitor2_AllocationDate <chr>, monitor2_Model <chr>, monitor2_Vendor <chr>
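If a particular column order is required, one option is to build the wanted names explicitly and select them after pivoting. This is a rough sketch on a three-row excerpt of the data; `vars`, `n_max`, `wanted` and `ordered` are illustrative names, not part of the answers here.

```r
library(dplyr)
library(tidyr)

# Three-row excerpt of the question's data
df <- tibble::tribble(
  ~CPU_ID, ~ID,    ~NAME,        ~Vendor,
  182434,  195251, "COMP000572", "Hewlett-Packard",
  182434,  405022, "COMP030500", "ACER",
  182436,  183607, "COMP000008", "Hewlett-Packard"
)

# The gather() version: columns come out grouped by variable, not by monitor
wide <- df %>%
  group_by(CPU_ID) %>%
  mutate(rowno = row_number()) %>%
  ungroup() %>%
  gather(var, val, -CPU_ID, -rowno) %>%
  mutate(newcolname = paste0("monitor", rowno, "_", var)) %>%
  select(-c(var, rowno)) %>%
  pivot_wider(names_from = newcolname, values_from = val)

# Build the names monitor by monitor, keeping the original variable order
vars   <- setdiff(names(df), "CPU_ID")
n_max  <- max(count(df, CPU_ID)$n)
wanted <- paste0("monitor", rep(seq_len(n_max), each = length(vars)), "_", vars)

# all_of() fails loudly if one of the expected columns is missing
ordered <- wide %>% select(CPU_ID, all_of(wanted))
```

Swapping `each = length(vars)` for a `rep()` over `vars` instead would give the other ordering (all ID columns first, then all NAME columns, and so on).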
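Similar to @Lennyy's second solution, I'd suggest pivoting longer and then pivoting wider. One potential drawback is that you need to make all the values the same type, at least temporarily (e.g. character), but if necessary you can convert any of them back at the end.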
df %>%
  pivot_longer(cols = -CPU_ID, names_to = "variable", values_to = "value",
               values_transform = list(value = as.character)) %>%
  group_by(CPU_ID, variable) %>%
  mutate(variable = paste(variable, row_number(), sep = "_")) %>%
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value)
# A tibble: 8 x 13
CPU_ID ID_1 CONFIGITEM_NUMBER… NAME_1 AllocationDate_1 Model_1 Vendor_1 ID_2 CONFIGITEM_NUMBE… NAME_2 AllocationDate_2 Model_2 Vendor_2
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 182434 195251 101142000825 COMP000… 2014-04-10 HP ELITE DISP… Hewlett-Pa… 4050… 1142027261 COMP03… 2020-12-02 V173A ACER
2 182436 183607 101142000008 COMP000… 2014-04-18 HP ELITE DISP… Hewlett-Pa… NA NA NA NA NA NA
3 182437 228469 1142006861 COMP020… 2018-03-05 S22C45KBW Samsung 3418… 1142019822 COMP05… 2019-01-09 L1940T HP
4 182438 205930 101142001009 COMP050… 2019-05-20 S22C45KBW Samsung NA NA NA NA NA NA
5 182439 240546 1142008622 COMP050… 2016-09-16 SAMSUNG SYNCM… SAMSUNG NA NA NA NA NA NA
6 182462 184114 101142000515 COMP000… 2019-08-27 HP ELITE DISP… Hewlett-Pa… NA NA NA NA NA NA
7 182463 184113 101142000514 COMP000… 2019-08-28 HP ELITE DISP… Hewlett-Pa… NA NA NA NA NA NA
8 182464 184106 101142000507 COMP000… 2019-08-27 HP ELITE DISP… Hewlett-Pa… NA NA NA NA NA NA
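Converting columns back at the end could look roughly like this. It is a sketch on a three-row excerpt; the wide character table is built with a slightly different pipeline (row number added before pivoting longer) that yields the same `ID_1`, `NAME_1`, ... names, and the choice of which columns to turn back into numbers is an assumption about what is wanted.

```r
library(dplyr)
library(tidyr)

# Three-row excerpt of the question's data
df <- tibble::tribble(
  ~CPU_ID, ~ID,    ~CONFIGITEM_NUMBER, ~NAME,
  182434,  195251, 101142000825,       "COMP000572",
  182434,  405022, 1142027261,         "COMP030500",
  182436,  183607, 101142000008,       "COMP000008"
)

# Long-then-wide round trip: every value column ends up as character
wide_chr <- df %>%
  group_by(CPU_ID) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  pivot_longer(cols = -c(CPU_ID, n), names_to = "variable", values_to = "value",
               values_transform = list(value = as.character)) %>%
  pivot_wider(names_from = c(variable, n), values_from = value)

# Convert the originally numeric columns back, leaving the rest as character
wide <- wide_chr %>%
  mutate(across(starts_with(c("ID_", "CONFIGITEM_NUMBER_")), as.numeric))
```

`readr::type_convert()` is another option if guessing the types from the values is acceptable.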
For the record, what I ended up using (based on Lennyy's and Jon Spring's answers):
df %>%
  pivot_longer(
    cols = !CPU_ID,
    names_to = "variable",
    values_to = "value",
    values_transform = list(value = as.character)
  ) %>%
  group_by(CPU_ID, variable) %>%
  mutate(variable = paste0("monitor", row_number(), "_", variable)) %>%
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value)
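A possible alternative that avoids the character round trip, assuming tidyr 1.2.0 or later (where pivot_wider gained the `names_vary` argument): pivot wider directly, let `names_glue` build the `monitor<n>_<variable>` names, and set `names_vary = "slowest"` so the columns are grouped per monitor in their original order. Shown here as a sketch on a three-row excerpt of the data.

```r
library(dplyr)
library(tidyr)

# Three-row excerpt of the question's data
df <- tibble::tribble(
  ~CPU_ID, ~ID,    ~NAME,        ~Model,
  182434,  195251, "COMP000572", "HP ELITE DISPLAY E-231",
  182434,  405022, "COMP030500", "V173A",
  182436,  183607, "COMP000008", "HP ELITE DISPLAY E-231"
)

wide <- df %>%
  group_by(CPU_ID) %>%
  mutate(monitor_n = row_number()) %>%   # position of each monitor per computer
  ungroup() %>%
  pivot_wider(
    id_cols     = CPU_ID,
    names_from  = monitor_n,
    values_from = !c(CPU_ID, monitor_n),
    names_glue  = "monitor{monitor_n}_{.value}",  # e.g. monitor1_ID
    names_vary  = "slowest"                        # all monitor1_* before monitor2_*
  )
```

Because nothing is stacked into a single value column, the numeric columns stay numeric and no conversion back is needed.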