Is there a way to transform a data.table so that unique row elements become column names and then show element counts?
I have the following data.table:
structure(list(index = structure(c(1571270400, 1571356800, 1571616000,
1571702400, 1571788800, 1571875200, 1571961600, 1572220800, 1572307200,
1572393600), tzone = "", tclass = c("POSIXct", "POSIXt"), class = c("POSIXct",
"POSIXt")), A = structure(c(10L, 10L, 7L, 7L, 9L, 9L, 4L, 4L,
4L, 4L), .Label = c("12", "13", "14", "21", "24", "31", "34",
"41", "42", "43"), class = "factor"), AA = structure(c(2L, 2L,
2L, 2L, 2L, 7L, 7L, 7L, 7L, 7L), .Label = c("12", "13", "14",
"21", "23", "24", "31", "32", "34", "41", "42", "43"), class = "factor"),
AAC = structure(c(6L, 11L, 7L, 7L, 7L, 7L, 7L, NA, NA, 7L
), .Label = c("12", "13", "14", "21", "23", "24", "31", "34",
"41", "42", "43"), class = "factor"), AAL = structure(c(2L,
2L, 2L, 2L, 2L, 7L, 7L, 7L, 7L, 7L), .Label = c("12", "13",
"14", "21", "23", "24", "31", "32", "34", "41", "42", "43"
), class = "factor")), class = c("data.table", "data.frame"
), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x5614347b5790>, sorted = "index")
Here is how this data looks as a table:
index A B C D
1: 2019-10-17 43 13 24 13
2: 2019-10-18 43 13 43 13
3: 2019-10-21 34 13 31 13
4: 2019-10-22 34 13 31 13
5: 2019-10-23 42 13 31 13
6: 2019-10-24 42 31 31 31
7: 2019-10-25 21 31 31 31
8: 2019-10-28 21 31 <NA> 31
9: 2019-10-29 21 31 <NA> 31
10: 2019-10-30 21 31 31 31
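I would like to transform it so that the unique elements in the rows become column names, and those columns then show the frequency of each element.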
index 13 21 24 31 34 42 43 <NA>
1: 2019-10-17 2 0 1 0 0 0 1 0
2: 2019-10-18 2 0 0 0 0 0 2 0
3: 2019-10-21 2 0 0 1 1 0 0 0
4: 2019-10-22 2 0 0 1 1 0 0 0
5: 2019-10-23 2 0 0 1 0 1 0 0
6: 2019-10-24 3 0 0 0 0 1 0 0
7: 2019-10-25 3 1 0 0 0 0 0 0
8: 2019-10-28 2 1 0 0 0 0 0 1
9: 2019-10-29 2 1 0 0 0 0 0 1
10: 2019-10-30 3 1 0 0 0 0 0 0
I believe there should be a clever way to do this using reshape or data.table functions. A pointer in the right direction would be very helpful.
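Here is a solution using the newer tidyverse functions. It works on data.tables as well.
- First we convert from wide format to long format. The cols argument accepts tidyselect helpers that select columns by name; matches() selects columns by regular expression. You can read more about them in the manual: ?tidyselect::select_helpers
- Then we spread back to wide form
- We use values_fn to apply the length function to the values. This gives the number of times each value occurs for each index
- We can then optionally replace NA with 0 in all numeric columns
The example is below: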
library(tidyverse)
df %>%
  pivot_longer(cols = matches('^A')) %>%                 # convert to long form
  pivot_wider(id_cols = 'index', names_from = 'value',   # then spread wide again
              values_fn = list(value = length)) %>%      # return length of vals
  mutate_if(is.numeric, ~ ifelse(is.na(.), 0, .))        # replace NA with 0
# A tibble: 10 x 9
index `43` `13` `24` `34` `31` `42` `21` `NA`
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-10-16 17:00:00 1 2 1 0 0 0 0 0
2 2019-10-17 17:00:00 2 2 0 0 0 0 0 0
3 2019-10-20 17:00:00 0 2 0 1 1 0 0 0
4 2019-10-21 17:00:00 0 2 0 1 1 0 0 0
5 2019-10-22 17:00:00 0 2 0 0 1 1 0 0
6 2019-10-23 17:00:00 0 0 0 0 3 1 0 0
7 2019-10-24 17:00:00 0 0 0 0 3 0 1 0
8 2019-10-27 17:00:00 0 0 0 0 2 0 1 1
9 2019-10-28 17:00:00 0 0 0 0 2 0 1 1
10 2019-10-29 17:00:00 0 0 0 0 3 0 1 0
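As a hedged aside, the NA-to-0 step can probably be folded into pivot_wider() itself through its values_fill argument, which as far as I know is available in recent tidyr releases (1.1.0 or later for the scalar form used here). The sketch below runs on a small made-up tibble, so `toy` and its columns A and AA are assumptions for illustration, not the data from the question.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical, not the question's data): two value columns per index
toy <- tibble(
  index = as.Date(c("2019-10-17", "2019-10-18", "2019-10-21")),
  A     = c("43", "43", "34"),
  AA    = c("13", "43", "13")
)

counts <- toy %>%
  pivot_longer(cols = matches("^A")) %>%   # wide -> long: columns `name` and `value`
  pivot_wider(id_cols     = index,
              names_from  = value,         # each distinct value becomes a column
              values_from = value,
              values_fn   = length,        # count occurrences per index
              values_fill = 0L)            # combinations that never occur become 0

counts
```

Using values_fill keeps the counts as integers and avoids a separate pass over every numeric column, at the cost of requiring a newer tidyr.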
We can melt the dataset into 'long' format by specifying id.var, and then dcast it back to wide format with fun.aggregate specified as length:
library(data.table)
dcast(melt(dt, id.var = 'index'), as.IDate(index) ~ value, length)
# index NA 13 21 24 31 34 42 43
# 1: 2019-10-16 0 2 0 1 0 0 0 1
# 2: 2019-10-17 0 2 0 0 0 0 0 2
# 3: 2019-10-20 0 2 0 0 1 1 0 0
# 4: 2019-10-21 0 2 0 0 1 1 0 0
# 5: 2019-10-22 0 2 0 0 1 0 1 0
# 6: 2019-10-23 0 0 0 0 3 0 1 0
# 7: 2019-10-24 0 0 1 0 3 0 0 0
# 8: 2019-10-27 1 0 1 0 2 0 0 0
# 9: 2019-10-28 1 0 1 0 2 0 0 0
#10: 2019-10-29 0 0 1 0 3 0 0 0
Note: if we don't want the NA column, specify na.rm = TRUE in melt.
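To illustrate that note, here is a minimal sketch on a small made-up data.table; `toy_dt` and its columns are assumptions, not the question's data. Dropping missing values during melt() should leave no NA column after dcast().

```r
library(data.table)

# Toy data (hypothetical): character columns, one missing value in C
toy_dt <- data.table(
  index = as.IDate(c("2019-10-17", "2019-10-18", "2019-10-21")),
  A     = c("43", "43", "34"),
  C     = c("24", NA, "31")
)

# Default melt keeps the missing value, so dcast produces an "NA" column
with_na <- dcast(melt(toy_dt, id.vars = "index"),
                 index ~ value, fun.aggregate = length, value.var = "value")

# na.rm = TRUE drops missing values while melting, so no "NA" column appears
without_na <- dcast(melt(toy_dt, id.vars = "index", na.rm = TRUE),
                    index ~ value, fun.aggregate = length, value.var = "value")

names(with_na)
names(without_na)
```

Passing value.var explicitly only silences the message dcast() prints when it has to guess the value column; the counts are the same either way.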