R data.table: subsetting data.table/dataframe based on size of row value
This is a basic question, but I'm stumped:
I have the following R data.table:
library(data.table)
DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 A xy28352
up1 TRUE 4 A xy28352
up2 FALSE 1 A xy28352
up2 TRUE 0 X xy28352
up3 FALSE 12 Y xy28352
up3 TRUE 35 Z xy28352')
which prints as
> DT
unique_point biased data_points team groupID
1: up1 FALSE 3 A xy28352
2: up1 TRUE 4 A xy28352
3: up2 FALSE 1 A xy28352
4: up2 TRUE 0 X xy28352
5: up3 FALSE 12 Y xy28352
6: up3 TRUE 35 Z xy28352
The team column takes letter values from A to Z, 26 possibilities at the moment. If I count the rows for each value with this code:
DT[, counts := .N, by=c("team")]
this gives
> DT
unique_point biased data_points team groupID counts
1: up1 FALSE 3 A xy28352 3
2: up1 TRUE 4 A xy28352 3
3: up2 FALSE 1 A xy28352 3
4: up2 TRUE 0 X xy28352 1
5: up3 FALSE 12 Y xy28352 1
6: up3 TRUE 35 Z xy28352 1
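As a side note, the same per-team sizes can be viewed as a small summary table instead of a repeated column. This is only a sketch on the sample data above; the name team_sizes is made up for illustration.

```r
library(data.table)

DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 A xy28352
up1 TRUE 4 A xy28352
up2 FALSE 1 A xy28352
up2 TRUE 0 X xy28352
up3 FALSE 12 Y xy28352
up3 TRUE 35 Z xy28352')

# .N is the number of rows in each group; without := the result is a
# new table with one row per team that actually occurs in DT
team_sizes <- DT[, .N, by = team]
print(team_sizes)
```

For the sample data this has four rows (A, X, Y, Z); letters that never occur in team do not appear at all, which is why the zero counts need separate handling.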
I would like to create 26 new columns in DT, giving the size of each team: A, B, C, and so on.
The resulting data.table would look like:
> DT
unique_point biased data_points team groupID A B C ... Z
1: up1 FALSE 3 A xy28352 3 0 0 ... 1
2: up1 TRUE 4 A xy28352 3 0 0 ... 1
3: up2 FALSE 1 A xy28352 3 0 0 ... 1
4: up2 TRUE 0 X xy28352 3 0 0 ... 1
5: up3 FALSE 12 Y xy28352 3 0 0 ... 1
6: up3 TRUE 35 Z xy28352 3 0 0 ... 1
I'm not sure how to do this with data.table syntax.
EDIT: I'm also happy to do this with base R or dplyr.
What about plyr, would that work?
library(data.table)
library(plyr)
DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 A xy28352
up1 TRUE 4 A xy28352
up2 FALSE 1 A xy28352
up2 TRUE 0 X xy28352
up3 FALSE 12 Y xy28352
up3 TRUE 35 Z xy28352')
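# For each letter, count the matching rows and add that count to DT
# as a new column by reference (:= modifies DT in place)
ldply(LETTERS, function(x){
  n <- nrow(DT[team == x, ])   # 0 when no row has this team
  DT[, (x) := n]
  return(DT[team == x, ])
})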
> DT
unique_point biased data_points team groupID A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1: up1 FALSE 3 A xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
2: up1 TRUE 4 A xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
3: up2 FALSE 1 A xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
4: up2 TRUE 0 X xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
5: up3 FALSE 12 Y xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
6: up3 TRUE 35 Z xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
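For comparison, here is a minimal sketch that gets the same columns without plyr, using only base R's table() and data.table's :=. It assumes the 26 possible teams are exactly LETTERS; the name team_sizes is made up for illustration.

```r
library(data.table)

DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 A xy28352
up1 TRUE 4 A xy28352
up2 FALSE 1 A xy28352
up2 TRUE 0 X xy28352
up3 FALSE 12 Y xy28352
up3 TRUE 35 Z xy28352')

# Count rows per team over all 26 levels, so absent teams get a count of 0
team_sizes <- table(factor(DT$team, levels = LETTERS))

# Add one column per letter by reference; each single count is recycled
# down all rows of DT
DT[, (LETTERS) := as.list(as.integer(team_sizes))]
```

Converting team to a factor with explicit levels is what produces the zeros; counting the raw character column would only return the teams that are present.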
This is an unusual solution, but it does work. I used dplyr and tidyr.
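library(dplyr)
library(tidyr)
DT[, counts := .N, by = "team"]
x <- data.frame(team = LETTERS)                 # all 26 possible teams
y <- DT %>% select(team, counts) %>% unique()   # observed team sizes
# teams with no rows get 0 after spreading to one wide row
df <- x %>% left_join(y, "team") %>% spread(team, counts, fill = 0)
cbind(DT, df)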
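Note: left_join may issue a warning message (for example when team is a factor in one table and character in the other), but it does not alter the output, and there are workarounds.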
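In current tidyr, spread() is superseded by pivot_wider(), so the same idea can be sketched as below. This is one possible variant, not the answer's original code: it gets the zero counts from count() on a factor with .drop = FALSE rather than from a join, and the name wide is made up for illustration.

```r
library(data.table)
library(dplyr)
library(tidyr)

DT <- fread('unique_point biased data_points team groupID
up1 FALSE 3 A xy28352
up1 TRUE 4 A xy28352
up2 FALSE 1 A xy28352
up2 TRUE 0 X xy28352
up3 FALSE 12 Y xy28352
up3 TRUE 35 Z xy28352')

# One-row table with a column per letter; .drop = FALSE keeps the
# empty factor levels, so teams that never occur are counted as 0
wide <- as_tibble(DT) %>%
  mutate(team = factor(team, levels = LETTERS)) %>%
  count(team, .drop = FALSE) %>%
  pivot_wider(names_from = team, values_from = n)

# Attach the 26 counts to DT by reference; each value is recycled
DT[, (names(wide)) := as.list(wide)]
```

Because both sides of the computation use the same factor levels, no join is needed and the factor/character warning does not come up.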