How do I count the frequency of a value in a data frame based on a factor level in R?
I have a legal dataset in which every column is stored as a factor:
> str(df)
'data.frame': 2101 obs. of 4 variables:
$ specialty: Factor w/ 5 levels "Real Estate",..: 2 2 2 2 2 2 2 2 2 2 ...
$ col1 : Factor w/ 161 levels "10060","11404",..: 95 40 72 52 72 72 72 161 161 161 ...
$ col2 : Factor w/ 138 levels "0277T","11602",..: 63 18 76 29 138 50 138 138 138 138 ...
$ col3 : Factor w/ 106 levels "10061","10160",..: 44 58 106 51 106 58 106 106 106 106 ...
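Columns col1 through col3 hold 5-digit codes that correspond to specific legal procedures. A code can repeat within a column and across columns. The codes are stored as factors. I want the frequency of each code in a set of seven: [49585, 44310, 44320, 38564, 44125, 44150, 49419]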
> head(df)
specialty col1 col2 col3
1 Bankruptcy 49585 49000 44950
2 Tort 44140 38564 49255
3 Real Estate 49000 49419 NULL
4 Bankruptcy 44310 44120 49000
5 Real Estate 49000 NULL NULL
6 Tort 49000 44950 49255
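However, I only want the frequencies of these codes when they are associated with two particular levels of the specialty column: "Tort" and "Real Estate". The factors make this tricky. How can I count each code in the set only when it appears in the same row as one of those two levels?
Expected output: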
**Counts** 49585 44310 44320 38564 44125 44150 49419
Tort 12 230 232 1 21 2 23
Real Estate 280 50 40 92 121 12 726
Perhaps something like this is what you need:
df1 <- subset(df, specialty %in% c('Real Estate', 'Tort'))
library(reshape2)
dM <- melt(df1, id.var='specialty')[,-2]
dM[] <- lapply(dM, factor)
table(dM)
# value
#specialty 38564 44140 44950 49000 49255 49419 NULL
# Real Estate 0 0 0 2 0 1 3
# Tort 1 1 1 1 2 0 0
Or
res <- recast(df1, id.var='specialty', specialty~value, length)
res
# specialty 38564 44140 44950 49000 49255 49419 NULL
#1 Real Estate 0 0 0 2 0 1 3
#2 Tort 1 1 1 1 2 0 0
# keep only the code columns that have a non-zero count in every specialty
res[c(TRUE,!colSums(!res[-1]))]
# specialty 49000
#1 Real Estate 2
#2 Tort 1
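Neither variant above limits the result to the seven codes from the question. A minimal base R sketch of one way to do that follows, assuming the data look like the head(df) rows shown in the question. The names codes, groups, keep, long and counts are illustrative and not part of the original post. Converting with as.character() makes the comparison use the factor labels rather than the internal integer codes, and fixing the levels makes table() print a zero column for any target code that never occurs.

```r
# Sample rows copied from head(df) in the question; every column is a factor
df <- data.frame(
  specialty = c("Bankruptcy", "Tort", "Real Estate",
                "Bankruptcy", "Real Estate", "Tort"),
  col1 = c("49585", "44140", "49000", "44310", "49000", "49000"),
  col2 = c("49000", "38564", "49419", "44120", "NULL", "44950"),
  col3 = c("44950", "49255", "NULL", "49000", "NULL", "49255"),
  stringsAsFactors = TRUE
)

# Target codes and specialties (illustrative names, values from the question)
codes  <- c("49585", "44310", "44320", "38564", "44125", "44150", "49419")
groups <- c("Tort", "Real Estate")

# Keep only the rows for the two specialties of interest
keep <- df[df$specialty %in% groups, ]

# Stack the three code columns into one long character vector.
# as.character() works on the labels, not the factors' internal integers.
code_cols <- c("col1", "col2", "col3")
long <- data.frame(
  specialty = rep(as.character(keep$specialty), times = length(code_cols)),
  code      = unlist(lapply(keep[code_cols], as.character), use.names = FALSE)
)

# Fixed levels: absent target codes get a zero column, and codes outside
# the target set become NA, which table() leaves out by default.
counts <- table(
  specialty = factor(long$specialty, levels = groups),
  code      = factor(long$code, levels = codes)
)
counts
```

On the six sample rows this counts 38564 once under Tort and 49419 once under Real Estate, with zeros everywhere else. On the full data the same steps should give a table laid out like the expected output.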
Data
df <- structure(list(specialty = structure(c(1L, 3L, 2L, 1L, 2L, 3L
), .Label = c("Bankruptcy", "Real Estate", "Tort"), class = "factor"),
col1 = structure(c(4L, 1L, 3L, 2L, 3L, 3L), .Label = c("44140",
"44310", "49000", "49585"), class = "factor"), col2 = structure(c(4L,
1L, 5L, 2L, 6L, 3L), .Label = c("38564", "44120", "44950",
"49000", "49419", "NULL"), class = "factor"), col3 = structure(c(1L,
3L, 4L, 2L, 4L, 3L), .Label = c("44950", "49000", "49255",
"NULL"), class = "factor")), .Names = c("specialty", "col1",
"col2", "col3"), row.names = c("1", "2", "3", "4", "5", "6"),
class = "data.frame")