Contingency table with strings
I have this data frame; here is glimpse(df):
Observations: 2,211
Variables: 3
$ city <chr> "Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas", "Las Veg...
$ categories <chr> "c(\"Korean\", \"Sushi Bars\")", "c(\"Japanese\", \"Sushi Bars\")", "Tha...
$ is_open <chr> "0", "0", "1", "0", "1", "1", "0", "1", "0", "1", "1", "1", "0", "1", "1...
Here is a small dput():
structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas",
"Phoenix", "Las Vegas"), categories = c("c(\"Korean\", \"Sushi Bars\")",
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")",
"Korean"), is_open = c("0", "0", "1", "0", "1")), .Names = c("city",
"categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")
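The data consist of different cities (city) with different cuisine categories (categories).
I want to build a contingency table to see which cuisines are associated with being closed (is_open = 0) or open (is_open = 1). I tried the following, but got this error: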
xtabs(is_open ~., data = df)
Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument
When I convert the variables with as.factor(), I get many tables instead of a single one. Is there a way to make it look like the following?
Category/City   Las Vegas   Pittsburgh
Korean          50/50       30/70
Sushi Bars      40/60       40/60
The numbers in each cell are the frequencies of closed (is_open = 0) and open (is_open = 1) for each category in each city (e.g., for Korean in Las Vegas, closed (0) and open (1) are 50/50).
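Here is a solution that uses data.table to cast your data, with a counting function based on stri_count from the stringi package. The counting could also be done with table, or with a sum(grepl()) and ifelse construction (depending on how much flexibility you need in terms of data structure, speed requirements, etc.). Note that I also reformatted your data into a cleaner "long format" with the help of this answer. If you format your data this way from the start, you can skip this reformatting step. I hope this is what you are looking for.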
#your data
df <- structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas")
,categories = c("c(\"Korean\", \"Sushi Bars\")",
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")",
"Korean")
,is_open = c("0", "0", "1", "0", "1"))
,.Names = c("city", "categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")
library(data.table)
library(stringi)
# reshape to long format: one row per (city, is_open, category)
DT <- as.data.table(df)
# strip the leading c(, the trailing ) and the quotes around each category
DT[, categories := gsub("^c\\(|\\)$|\"", "", categories)]
DT <- DT[, .(categories = unlist(strsplit(as.character(categories), ", ", fixed = TRUE))),
         by = .(city, is_open)]
#          city is_open categories
# 1:  Las Vegas       0     Korean
# 2:  Las Vegas       0 Sushi Bars
# 3: Pittsburgh       0   Japanese
# 4: Pittsburgh       0 Sushi Bars
# 5:  Las Vegas       1       Thai
# 6:  Las Vegas       1     Korean
# 7:    Phoenix       0 Sushi Bars
# 8:    Phoenix       0   Japanese
# pass all_unique_count_items explicitly so that values absent from x still get a 0
calc_count_distr <- function(x, all_unique_count_items) {
  count_distribution <- sapply(all_unique_count_items, function(y) {
    # occurrences of y in x as a whole-number percentage of length(x)
    round(100 * sum(stri_count_fixed(x, y)) / length(x))
  })
  paste(count_distribution, collapse = "/")
}
# each cell reads "percent closed/percent open"; sort() fixes the order to "0" then "1"
dcast(DT, categories ~ city, value.var = "is_open",
      fun.aggregate = function(x) calc_count_distr(x, all_unique_count_items = sort(unique(DT$is_open))),
      fill = NA)
#    categories Las Vegas Phoenix Pittsburgh
# 1:   Japanese        NA   100/0      100/0
# 2:     Korean     50/50      NA         NA
# 3: Sushi Bars     100/0   100/0      100/0
# 4:       Thai     0/100      NA         NA
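As mentioned above, the counting can also be done with table-style tools. The following is a minimal base R sketch of that idea, not a drop-in replacement for the data.table version. It also shows why the original xtabs() call failed: the left-hand side of an xtabs() formula is treated as counts to be summed, so it must be numeric, whereas a one-sided formula simply counts rows and works with character columns. The names long, counts, pct and res are made up for this illustration, and the sketch assumes is_open only ever takes the values "0" and "1".

```r
# Small sample from the question (is_open is stored as character).
df <- data.frame(
  city = c("Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas"),
  categories = c('c("Korean", "Sushi Bars")', 'c("Japanese", "Sushi Bars")',
                 "Thai", 'c("Sushi Bars", "Japanese")', "Korean"),
  is_open = c("0", "0", "1", "0", "1"),
  stringsAsFactors = FALSE
)

# Strip the c( ... ) wrapper and the quotes, then split into one category per row.
clean <- gsub('^c\\(|\\)$|"', "", df$categories)
parts <- strsplit(clean, ", ", fixed = TRUE)
n <- lengths(parts)
long <- data.frame(
  city = rep(df$city, n),
  # Fixed levels (assumption: only "0" and "1" occur) keep both slices in the table.
  is_open = factor(rep(df$is_open, n), levels = c("0", "1")),
  categories = unlist(parts),
  stringsAsFactors = FALSE
)

# One-sided formula: count rows per category x city x is_open combination.
counts <- xtabs(~ categories + city + is_open, data = long)

# Percentage closed/open within each category-city cell (empty cells give NaN).
pct <- round(100 * prop.table(counts, margin = c(1, 2)))

# Paste the "0" and "1" slices together as "closed/open" strings.
res <- matrix(paste(pct[, , "0"], pct[, , "1"], sep = "/"),
              nrow = nrow(pct), dimnames = dimnames(pct)[1:2])
res[res == "NaN/NaN"] <- NA  # category-city combinations with no rows
print(res)
```

On the five sample rows this gives 50/50 for Korean in Las Vegas and NA for combinations that never occur, matching the data.table result above.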