Contingency table with strings

I have this data frame. Here is the output of glimpse(df):

Observations: 2,211
Variables: 3
$ city       <chr> "Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas", "Las Veg...
$ categories <chr> "c(\"Korean\", \"Sushi Bars\")", "c(\"Japanese\", \"Sushi Bars\")", "Tha...
$ is_open    <chr> "0", "0", "1", "0", "1", "1", "0", "1", "0", "1", "1", "1", "0", "1", "1...

Here is a small dput() sample:

structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", 
"Phoenix", "Las Vegas"), categories = c("c(\"Korean\", \"Sushi Bars\")", 
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")", 
"Korean"), is_open = c("0", "0", "1", "0", "1")), .Names = c("city", 
"categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")

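The data consists of different cities (city) with different cuisine categories (categories). I want to build a contingency table to see which cuisines are associated with being closed (is_open = 0) or open (is_open = 1).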

I want to do this with a contingency table. To that end I tried the following, but got this error:

xtabs(is_open ~., data = df)

Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument

When I convert the variables with as.factor(), I get many tables instead of a single one. Is there a way to make it look like the following?

Category/City           Las Vegas     Pittsburgh
           Korean       50/50         30/70
           Sushi Bars   40/60         40/60

The numbers in the cells are the frequencies of closed (is_open = 0) and open (is_open = 1) for each category in each city (e.g., for Korean in Las Vegas, closed (0) and open (1) are 50/50).

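Here is a solution that uses data.table to cast your data, with a counting function based on stri_count from the stringi package (stri_count_fixed in the code below). The counting could also be done with table, sum(grepl()), or ifelse constructs, depending on how much flexibility you need with respect to data structure, speed requirements, and so on. Note that I also reformatted your data into a cleaner "long format" with the help of this answer. If you format your data this way from the start, you can skip this reformatting step. I hope this is what you are looking for.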

#your data
df <- structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas")
                       ,categories = c("c(\"Korean\", \"Sushi Bars\")", 
                                     "c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")", 
                                     "Korean")
                       ,is_open = c("0", "0", "1", "0", "1"))
                       ,.Names = c("city",  "categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")

library(data.table)
library(stringi)                                  

#format data to correct "long format"
DT <- as.data.table(df)
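#strip the c(" prefix, the ") suffix and any remaining quotes;
#backslashes are doubled because "\(" is not a valid escape in an R string
DT[, categories := gsub("c\\(\"|\"\\)|\"", "", categories)]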
DT <- DT[, .(categories = unlist(strsplit(as.character(categories), ", ", fixed = TRUE))), 
         by = .(city, is_open)]
#           city is_open categories
# 1:  Las Vegas       0     Korean
# 2:  Las Vegas       0 Sushi Bars
# 3: Pittsburgh       0   Japanese
# 4: Pittsburgh       0 Sushi Bars
# 5:  Las Vegas       1       Thai
# 6:  Las Vegas       1     Korean
# 7:    Phoenix       0 Sushi Bars
# 8:    Phoenix       0   Japanese

#specify all_unique_count_items to also cover items that are not present in x
calc_count_distr <-  function(x, all_unique_count_items) {

    count_distribution <- sapply(all_unique_count_items, function(y) {
                                     #share of entries in x equal to item y, as a percentage
                                     100*round(sum(stri_count_fixed(x, y))/length(x), digits = 2)
                                })
    paste(count_distribution, collapse = "/")
}

#cells are formatted as closed/open, following the order of unique(DT$is_open)
dcast(DT, categories ~ city, value.var = "is_open"
                 ,fun.aggregate = function(x) calc_count_distr(x, all_unique_count_items = unique(DT$is_open))
                 ,fill = NA)
#   categories Las Vegas Phoenix Pittsburgh
#1:   Japanese        NA   100/0      100/0
#2:     Korean     50/50      NA         NA
#3: Sushi Bars     100/0   100/0      100/0
#4:       Thai     0/100      NA         NA
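
The answer mentions that the counting step could also be done without stringi. As a rough sketch of that idea in base R only, assuming is_open only ever takes the values "0" and "1", the same closed/open table can be built with tapply. The long-format data is typed in by hand here so the snippet runs on its own, and pct_closed_open is a hypothetical helper name, not part of any package.

```r
# Long-format data, copied from the reshaped DT shown above
long <- data.frame(
  city = c("Las Vegas", "Las Vegas", "Pittsburgh", "Pittsburgh",
           "Las Vegas", "Las Vegas", "Phoenix", "Phoenix"),
  is_open = c("0", "0", "0", "0", "1", "1", "0", "0"),
  categories = c("Korean", "Sushi Bars", "Japanese", "Sushi Bars",
                 "Thai", "Korean", "Sushi Bars", "Japanese"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: percentage closed / percentage open for one group.
# Assumes is_open contains only "0" (closed) and "1" (open).
pct_closed_open <- function(x) {
  closed <- round(100 * mean(x == "0"))
  paste(closed, 100 - closed, sep = "/")
}

# tapply over two grouping vectors returns a categories x city matrix;
# combinations that do not occur in the data are left as NA
result <- tapply(long$is_open,
                 list(long$categories, long$city),
                 pct_closed_open)
print(result)
```

This avoids the extra dependencies at the cost of hard-coding the two levels. The stri_count version above stays more general if is_open can take other values.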