How to count observations matching the values of a vector of characters

I have a data frame with a large number of observations and variables of different types. Here is my sample data frame:

mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product", 
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet", 
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate", 
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2, 
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket", 
"Supermarket", "Supermarket", "Little Store", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Gas Station", 
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA, 
-16L), class = "data.frame")
# of observation  Product           Price in $  Place
1                 Pizza             2           Supermarket
2                 Cleaning Product  3.5         Supermarket
3                 Chocolate         1           Supermarket
4                 Fruit             1           Little Store
5                 Red Meat          2.5         Supermarket
6                 Cleaning Product  3.5         Supermarket
7                 Bracelet          3           Little Store
8                 Trucker Hat       5           Gas Station
9                 Shirt             15          Supermarket
10                Shirt             20          Supermarket
11                Chicken Breast    2.5         Little Store
12                Chocolate         1           Gas Station
13                Cereal            2           Gas Station
14                Fruit             1           Little Store
15                Cleaning Product  3.5         Supermarket
16                Trucker Hat       4           Supermarket

I also have a character vector:

non.food <- c("Cleaning", "Hat", "Shirt", "Bracelet")

I need to remove the observations that match any of the words in the vector non.food. To do that, I use the following code:

library(dplyr)    # %>% and filter()
library(stringr)  # str_detect()
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")
mydf <- mydf %>%
  filter(!str_detect(Product, non.food))

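It runs fine, but I have the impression that I am losing more observations than I should. For example, going by the sample I should lose 8 observations, but I actually end up losing 10. (I am not showing this in the sample because my real data has 8916 observations; the sample is only an example of the kind of data frame I am dealing with.)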

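So I would first like to count the number of observations that match any of the words in the vector, to make sure my code is not removing more observations than it should. I can't use commands such as which(mydf$Product == non.food) or sum(mydf$Product == non.food). I could do the opposite of my code and filter for only the observations that match my strings as a check, but that takes more time and creates extra data that I don't want. Does anyone have an idea?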

Also, if my code really is removing more observations than it should, does anyone have a solution?

Thanks in advance.

You can add a count variable that uses case_when to count the number of deleted rows, for example:

library(tidyverse)

# Rebuild the sample data from the question
df <- tribble(
  ~"# of observation", ~Product, ~"Price in $", ~Place,
  1, "Pizza", 2, "Supermarket",
  2, "Cleaning Product", 3.5, "Supermarket",
  3, "Chocolate", 1, "Supermarket",
  4, "Fruit", 1, "Little Store",
  5, "Red Meat", 2.5, "Supermarket",
  6, "Cleaning Product", 3.5, "Supermarket",
  7, "Bracelet", 3, "Little Store",
  8, "Trucker Hat", 5, "Gas Station",
  9, "Shirt", 15, "Supermarket",
  10, "Shirt", 20, "Supermarket",
  11, "Chicken Breast", 2.5, "Little Store",
  12, "Chocolate", 1, "Gas Station",
  13, "Cereal", 2, "Gas Station",
  14, "Fruit", 1, "Little Store",
  15, "Cleaning Product", 3.5, "Supermarket",
  16, "Trucker Hat", 4, "Supermarket"
)

# Collapse the keywords into a single regex alternation
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")

mydf <- df %>%
  # Flag rows whose Product matches any keyword (1 = will be removed)
  mutate(count = case_when(
    str_detect(Product, non.food) ~ 1,
    TRUE ~ 0
  )) %>%
  # Total number of flagged rows, repeated on every row
  mutate(sum_deleted = sum(count)) %>%
  # Drop the flagged rows
  filter(!str_detect(Product, non.food))
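
If the goal is to find out which keyword is responsible for unexpected removals, a per-keyword tally may help. The following is only a minimal sketch in base R, assuming each keyword should be treated as a literal substring; the names products, keywords, hits_per_keyword and total_hits are illustrative and not from the question.

```r
# Product column copied from the sample data in the question
products <- c("Pizza", "Cleaning Product", "Chocolate", "Fruit", "Red Meat",
              "Cleaning Product", "Bracelet", "Trucker Hat", "Shirt", "Shirt",
              "Chicken Breast", "Chocolate", "Cereal", "Fruit",
              "Cleaning Product", "Trucker Hat")
keywords <- c("Cleaning", "Hat", "Shirt", "Bracelet")

# For each keyword, count the products that contain it as a literal substring
hits_per_keyword <- sapply(
  keywords,
  function(k) sum(grepl(k, products, fixed = TRUE))
)

# Count the rows matching at least one keyword (the rows the filter would drop)
pattern <- paste(keywords, collapse = "|")
total_hits <- sum(grepl(pattern, products))

print(hits_per_keyword)
print(total_hits)
```

Because a single row can contain more than one keyword, the per-keyword counts may add up to more than total_hits on real data; a keyword with a surprisingly high count is a good place to start looking.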

To count the matching or non-matching elements, you can use:

num_foods <- nrow(mydf[!str_detect(mydf$Product, non.food),])
num_non_foods <- nrow(mydf[str_detect(mydf$Product, non.food),])

As you can see, num_foods == 8 and num_non_foods == 8, so your code seems to do what it should.
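
On the full data, one possible reason for losing more rows than expected is that the pattern matches anywhere inside a string, so "Hat" would also match a product such as "Hatch Chile Peppers". The sketch below shows one way to check for this by wrapping the keywords in word boundaries (\\b). It is an assumption about the cause, not a diagnosis; the products vector is hypothetical and does not come from the question's data. Base R grepl is used here, and the same pattern string can be passed to str_detect.

```r
keywords <- c("Cleaning", "Hat", "Shirt", "Bracelet")

# Hypothetical products, chosen to illustrate a partial-word match
products <- c("Trucker Hat", "Hatch Chile Peppers", "Shirt", "Pizza")

# Loose pattern: matches a keyword anywhere, even inside a longer word
loose <- paste(keywords, collapse = "|")

# Strict pattern: the keyword must appear as a whole word
strict <- paste0("\\b(", loose, ")\\b")

loose_match  <- grepl(loose, products, perl = TRUE)
strict_match <- grepl(strict, products, perl = TRUE)

# Products removed by the loose pattern but kept by the strict one
over_matched <- products[loose_match & !strict_match]
print(over_matched)
```

If over_matched is non-empty on the real data, using the strict pattern in the filter would keep those rows. Whether whole-word matching is the right rule depends on how the product names are written, so the listed rows are worth inspecting before changing the filter.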

Data

mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product", 
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet", 
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate", 
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2, 
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket", 
"Supermarket", "Supermarket", "Little Store", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Gas Station", 
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA, 
-16L), class = "data.frame")