Extract certain characters from a list and convert them into a character vector

One column of my data frame is a list of character vectors. This is the column categories:

str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   4 obs. of  3 variables:
 $ categories:List of 4
  ..$ : chr  "Tex-Mex" "Mexican" "Fast Food" "Restaurants"
  ..$ : chr  "Hawaiian" "Restaurants" "Barbeque"
  ..$ : chr  "Restaurants" "Italian" "Seafood"
  ..$ : chr  "Restaurants" "Mexican" "American (Traditional)"
 $ name      : chr  "Taco Bell" "Ohana Hawaiian BBQ" "Carrabba's Italian Grill" "Don Tequila"
 $ type      : chr  "business" "business" "business" "business"

Here is the dput of the first four rows:

structure(list(categories = list(c("Tex-Mex", "Mexican", "Fast Food", 
"Restaurants"), c("Hawaiian", "Restaurants", "Barbeque"), c("Restaurants", 
"Italian", "Seafood"), c("Restaurants", "Mexican", "American (Traditional)"
)), name = c("Taco Bell", "Ohana Hawaiian BBQ", "Carrabba's Italian Grill", 
"Don Tequila"), type = c("business", "business", "business", 
"business")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", 
"data.frame"), .Names = c("categories", "name", "type"))

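I want to extract certain values from that list so that they are the only values left in each vector.

For example, I want to filter out every value that is not "Mexican" or "Restaurants", so that the only remaining values are "Mexican" and "Restaurants". To do that, I tried this solution: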

df_test <- df %>% unnest(categories) %>% 
          filter(str_detect(categories, "Mexican") |
                 str_detect(categories, "Restaurants")) %>% 
          nest(categories)

But the result looks like this:

str(df_test)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   4 obs. of  3 variables:
 $ name: chr  "Taco Bell" "Ohana Hawaiian BBQ" "Carrabba's Italian Grill" "Don Tequila"
 $ type: chr  "business" "business" "business" "business"
 $ data:List of 4
  ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  1 variable:
  .. ..$ categories: chr  "Mexican" "Restaurants"
  ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1 obs. of  1 variable:
  .. ..$ categories: chr "Restaurants"
  ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1 obs. of  1 variable:
  .. ..$ categories: chr "Restaurants"
  ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  1 variable:
  .. ..$ categories: chr  "Restaurants" "Mexican"

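The problem is that afterwards the column is not a character vector like the type column.

Is it possible to filter out those values so that, after the process, the column is a plain character vector like the name and type columns? I do not want to replace the values/rows removed by this process. So if a row contains neither "Mexican" nor "Restaurants", that row should be dropped.

Packages used: dplyr, stringr (unnest() and nest() come from tidyr)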

Use lapply to subset each element of the list (df1 here is the data frame from the dput above):

lapply(df1$categories, function(x) x[x %in% c("Mexican", "Restaurants")])

[[1]]
[1] "Mexican"     "Restaurants"

[[2]]
[1] "Restaurants"

[[3]]
[1] "Restaurants"

[[4]]
[1] "Restaurants" "Mexican"
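
Note that %in% matches whole values, whereas str_detect in the question matches substrings. With the four rows shown, both give the same result, but they can differ on other data. A minimal base R sketch of the difference, using grepl as a stand-in for str_detect (the value "New Mexican Cuisine" is made up for illustration and does not appear in the original data):

```r
# Illustrative values only; "New Mexican Cuisine" is a hypothetical category.
x <- c("Tex-Mex", "Mexican", "New Mexican Cuisine", "Restaurants")
targets <- c("Mexican", "Restaurants")

# Exact matching: keeps only elements equal to one of the targets.
exact <- x[x %in% targets]

# Substring matching: keeps every element that contains either target.
partial <- x[grepl("Mexican", x, fixed = TRUE) |
             grepl("Restaurants", x, fixed = TRUE)]

print(exact)
print(partial)
```

If only whole category names should survive, the %in% form is probably the safer choice.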

Add a row with no matching categories, then filter the rows:

df1 <- rbind(df1, c(list("Nothing to match"), "drop me", "business"))
df1$categories <- lapply(df1$categories, function(x) x[x %in% c("Mexican", "Restaurants")])
df1[sapply(df1$categories, length) > 0, ]

Collapse the list into a string:

df1$categories <- sapply(df1$categories, function(x) paste(sort(x[x %in% c("Mexican", "Restaurants")]), collapse=" "))
df1[nchar(df1$categories) > 0, ]

# A tibble: 4 x 3
           categories                     name     type
                <chr>                    <chr>    <chr>
1 Mexican Restaurants                Taco Bell business
2         Restaurants       Ohana Hawaiian BBQ business
3         Restaurants Carrabba's Italian Grill business
4 Mexican Restaurants              Don Tequila business
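
For reference, the whole approach can be sketched end to end in base R. This is only an illustration under a few assumptions: the data is rebuilt as a plain data.frame rather than a tibble, the extra "drop me" row is added just to show a row being removed, and ", " is used as the separator (a choice made here because some category names, such as "Fast Food", contain spaces).

```r
# Rebuild the example data as a plain data.frame with a list column.
df1 <- data.frame(
  name = c("Taco Bell", "Ohana Hawaiian BBQ", "Carrabba's Italian Grill",
           "Don Tequila", "drop me"),
  type = "business",
  stringsAsFactors = FALSE
)
df1$categories <- list(
  c("Tex-Mex", "Mexican", "Fast Food", "Restaurants"),
  c("Hawaiian", "Restaurants", "Barbeque"),
  c("Restaurants", "Italian", "Seafood"),
  c("Restaurants", "Mexican", "American (Traditional)"),
  "Nothing to match"  # extra row, present only so that one row gets dropped
)

keep <- c("Mexican", "Restaurants")

# Keep only the wanted values in each list element, in sorted order.
kept <- lapply(df1$categories, function(x) sort(x[x %in% keep]))

# Drop rows where nothing matched, then collapse each element to one string.
has_match <- lengths(kept) > 0
result <- df1[has_match, c("name", "type")]
result$categories <- vapply(kept[has_match], paste, character(1),
                            collapse = ", ")

str(result)
```

Here vapply with character(1) guarantees that categories ends up as an ordinary character vector, one string per row, rather than a list.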