Tagging/categorizing a string column using multiple matching patterns
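I have a data frame with a column of strings that needs to be categorized against another data frame, which holds category labels in one column and the matching terms/patterns in another.
There are 50+ categories. Each string can match several categories, while some strings match none. How can I efficiently tag these strings with the category labels?
Below is a simple example dataset and the output I'm hoping for. If it makes any difference, the strings in the real dataset are much longer than these samples, and there are a few hundred thousand of them.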
recipes <- c('fresh asparagus', 'a bunch of bananas', 'one pound pork', 'no fruits, no veggies, no nothing', 'broccoli or spinach','I like apples, asparagus, and pork', 'meats like lamb', 'venison sausage and fried eggs', 'spinach and arugula salad', 'scrambled or poached eggs', 'sourdough english muffins')
recipes_df <- data.frame(recipes, stringsAsFactors = FALSE)
category <- c('vegetable', 'fruit', 'meat','bread','dairy')
items <- c('arugula|asparagus|broccoli|peas|spinach', 'apples|bananas|blueberries|oranges', 'lamb|pork|turkey|venison', 'sourdough', 'buttermilk|butter|cream|eggs')
category_df <- data.frame(category, items)
This is the output I'm hoping for:
    recipes                             recipes_category
1   fresh asparagus                     vegetable
2   a bunch of bananas                  fruit
3   one pound pork                      meat
4   no fruits, no veggies, no nothing   <NA>
5   broccoli or spinach                 vegetable
6   I like apples, asparagus, and pork  fruit, vegetable, meat
7   meats like lamb                     meat
8   venison sausage and fried eggs      meat, dairy
9   spinach and arugula salad           vegetable
10  scrambled or poached eggs           dairy
11  sourdough english muffins           bread
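I think some combination of grepl and a for loop, or some version of apply, is needed, but the attempts below really expose how little R I know. For example, sapply(category_df$items, grepl, recipes_df$recipes) gives me the matches I expect,
but I'm not sure how to convert those results into the simple column I need.
If I use the categorizing function found here, it only matches one category to each string: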
categorize_food <- function(df, searchString, category) {
  df$category <- "OTHER"
  for(i in seq_along(searchString)) {
    list <- grep(searchString[i], df[,1], ignore.case=TRUE)
    if (length(list) > 0) {
      df$category[list] <- category[i]
    }
  }
  df
}
recipes_cat <- categorize_food(recipes_df, category_df$items, category_df$category)
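Similarly, the function found here is the closest to what I'm looking for, but I don't understand why the category numbers map the way they do. I would expect the vegetable category to be 1 rather than 2, and the dairy category to be 5 rather than 3.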
vec = category_df$items
recipes_df$category = apply(recipes_df, 1, function(u){
  bool = sapply(vec, function(x) grepl(x, u[['recipes']]))
  if(any(bool)) vec[bool] else NA
})
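On the numbering puzzle: a likely cause, assuming an R version before 4.0 where data.frame() defaults to stringsAsFactors = TRUE, is that category_df$items is stored as a factor. Factor levels are sorted alphabetically, so the integer codes follow the alphabetical order of the pattern strings rather than their row order. A minimal sketch that reproduces those codes (items_factor and codes are illustrative names, not from the original post):

```r
items <- c('arugula|asparagus|broccoli|peas|spinach',
           'apples|bananas|blueberries|oranges',
           'lamb|pork|turkey|venison',
           'sourdough',
           'buttermilk|butter|cream|eggs')

# Mimic what data.frame(category, items) does when stringsAsFactors = TRUE
items_factor <- factor(items)

# Levels are sorted alphabetically: apples..., arugula..., buttermilk...,
# lamb..., sourdough
levels(items_factor)

# Integer codes in row order (vegetable, fruit, meat, bread, dairy)
codes <- as.integer(items_factor)
codes
# 2 1 4 5 3
```

Building category_df with stringsAsFactors = FALSE, or wrapping the column in as.character(), should make vec[bool] return the pattern strings. Indexing category_df$category with bool instead would return the labels themselves.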
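The aggregation near the end is a bit slow for large datasets, so it may be worth looking for a faster way (data.table?) to collapse the rows into strings, but this should generally work: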
# Split the horrid '|'-separated values into one search term per row
tmplist <- strsplit(items, "|", fixed=TRUE)
searchterms <- data.frame(category=rep(category, sapply(tmplist, length)),
                          items=unlist(tmplist), stringsAsFactors=FALSE)
# For each search term, collect the recipes that contain it
res <- lapply(searchterms$items, grep, x=recipes, value=TRUE)
matched_times <- sapply(res, length)
# Repeat each category name once per matching recipe, giving a tidy
# (category, recipe) table
df_matched <- data.frame(category = rep(searchterms$category[matched_times != 0],
                                        matched_times[matched_times != 0]),
                         recipes = unlist(res))
# Add the recipes that matched nothing, with NA as their category
df_unmatched <- data.frame(category = NA, recipes = recipes[!recipes %in% unlist(res)])
df <- rbind(df_matched, df_unmatched)
# Collapse back to one row per recipe (makes the data untidy, as you asked);
# collapse= rather than sep= is what joins the elements of a single vector
final <- aggregate(category~recipes, data=df, paste, collapse=",", na.action=na.pass)
But this still leaves us with duplicated "vegetable, vegetable" entries. Can't have that:
SplitFunction <- function(x) {
  b <- unlist(strsplit(x, ','))
  deduped <- b[!duplicated(b)]
  return(paste(deduped, collapse=", "))
}
SplitFunctionV <- Vectorize(SplitFunction)
final$category <- SplitFunctionV(final$category)
Result:
final
    recipes                             category
1   a bunch of bananas                  fruit
2   broccoli or spinach                 vegetable
3   fresh asparagus                     vegetable
4   I like apples, asparagus, and pork  vegetable, fruit, meat
5   meats like lamb                     meat
6   one pound pork                      meat
7   scrambled or poached eggs           dairy
8   sourdough english muffins           bread
9   spinach and arugula salad           vegetable
10  venison sausage and fried eggs      meat, dairy
11  no fruits, no veggies, no nothing   NA
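A more direct route, building on the sapply/grepl idea from the question, is to keep the logical match matrix and collapse each row into a label string. The following is only a sketch in base R under the sample data's assumptions; hits and recipes_category are illustrative names, and it has not been timed on hundreds of thousands of rows.

```r
recipes <- c('fresh asparagus', 'a bunch of bananas', 'one pound pork',
             'no fruits, no veggies, no nothing', 'broccoli or spinach',
             'I like apples, asparagus, and pork', 'meats like lamb',
             'venison sausage and fried eggs', 'spinach and arugula salad',
             'scrambled or poached eggs', 'sourdough english muffins')
category <- c('vegetable', 'fruit', 'meat', 'bread', 'dairy')
items <- c('arugula|asparagus|broccoli|peas|spinach',
           'apples|bananas|blueberries|oranges',
           'lamb|pork|turkey|venison', 'sourdough',
           'buttermilk|butter|cream|eggs')

# One grepl() call per category: each pattern is matched against all
# recipes at once, giving one logical column per category
hits <- vapply(items, grepl, logical(length(recipes)), x = recipes)
# Force a matrix shape even when there is only a single recipe
hits <- matrix(hits, nrow = length(recipes))

# Collapse each row into a comma-separated label string, NA if nothing matched
recipes_category <- apply(hits, 1, function(row) {
  if (any(row)) paste(category[row], collapse = ", ") else NA_character_
})

data.frame(recipes, recipes_category, stringsAsFactors = FALSE)
```

Because each category is tested once per recipe, no duplicate labels appear and no de-duplication step is needed. Labels come out in the row order of category, not in order of appearance in the string, and the patterns match substrings, so a term like peas would also match inside a longer word unless the patterns are wrapped in \\b word boundaries.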
Here is a very simple tidyverse option:
library(tidyverse)
# reformat category data frame so each item has its own line:
category_df <-
  category_df %>%
  mutate(items = str_split(items, "\\|")) %>%
  unnest(items)
# then use str_extract_all() to find every item in each recipe string:
recipes_df %>%
  mutate(recipe_category = str_extract_all(recipes, paste(category_df$items, collapse = '|')))
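Note that this pipeline returns the matched items (asparagus, pork, ...) rather than the category labels. A minimal sketch of mapping them back to labels follows. It is written in base R so it runs without extra packages, with regmatches()/gregexpr() standing in for str_extract_all(); lookup, found and to_labels are illustrative names that do not appear in the original answer.

```r
recipes <- c('fresh asparagus', 'no fruits, no veggies, no nothing',
             'I like apples, asparagus, and pork', 'spinach and arugula salad')
category <- c('vegetable', 'fruit', 'meat', 'bread', 'dairy')
items <- c('arugula|asparagus|broccoli|peas|spinach',
           'apples|bananas|blueberries|oranges',
           'lamb|pork|turkey|venison', 'sourdough',
           'buttermilk|butter|cream|eggs')

# Named vector mapping every single item to its category label
terms <- strsplit(items, "|", fixed = TRUE)
lookup <- setNames(rep(category, lengths(terms)), unlist(terms))

# Extract every matched item from each recipe (list of character vectors)
pattern <- paste(names(lookup), collapse = "|")
found <- regmatches(recipes, gregexpr(pattern, recipes))

# Translate items to labels, drop repeats, and use NA when nothing matched
to_labels <- function(x) {
  if (length(x) == 0) return(NA_character_)
  paste(unique(lookup[x]), collapse = ", ")
}
recipe_category <- vapply(found, to_labels, character(1))
recipe_category
```

With this approach the labels appear in the order the items occur in each string, which matches the ordering in the desired output from the question. The same to_labels step could be applied to the list column produced by str_extract_all().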