Match multiple items in list to string in R
I have the data frame below and am struggling to detect which items from a list occur within each of its string elements:
original_df <- structure(list(title = c("Film Review: Almost Christmas", "Film Review: Mascots",
"Women s Basketball Upstages No. 2 California Baptist", "Men s Basketball Goes 2-0 In Opening Home Matchups",
"Women s Soccer Wins 16th Consecutive Game, Moves Onto Third Round of Tournament",
"The Hype About Hullabaloo"), tags = c("[u'Arts & Entertainment', u'Films & TV', u'Trending', u'Almost Christmas', u'Danny Glover', u'David E. Talbert', u'family', u'Film', u'Gabrielle Union', u'Holiday', u'JB Smoove', u'movie', u'review']",
"[u'Arts & Entertainment', u'Films & TV', u'Homepage', u'Trending', u'Chris O\u2019Dowd', u'Christopher Guest', u'Ed Begley Jr.', u'Film', u'Fred Willard', u'Jane Lynch', u'Mascots', u'movie', u'Netflix', u'Parker Posey', u'review', u'Spinal Tap']",
"[u'Basketball', u'Homepage', u'Sports', u'Trending', u'Beth Mounier', u'cassie macleod', u'Dalayna Sampton', u'Joleen Yang', u'Mikayla Williams', u'Taylor Tanita', u'UCSD', u\"Women's Basketball\"]",
"[u'Basketball', u'Homepage', u'Sports', u'Trending', u'Adam Klie', u'Azusa Pacific University', u'CCAA', u'Dixie State', u\"Men's Basketball\", u'Tritons', u'UCSD']",
"[u'Homepage', u'Soccer', u'Sports', u'Trending', u'Azusa Pacific', u'Jordyn McNutt', u\"Katie O'Laughlin\", u'Mary Reilly', u'NCAA Division-II', u'UCSD', u\"Women's Soccer\"]",
"[u'Arts & Entertainment', u'Music', u'Slider', u'AS', u'asce', u'Concerts', u'Council', u\"Founder's Day\", u'Hullabaloo', u'Isaiah Rashad', u'Rap', u'Responsible Action Protocol', u'sun god', u'UCSD']"
)), .Names = c("title", "tags"), row.names = 215:220, class = "data.frame")
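It has a title column and a tags column. Because of how the data was processed, the tags column is not a list; it is a string that merely looks like an array.
I have a separate list (a character vector) called sports that holds various sports.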
sports <- c("Basketball", "Soccer", "Baseball")
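I want to create a new column in the original data frame that indicates which sport was detected.
I started with grepl and wrote the following function: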
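# Returns a logical vector: TRUE for each row whose tags contain the sport name
detectSports <- function(sport_item){
  # lower-case both sides so the match is case-insensitive
  sport_in_tag <- grepl(tolower(sport_item), tolower(original_df$tags))
  sport_in_tag
}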
I then applied this function to the sports list:
ss <- lapply(sports, detectSports)
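The result is a list of logical vectors.
I can't work out how to match it back to my original data frame. I believe I can use colnames, but I'm not sure how that would work.
Any suggestions are appreciated!
Thanks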
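Assuming each row matches at most one sport (if a row matches several at once, the sports are comma-separated), you can try the following. A row with no match gets an empty string in the new sports column of original_df: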
original_df$sports <- unlist(apply(t(do.call(rbind, lapply(sports, detectSports))), 1,
function(x) ifelse (any(x), paste(sports[which(x)], collapse=','), '')))
original_df$sports
# [1] "" "" "Basketball" "Basketball" "Soccer" ""
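The same idea can be sketched in a self-contained form. This is only an illustration under stated assumptions: toy_df and its shortened tags are made up for the example (they are not the data from the question), and vapply with fixed = TRUE is swapped in for the lapply/rbind/t combination so that each sport name is matched as a literal string.

```r
# Hypothetical toy data standing in for original_df (tags shortened)
toy_df <- data.frame(
  title = c("Film Review", "Women s Basketball", "Women s Soccer"),
  tags  = c("[u'Film', u'movie']",
            "[u'Basketball', u'Sports']",
            "[u'Soccer', u'Sports', u'Basketball']"),
  stringsAsFactors = FALSE
)
sports <- c("Basketball", "Soccer", "Baseball")

# Build a logical matrix: one row per data frame row, one column per sport.
# fixed = TRUE treats the sport name as a literal string, not a regex.
hits <- vapply(
  sports,
  function(s) grepl(tolower(s), tolower(toy_df$tags), fixed = TRUE),
  logical(nrow(toy_df))
)

# Collapse each row into a comma-separated string ("" when nothing matched)
toy_df$sports <- apply(hits, 1, function(x) paste(sports[x], collapse = ","))
toy_df$sports
```

The third toy row carries both a Soccer and a Basketball tag, so it shows the comma-separated case that the real data above never hits.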
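If you had simply done this (assigning an unnamed three-item list to three newly named columns, with every list element of the correct length), you would already have gotten a useful result: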
original_df[ , sports] <- ss
#examine results
original_df[ , !names(original_df) %in% "tags"]
title Basketball Soccer Baseball
215 Film Review: Almost Christmas FALSE FALSE FALSE
216 Film Review: Mascots FALSE FALSE FALSE
217 Women s Basketball Upstages No. 2 California Baptist TRUE FALSE FALSE
218 Men s Basketball Goes 2-0 In Opening Home Matchups TRUE FALSE FALSE
219 Women s Soccer Wins 16th Consecutive Game, Moves Onto Third Round of Tournament FALSE TRUE FALSE
220 The Hype About Hullabaloo FALSE FALSE FALSE
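For readers who would rather make the column names explicit, here is a minimal sketch of an equivalent route. It assumes made-up toy data (toy_df is hypothetical and not from the question) and names the list before binding it on, which is roughly what the colnames idea in the question was reaching for.

```r
# Hypothetical toy data; only the tags column matters here
toy_df <- data.frame(
  tags = c("[u'Film']", "[u'Basketball', u'UCSD']", "[u'Soccer']"),
  stringsAsFactors = FALSE
)
sports <- c("Basketball", "Soccer", "Baseball")

# One logical vector per sport, as in the question
ss <- lapply(sports, function(s) {
  grepl(tolower(s), tolower(toy_df$tags), fixed = TRUE)
})

# Name the list elements so each one becomes a named column
names(ss) <- sports

# Turn the named list into a data frame and bind it to the original
result <- cbind(toy_df, as.data.frame(ss))
names(result)
```

Either route ends with one logical column per sport; the direct assignment shown above is shorter, while naming the list first keeps the intermediate object self-describing.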