使用模式列表对新字段进行编码

Using lists of patterns to code a new field

我想使用表达式列表对新字段进行编码。

在我的数据框中,Bisaccategory1 包含图书类别的完整描述。表示此字段中部分值的特定字符串可用于定义名为 "Genre" 的新字段。一种特定的流派是 "nonfiction",它映射到 25 个独特的完整描述。我可以通过指定其中包含的某些模式来识别这些完整描述:

  nonfiction<-c("BIOGRAPHY & AUTOBIOGRAPHY","BODY, MIND & SPIRIT","BUSINESS & ECONOMICS","COMICS & GRAPHIC NOVELS",
                  "COMPUTERS","COOKING","FAMILY & RELATIONSHIPS","HEALTH & FITNESS","HISTORY","HOUSE & HOME","HUMOR",
                  "LITERARY CRITICISM","NATURE","PERFORMING 
ARTS","PETS","PHOTOGRAPHY","POETRY","POLITICAL SCIENCE","RELIGION",
                      "SCIENCE","SELF-HELP","SOCIAL SCIENCE","SPORTS & RECREATION","TRANSPORTATION","TRUE CRIME")

然后我可以匹配这些字符串来完成 Biscategory1 值,如下所示:

matches <- unique (grep(paste(nonfiction,collapse="|"), 
                                detail$Bisaccategory1, value=TRUE))

但我不清楚如何使用这些 "matches" 将值 "nonfiction" 分配给我的新流派字段。

这是示例数据:

structure(list(Author = c("James Swallow", "Billy Crystal", "Mark Divine", 
"Charles Cumming", "Victoria Schwab", "Louise Penny", "Elizabeth Warren", 
"Linda Castillo", "Paul Fischer", "Sandy Hall", "Louise Penny", 
"Louise Penny", "Lisa Scottoline", "Linda Castillo", "Evan Osnos", 
"Porter Erisman"), Title = c("24: Deadline", "700 Sundays - Still Foolin' 'Em", 
"8 Weeks to Sealfit", "A Colder War", "A Dark Shade of Magic", 
"A Fatal Grace", "A Fighting Chance", "A Hidden Secret", "A Kim Jong-Il Production", 
"A Little Something Different", "A Rule Against Murder", "A Trick of the Light", 
"Accused", "After the Storm", "Age of Ambition", "Alibaba's World"
), Bisac = c("FICTION / Thrillers / General", "BIOGRAPHY & AUTOBIOGRAPHY / Entertainment & Performing Arts", 
"HEALTH & FITNESS / Exercise", "FICTION / Thrillers / Espionage", 
"FICTION / Fantasy / Historical", "FICTION / Mystery & Detective / Traditional", 
"BIOGRAPHY & AUTOBIOGRAPHY / Political", "FICTION / Mystery & Detective / Police Procedural", 
"HISTORY / Asia / Korea", "JUVENILE FICTION / Love & Romance", 
"FICTION / Mystery & Detective / Traditional", "FICTION / Mystery & Detective / Traditional", 
"FICTION / Thrillers / Legal", "FICTION / Mystery & Detective / Police Procedural", 
"HISTORY / Asia / China", "BUSINESS & ECONOMICS / E-Commerce / General"
)), .Names = c("Author", "Title", "Bisac"), class = "data.frame", row.names = c(NA, 
-16L))

我知道我可以做类似的事情:

df$Genre[Bisaccategory1=="BODY, MIND & SPIRIT / Inspiration & Personal Growth"]<-"nonfiction"

但我有数百个类别,这并不是真正可扩展的。如果有任何建议,我将不胜感激。

而不是 grep 函数 grepl 将 return 一个进行匹配的逻辑索引。您可以使用它来对 Genre 列进行子集化。我将非 "non-fiction" 的条目分配给小说,但您可以随意制作它们。

matches <- grepl(paste(nonfiction,collapse="|"), detail$Bisac)
detail$Genre <- "fiction"
detail$Genre[matches] <- "non-fiction"
# Bisac       Genre
# 1                                FICTION / Thrillers / General     fiction
# 2  BIOGRAPHY & AUTOBIOGRAPHY / Entertainment & Performing Arts non-fiction
# 3                                  HEALTH & FITNESS / Exercise non-fiction
# 4                              FICTION / Thrillers / Espionage     fiction
# 5                               FICTION / Fantasy / Historical     fiction
# 6                  FICTION / Mystery & Detective / Traditional     fiction
# 7                        BIOGRAPHY & AUTOBIOGRAPHY / Political non-fiction
library(dplyr)
library(tidyr)
library(stringi)

non_fiction_books = 
  detail %>%
  mutate(Bisac = Bisac %>% stri_split_fixed(" / ") ) %>%
  unnest(Bisac) %>%
  mutate(Bisac = Bisac %>% stri_trans_toupper) %>%
  right_join(data_frame(Bisac = non_fiction) ) %>%
  select(-Bisac) %>%
  distinct