Extract unique strings from a factor string variable

I have a variable containing actor names.

(actor=structure(c(4L, 1L, 6L, 2L, 5L, 3L), .Label = c("Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman", 
"Jamie Foxx, Christoph Waltz, Leonardo DiCaprio, Kerry Washington", 
"Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Stanley Tucci", 
"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, Ken Watanabe", 
"Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley, Max von Sydow", 
"Robert Downey Jr., Chris Evans, Scarlett Johansson, Jeremy Renner"
), class = "factor"))
# [1] Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, Ken Watanabe
# [2] Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman            
# [3] Robert Downey Jr., Chris Evans, Scarlett Johansson, Jeremy Renner
# [4] Jamie Foxx, Christoph Waltz, Leonardo DiCaprio, Kerry Washington 
# [5] Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley, Max von Sydow     
# [6] Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Stanley Tucci
# 6 Levels: Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman ...

I want to extract all the full actor names (first name + last name) from it and have them as columns in an output matrix.

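If you want to extract the unique actor names, you can use as.character to convert the factor to character strings, strsplit to split each string on the comma separator, unlist to combine the resulting vectors into a single vector, and unique to drop the duplicates: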

# Get the unique names
(all.actors <- unique(unlist(strsplit(as.character(actor), ", "))))
#  [1] "Leonardo DiCaprio"    "Joseph Gordon-Levitt" "Ellen Page"           "Ken Watanabe"        
#  [5] "Christian Bale"       "Tom Hardy"            "Anne Hathaway"        "Gary Oldman"         
#  [9] "Robert Downey Jr."    "Chris Evans"          "Scarlett Johansson"   "Jeremy Renner"       
# [13] "Jamie Foxx"           "Christoph Waltz"      "Kerry Washington"     "Mark Ruffalo"        
# [17] "Ben Kingsley"         "Max von Sydow"        "Jennifer Lawrence"    "Josh Hutcherson"     
# [21] "Liam Hemsworth"       "Stanley Tucci"    
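
To see what each step of that one-liner contributes, here is a minimal sketch that runs the same chain one call at a time. The two-element factor and the intermediate variable names (x, chars, pieces, flat, uniq) are made up for illustration and are not part of the original data.

```r
# Small stand-in factor (hypothetical data): two elements sharing one actor
x <- factor(c("Leonardo DiCaprio, Ellen Page",
              "Jamie Foxx, Leonardo DiCaprio"))

chars  <- as.character(x)        # factor -> character vector, one string per element
pieces <- strsplit(chars, ", ")  # list holding one character vector per element
flat   <- unlist(pieces)         # single character vector, duplicates still present
uniq   <- unique(flat)           # keeps the first occurrence of each name, in order

print(pieces)
print(uniq)
```

The as.character step matters because strsplit only accepts character input; passing the factor directly stops with a "non-character argument" error.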

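Because it uses as.character(actor), this code only picks up actors that actually appear in the factor actor, even if the factor has additional, unused levels. If you used levels(actor) instead, you would get every actor in the factor levels, whether or not that level is used in actor. You can use whichever you prefer when defining all.actors.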
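
The difference only shows up when the factor carries unused levels, which the example data does not. A small sketch under that assumption, using a made-up three-element factor (f) and a subset of it (sub):

```r
# Hypothetical factor with three levels; the names are borrowed from the question
f <- factor(c("Christian Bale, Tom Hardy",
              "Jamie Foxx, Christoph Waltz",
              "Ellen Page, Ken Watanabe"))

# Subsetting keeps all three levels, so one level is now unused
sub <- f[1:2]

used.only  <- unique(unlist(strsplit(as.character(sub), ", ")))
all.levels <- unique(unlist(strsplit(levels(sub), ", ")))

print(used.only)                       # actors present in the data
print(setdiff(all.levels, used.only))  # actors that exist only in an unused level
```

Note that levels() returns the levels in their stored (by default sorted) order, so the names may come out in a different order than with as.character(). If the unused levels are not wanted at all, droplevels(sub) removes them and both approaches then return the same set of names.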

If you want a matrix indicating which actors are included in each element of actor, you could do:

mat <- sapply(strsplit(as.character(actor), ", "), function(x) all.actors %in% x)
row.names(mat) <- all.actors
mat
#                       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
# Leonardo DiCaprio     TRUE FALSE FALSE  TRUE  TRUE FALSE
# Joseph Gordon-Levitt  TRUE FALSE FALSE FALSE FALSE FALSE
# Ellen Page            TRUE FALSE FALSE FALSE FALSE FALSE
# Ken Watanabe          TRUE FALSE FALSE FALSE FALSE FALSE
# Christian Bale       FALSE  TRUE FALSE FALSE FALSE FALSE
# Tom Hardy            FALSE  TRUE FALSE FALSE FALSE FALSE
# Anne Hathaway        FALSE  TRUE FALSE FALSE FALSE FALSE
# Gary Oldman          FALSE  TRUE FALSE FALSE FALSE FALSE
# Robert Downey Jr.    FALSE FALSE  TRUE FALSE FALSE FALSE
# Chris Evans          FALSE FALSE  TRUE FALSE FALSE FALSE
# Scarlett Johansson   FALSE FALSE  TRUE FALSE FALSE FALSE
# Jeremy Renner        FALSE FALSE  TRUE FALSE FALSE FALSE
# Jamie Foxx           FALSE FALSE FALSE  TRUE FALSE FALSE
# Christoph Waltz      FALSE FALSE FALSE  TRUE FALSE FALSE
# Kerry Washington     FALSE FALSE FALSE  TRUE FALSE FALSE
# Mark Ruffalo         FALSE FALSE FALSE FALSE  TRUE FALSE
# Ben Kingsley         FALSE FALSE FALSE FALSE  TRUE FALSE
# Max von Sydow        FALSE FALSE FALSE FALSE  TRUE FALSE
# Jennifer Lawrence    FALSE FALSE FALSE FALSE FALSE  TRUE
# Josh Hutcherson      FALSE FALSE FALSE FALSE FALSE  TRUE
# Liam Hemsworth       FALSE FALSE FALSE FALSE FALSE  TRUE
# Stanley Tucci        FALSE FALSE FALSE FALSE FALSE  TRUE
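
The matrix above has one row per actor, while the question asks for the actors as columns. One possible way to get that orientation is to transpose the result with t(). The sketch below uses a made-up two-element factor (actor.small) to keep the output short, and swaps sapply for vapply so the return type is stated explicitly; that swap is a stylistic choice, not something the answer above requires.

```r
# Hypothetical two-element stand-in for actor
actor.small <- factor(c("Leonardo DiCaprio, Ellen Page",
                        "Jamie Foxx, Leonardo DiCaprio"))

parts      <- strsplit(as.character(actor.small), ", ")
all.actors <- unique(unlist(parts))

# One logical vector of length(all.actors) per element of actor.small
mat <- vapply(parts, function(x) all.actors %in% x,
              logical(length(all.actors)))
rownames(mat) <- all.actors

# Transpose: rows are elements of actor.small, columns are actor names
by.row <- t(mat)
print(by.row)

# Number of elements each actor appears in
print(colSums(by.row))
```

With this layout, by.row[i, ] shows the cast of the i-th element, and colSums counts how many elements each actor appears in.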