Extract words from string in R
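I am trying to extract pieces of a string and create new variables from the patterns that match. I have tried many of the functions in the "strings" package, but cannot seem to get the result. The example below is made-up data. I would like to take each string, extract the pieces, and store them in new columns of a new data frame.
Example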
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)","Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)","Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)","Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)","The Remains (2016) 1080p BlurayHorror (openload.co)" ,"Suicide Squad (2016) HDAction (openload.co)")
>ex
[1] "The Accountant (2016)Crime (vodmovies112.blogspot.com.es)"
[2] "Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)"
[3] "Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)"
[4] "Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)"
[5] "The Remains (2016) 1080p BlurayHorror (openload.co)"
[6] "Suicide Squad (2016) HDAction (openload.co)"
genres <- c("Action","Adventure","Animation","Biography",
"Comedy","Crime","Documentary","Drama","Family",
"Fantasy","Film-Noir","History","Horror","Music",
"Musical","Mystery","Romance","Sci-Fi","Sport","Thriller",
"War","Western")
genres <- paste0("^",genres,"|")
genres[22] <- "^Western"
> genres
[1] "^Action|" "^Adventure|" "^Animation|" "^Biography|"
[5] "^Comedy|" "^Crime|" "^Documentary|" "^Drama|"
[9] "^Family|" "^Fantasy|" "^Film-Noir|" "^History|"
[13] "^Horror|" "^Music|" "^Musical|" "^Mystery|"
[17] "^Romance|" "^Sci-Fi|" "^Sport|" "^Thriller|"
[21] "^War|" "^Western"
What I am trying to get:
> df
title year domain genre
1 The Accountant 2016 vodmovies112.blogspot.com.es Crime
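Here is one possibility:
# Note: genres must hold the plain genre names (see the data below), not the "^...|" patterns
# Split each string at every "(" or ")"; the parentheses must be escaped as \\( and \\) in an R string
temp <- strsplit(ex, "\\(|\\)")
# Take the i-th piece of every string as column i
df <- setNames(as.data.frame(lapply(1:4,function(i) sapply(temp,"[",i)), stringsAsFactors = FALSE), c("title", "year", "genre", "domain"))
df <- df[ , c("title", "year", "domain", "genre")]
# For each row, find the indices of the genres that occur in the raw genre text
correct <- sapply(seq_along(df$genre), function(y) which(lengths(sapply(seq_along(genres), function(x) grep(genres[x], df$genre[y])))>0))
# Paste the matching genre names together, separated by a space
correct <- lapply(correct, function(x) paste0(genres[x], collapse = " "))
df$genre <- unlist(correct)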
df
# title year domain genre
# 1 The Accountant 2016 vodmovies112.blogspot.com.es Crime
# 2 Miss Peregrine's Home for Peculiar Children 2016 vodmovies112.blogspot.com.es Fantasy Sci-Fi
# 3 Fantastic Beasts And Where To Find Them 2016 openload.co Adventure
# 4 Ben-Hur 2016 vodmovies112.blogspot.com.es Action Adventure
# 5 The Remains 2016 openload.co Horror
# 6 Suicide Squad 2016 openload.co Action
Basically, we split the vector ex into 4 parts, using the parentheses as delimiters. Then we create the data.frame df with 4 columns.
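To make the splitting step concrete, here is a minimal sketch applied to a single string from the example. The call to trimws is an addition that is not part of the answer's code; it is assumed here only to strip the spaces left around each piece.

```r
# One string from the example data
x <- "The Accountant (2016)Crime (vodmovies112.blogspot.com.es)"

# Split at every "(" or ")"; the pattern is a regex, so both are escaped
pieces <- strsplit(x, "\\(|\\)")[[1]]

# Assumption: remove the leading/trailing spaces left over from the split
pieces <- trimws(pieces)

print(pieces)
```

The four pieces come out in the order title, year, genre text, domain, which is why the columns are named in that order before being rearranged.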
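The hardest part is extracting the genres correctly (since each movie may have more than one genre). I used a combination of sapply, lapply and grep to do this. Once that is done, we "correct" the genre column.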
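As an alternative to the nested sapply/grep calls, the genres could perhaps also be pulled out with a single alternation pattern. The sketch below is not the answer's method; it assumes the raw genre text has already been split out (the hypothetical vector raw_genre mimics the third piece of each string) and that genres is the plain vector of names. Longer names are placed first so that "Musical" is tried before "Music".

```r
# Hypothetical input: the raw genre text as left by the split step
raw_genre <- c("Crime ", "FantasySci-Fi ", " TSAdventure ",
               " HDActionAdventure ", " 1080p BlurayHorror ", " HDAction ")

genres <- c("Action", "Adventure", "Animation", "Biography", "Comedy",
            "Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir",
            "History", "Horror", "Music", "Musical", "Mystery", "Romance",
            "Sci-Fi", "Sport", "Thriller", "War", "Western")

# Longest names first, so a longer genre wins over a shorter one it starts with
by_length <- genres[order(nchar(genres), decreasing = TRUE)]
pattern <- paste(by_length, collapse = "|")

# gregexpr finds every genre occurrence; regmatches returns the matched text
hits <- regmatches(raw_genre, gregexpr(pattern, raw_genre, perl = TRUE))

# Collapse the matches of each row into one space-separated string
genre_clean <- vapply(hits, paste, character(1), collapse = " ")
print(genre_clean)
```

Unlike the "^Action|" patterns from the question, this pattern has no anchor, so it finds genre names anywhere in the text, including after prefixes such as "HD" or "TS".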
Here is your data:
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)",
"Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)",
"Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)",
"Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)",
"The Remains (2016) 1080p BlurayHorror (openload.co)", "Suicide Squad (2016) HDAction (openload.co)"
)
genres <- c("Action", "Adventure", "Animation", "Biography", "Comedy",
"Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir",
"History", "Horror", "Music", "Musical", "Mystery", "Romance",
"Sci-Fi", "Sport", "Thriller", "War", "Western")
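Another possibility, using the tidyverse:
library(tidyverse)
# One capture group per new column: title, year, raw genre text, domain
# (backslashes are doubled because the regex is written as an R string)
tibble(x = ex) %>%
  extract(
    x,
    c("title", "year", "genre", "domain"),
    "(^[^(]+)\\s+\\((\\d{4})\\)\\s*([^(]+)\\s+\\(([^)]+)"
  )
## title year genre domain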
## * <chr> <chr> <chr> <chr>
## 1 The Accountant 2016 Crime vodmovies112.blogspot.com.es
## 2 Miss Peregrine's Home for Peculiar Children 2016 FantasySci-Fi vodmovies112.blogspot.com.es
## 3 Fantastic Beasts And Where To Find Them 2016 TSAdventure openload.co
## 4 Ben-Hur 2016 HDActionAdventure vodmovies112.blogspot.com.es
## 5 The Remains 2016 1080p BlurayHorror openload.co
## 6 Suicide Squad 2016 HDAction openload.co
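For readers without the tidyverse, roughly the same capture-group approach can be sketched in base R with regexec and regmatches. This is an illustration under assumptions, not part of the original answers: the pattern is the one above with a closing parenthesis added, and the year is converted to an integer.

```r
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)",
        "Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)",
        "Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)",
        "Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)",
        "The Remains (2016) 1080p BlurayHorror (openload.co)",
        "Suicide Squad (2016) HDAction (openload.co)")

# Four capture groups: title, year, raw genre text, domain
pattern <- "^([^(]+)\\s+\\((\\d{4})\\)\\s*([^(]+)\\s+\\(([^)]+)\\)"

# regexec returns the full match plus one entry per capture group
parts <- regmatches(ex, regexec(pattern, ex, perl = TRUE))

# Stack the per-string results into a matrix; column 1 is the full match
mat <- do.call(rbind, parts)

df <- data.frame(title  = mat[, 2],
                 year   = as.integer(mat[, 3]),
                 genre  = mat[, 4],
                 domain = mat[, 5],
                 stringsAsFactors = FALSE)
print(df)
```

As with the tidyverse version, the genre column still contains prefixes such as "HD" or "1080p Bluray", so it would need a further cleaning step against the list of genres.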