How to remove all words except specified words from a data frame column in R
I have a data frame of Twitter bios formatted like the table below.
account | bio |
---|---|
38374 | i love candy as much as life itself proud liberal |
45673 | can all just get along |
94928 | conserv christian mom and proud pro trump veteran maga |
11204 | professor of women and gender studies at wesleyan university blacklivesmatter |
37465 | former ohio state football coach now a proud papa to seven grandchildren |
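Many Stack Overflow questions ask how to remove a specified list of words from a data frame column. I want the opposite: remove every word in the bio column unless it appears in a predetermined list of words. The list of words to keep contains 1052 words (an excerpt is shown below).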
> termstokeep
[1] love life follow live just like music regist trademark
[10] make fan one copyright lover thing world time god
[19] can get design peopl artist girl univers writer will
[28] student work busi good new know friend famili best
[37] day account market sport art game manag want book
[46] enthusiast person alway travel never free real help dream
[55] servic mom husband profession beauti offici wife now news
[64] social food come father heart educ develop need anim
[73] everyth proud tri year happi also media way man
[82] team produc look state take back support director home
[91] find call engin learn provid photograph great author video
[100] guy communiti coach name big passion see teacher school
[109] product sinc gamer enjoy keep player better let believ
[118] mother think mind dog futur give colleg say owner
[127] jesus fun got littl chang founder boy use first
[136] liberal write footbal kid fuck event polit consult care
[145] conserv much health technolog tech opinion stay everi right
[154] full former member special well young high creat snap
[163] entrepreneur movi feel view compani coffe cat citi human
[172] digit show singer sometim interest dad watch scienc creativ
[181] blogger base addict fit read bless fashion part noth
[190] run forev editor born hard die around onlin nerd
[199] class web musician made stuff leader ever inspir still
[208] christian place current public danc pleas geek talk film
[217] realli babi someth page rock lot women lead two
Ideally, after removing all words that are not in the list, the data frame would look like this:
account | bio |
---|---|
38374 | love life proud liberal |
45673 | |
94928 | conserv christian mom proud pro trump veteran maga |
11204 | professor women gender university blacklivesmatter |
37465 | ohio state football coach proud grandchildren |
How can I do this?
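Here is a way using base R's gregexpr and regmatches.
# Wrap each term in word-boundary anchors so that only whole words match
pattern <- paste0("\\<", termstokeep, "\\>")
# Combine all terms into a single alternation pattern
pattern <- paste(pattern, collapse = "|")
# Locate every match in each bio, then extract the matched words
m <- gregexpr(pattern, df1$bio)
r <- regmatches(df1$bio, m)
# Paste the kept words back together, one string per row
df1$bio_clean <- sapply(r, paste, collapse = " ")
Created on 2022-02-22 by the reprex package (v2.0.1)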
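To see why the word-boundary anchors matter, here is a minimal sketch of the same idea on made-up data. The vectors `keep` and `bios` are hypothetical and not part of the question; `\<` and `\>` are the start-of-word and end-of-word anchors in R's default regular expressions.

```r
# Hypothetical toy data, chosen so that some terms appear inside longer words
keep <- c("art", "can", "proud")
bios <- c("proud artist at a party", "can you candy", "nothing here")

# Wrap each term in word-boundary anchors, then join into one alternation
pattern <- paste(paste0("\\<", keep, "\\>"), collapse = "|")

# gregexpr() locates every match; regmatches() extracts the matched text
kept <- regmatches(bios, gregexpr(pattern, bios))

# Collapse the kept words into one string per bio
# ("art" inside "artist" or "party" and "can" inside "candy" are not matched)
bios_clean <- vapply(kept, paste, character(1), collapse = " ")
```

Because every term is pasted into the pattern as a regular expression, this approach assumes the terms contain no regex metacharacters, which holds for the plain lowercase words in `termstokeep`.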
Data
termstokeep <-
c("love", "life", "follow", "live", "just", "like", "music",
"regist", "trademark", "make", "fan", "one", "copyright", "lover",
"thing", "world", "time", "god", "can", "get", "design", "peopl",
"artist", "girl", "univers", "writer", "will", "student", "work",
"busi", "good", "new", "know", "friend", "famili", "best", "day",
"account", "market", "sport", "art", "game", "manag", "want",
"book", "enthusiast", "person", "alway", "travel", "never", "free",
"real", "help", "dream", "servic", "mom", "husband", "profession",
"beauti", "offici", "wife", "now", "news", "social", "food",
"come", "father", "heart", "educ", "develop", "need", "anim",
"everyth", "proud", "tri", "year", "happi", "also", "media",
"way", "man", "team", "produc", "look", "state", "take", "back",
"support", "director", "home", "find", "call", "engin", "learn",
"provid", "photograph", "great", "author", "video", "guy", "communiti",
"coach", "name", "big", "passion", "see", "teacher", "school",
"product", "sinc", "gamer", "enjoy", "keep", "player", "better",
"let", "believ", "mother", "think", "mind", "dog", "futur", "give",
"colleg", "say", "owner", "jesus", "fun", "got", "littl", "chang",
"founder", "boy", "use", "first", "liberal", "write", "footbal",
"kid", "fuck", "event", "polit", "consult", "care", "conserv",
"much", "health", "technolog", "tech", "opinion", "stay", "everi",
"right", "full", "former", "member", "special", "well", "young",
"high", "creat", "snap", "entrepreneur", "movi", "feel", "view",
"compani", "coffe", "cat", "citi", "human", "digit", "show",
"singer", "sometim", "interest", "dad", "watch", "scienc", "creativ",
"blogger", "base", "addict", "fit", "read", "bless", "fashion",
"part", "noth", "run", "forev", "editor", "born", "hard", "die",
"around", "onlin", "nerd", "class", "web", "musician", "made",
"stuff", "leader", "ever", "inspir", "still", "christian", "place",
"current", "public", "danc", "pleas", "geek", "talk", "film",
"realli", "babi", "someth", "page", "rock", "lot", "women", "lead",
"two")
df1 <- read.table(text = "
account bio
38374 'i love candy as much as life itself proud liberal'
45673 'can all just get along'
94928 'conserv christian mom and proud pro trump veteran maga'
11204 'professor of women and gender studies at wesleyan university blacklivesmatter'
37465 'former ohio state football coach now a proud papa to seven grandchildren'
", header = TRUE)
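A possible solution, based on the tidyverse:
library(tidyverse)
# Split each bio on whitespace and keep only the words found in termstokeep
df %>%
  rowwise %>%
  mutate(bio = str_split(bio, "\\s") %>% unlist %>% intersect(termstokeep) %>%
    str_c(collapse = " ")) %>%
  ungroup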
#> # A tibble: 5 x 2
#> account bio
#> <int> <chr>
#> 1 38374 love much life proud liberal
#> 2 45673 can just get
#> 3 94928 conserv christian mom proud
#> 4 11204 women
#> 5 37465 former state coach now proud
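One detail worth checking: `intersect()` returns unique values, so a word that occurs twice in a bio is kept only once. If repeats should survive, filtering with `%in%` is a possible alternative. A small base R sketch with hypothetical inputs (`keep` and `words_in_bio` are illustration names, not from the answers):

```r
# Hypothetical keep-list and a bio that repeats a kept word
keep <- c("love", "proud")
words_in_bio <- strsplit("love love proud cats", "\\s+")[[1]]

# intersect() treats its inputs as sets, so the repeated "love" collapses
via_intersect <- intersect(words_in_bio, keep)

# Logical indexing with %in% keeps every occurrence, in the original order
via_in <- words_in_bio[words_in_bio %in% keep]
```

Here `via_intersect` holds "love" and "proud", while `via_in` holds "love", "love" and "proud". Which behaviour is preferable depends on whether word counts matter downstream.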
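Here is another base R option:
# Split each bio on whitespace, keep the words found in termstokeep, and re-join them
df$bio <- sapply(lapply(strsplit(df$bio, "\\s"), intersect, termstokeep),
                 paste, collapse = " ")
Output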
account bio
1 38374 love much life proud liberal
2 45673 can just get
3 94928 conserv christian mom proud
4 11204 women
5 37465 former state coach now proud
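The one-liner above overwrites `bio` in place. If the original text should be preserved, the same split, filter and re-join steps could be wrapped in a small helper and written to a new column. This is only a sketch: `keep_terms`, `df_demo` and `terms_demo` are hypothetical names, and the keep-list is a shortened stand-in for `termstokeep`.

```r
# Hypothetical helper: keep only the words of each string that occur in `terms`
keep_terms <- function(x, terms) {
  # Trim the ends and split on runs of whitespace to avoid empty tokens
  tokens <- strsplit(trimws(x), "\\s+")
  vapply(tokens,
         function(w) paste(w[w %in% terms], collapse = " "),
         character(1))
}

# Two rows from the question, with a shortened stand-in keep-list
df_demo <- data.frame(
  account = c(38374L, 45673L),
  bio = c("i love candy as much as life itself proud liberal",
          "can all just get along"),
  stringsAsFactors = FALSE
)
terms_demo <- c("love", "life", "much", "proud", "liberal", "can", "just", "get")

# Write the result to a new column so the original bios stay available
df_demo$bio_clean <- keep_terms(df_demo$bio, terms_demo)
```

Using `vapply()` with `character(1)` guarantees one string per row, including an empty string when no word in a bio is on the list.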
Data (thanks to @RuiBarradas!)
df <- structure(list(account = c(38374L, 45673L, 94928L, 11204L, 37465L
), bio = c("i love candy as much as life itself proud liberal",
"can all just get along", "conserv christian mom and proud pro trump veteran maga",
"professor of women and gender studies at wesleyan university blacklivesmatter",
"former ohio state football coach now a proud papa to seven grandchildren"
)), class = "data.frame", row.names = c(NA, -5L))
termstokeep <- c("love", "life", "follow", "live", "just", "like", "music",
"regist", "trademark", "make", "fan", "one", "copyright", "lover",
"thing", "world", "time", "god", "can", "get", "design", "peopl",
"artist", "girl", "univers", "writer", "will", "student", "work",
"busi", "good", "new", "know", "friend", "famili", "best", "day",
"account", "market", "sport", "art", "game", "manag", "want",
"book", "enthusiast", "person", "alway", "travel", "never", "free",
"real", "help", "dream", "servic", "mom", "husband", "profession",
"beauti", "offici", "wife", "now", "news", "social", "food",
"come", "father", "heart", "educ", "develop", "need", "anim",
"everyth", "proud", "tri", "year", "happi", "also", "media",
"way", "man", "team", "produc", "look", "state", "take", "back",
"support", "director", "home", "find", "call", "engin", "learn",
"provid", "photograph", "great", "author", "video", "guy", "communiti",
"coach", "name", "big", "passion", "see", "teacher", "school",
"product", "sinc", "gamer", "enjoy", "keep", "player", "better",
"let", "believ", "mother", "think", "mind", "dog", "futur", "give",
"colleg", "say", "owner", "jesus", "fun", "got", "littl", "chang",
"founder", "boy", "use", "first", "liberal", "write", "footbal",
"kid", "fuck", "event", "polit", "consult", "care", "conserv",
"much", "health", "technolog", "tech", "opinion", "stay", "everi",
"right", "full", "former", "member", "special", "well", "young",
"high", "creat", "snap", "entrepreneur", "movi", "feel", "view",
"compani", "coffe", "cat", "citi", "human", "digit", "show",
"singer", "sometim", "interest", "dad", "watch", "scienc", "creativ",
"blogger", "base", "addict", "fit", "read", "bless", "fashion",
"part", "noth", "run", "forev", "editor", "born", "hard", "die",
"around", "onlin", "nerd", "class", "web", "musician", "made",
"stuff", "leader", "ever", "inspir", "still", "christian", "place",
"current", "public", "danc", "pleas", "geek", "talk", "film",
"realli", "babi", "someth", "page", "rock", "lot", "women", "lead",
"two")