How to create a new variable that takes the value 1 if the string in another column contains a word, regardless of punctuation and letter case?
I have a column that looks like this:
col1
"business"
"BusinesS"
"education"
"some BUSINESS ."
"business of someone, that is cool"
" not the b word"
"busi ness"
"busines."
"businesses"
"something else"
I need an efficient way to convert all of this string data into new values:
col1 col2
NA 1
NA 1
"education" NA
NA 1
NA 1
" not the b word" NA
NA 1
NA 1
NA 1
"something else" NA
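So the common factor is "busines", but I don't know how to efficiently handle all the spaces, punctuation, lower/uppercase letters, other words, etc. in one go when creating the new column.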
library(dplyr)
library(stringr)

df %>%
  # flag rows whose col1 contains any form of "business"; all other rows get NA
  mutate(col2 = ifelse(str_detect(col1, "(?i)busi\\s?ness?"),
                       1,
                       NA))
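If str_detect detects any form of business, ifelse sets col2 to 1; otherwise it sets NA. Note that (?i) makes the match case-insensitive and ? makes the preceding item optional, so \s? (written \\s? inside an R string) matches an optional whitespace character and s? matches an optional literal s.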
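The desired output in the question also replaces the matched col1 values with NA, which the pipeline above does not do. A minimal base R sketch of both steps follows; it assumes the data sits in a data frame named df with a character column col1, and it uses grepl with ignore.case = TRUE in place of str_detect with (?i), so it needs no extra packages.

```r
# Sample data from the question; the data frame name `df` is an assumption
df <- data.frame(
  col1 = c("business", "BusinesS", "education", "some BUSINESS .",
           "business of someone, that is cool", " not the b word",
           "busi ness", "busines.", "businesses", "something else"),
  stringsAsFactors = FALSE
)

# TRUE where col1 contains any form of "business", ignoring case
hit <- grepl("busi\\s?ness?", df$col1, ignore.case = TRUE, perl = TRUE)

df$col2 <- ifelse(hit, 1, NA)  # 1 for matches, NA otherwise
df$col1[hit] <- NA             # blank out the matched strings

df
```

Keeping the match in a separate logical vector (hit here, a name chosen for illustration) lets both columns be derived from a single pass over the strings.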
You can use gsub to remove all non-word characters and then use grepl to detect busines. The leading + converts the logical result to 0/1:
+grepl("busines", gsub("\\W+", "", s), ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
Another way is to use agrepl for approximate string matching, where 1L gives the maximum distance to the given pattern:
+agrepl("busines", s, 1L, ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
If you are looking for business rather than busines, agrepl could also be a solution:
+agrepl("business", gsub("\\W+", "", s), 1L, ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
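The two ideas above tolerate noise differently: a maximum distance of 1 allows a single edit, while stripping non-word characters first removes any number of stray spaces or punctuation marks. The sketch below compares them on two made-up inputs (s2 is hypothetical and not part of the question's data), one with a single stray space and one with two.

```r
# Hypothetical inputs: one stray space and two stray spaces
s2 <- c("busi ness", "bu si ness")

# Approximate match on the raw strings, allowing at most one edit
one_edit <- agrepl("busines", s2, max.distance = 1L, ignore.case = TRUE)

# Exact match after removing every non-word character
stripped <- grepl("busines", gsub("\\W+", "", s2), ignore.case = TRUE)

data.frame(s2, one_edit, stripped)
```

Under these assumptions, the second string needs two insertions to reach the pattern, so it should only be caught once the spaces are stripped.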
Data:
s <- c("business","BusinesS","education","some BUSINESS .",
"business of someone, that is cool"," not the b word",
"busi ness","busines." ,"businesses","something else")