How do I create new column data based on a regex match in R?
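I have some tweet author location data that I want to reclassify by country. For example, given a vector 'states' of US state codes, I want to check each location for a (regex) match and add a "USA" entry to a country column.
Sample data: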
states = c("CA", "OH", "FL", "TX", "MN") # all the states
tweets = data.frame(location = c("my bed", "Minneapolis, MN", "Paris, France", "Los Angeles, CA"))
What I've tried:
# This seems to do the matching part well
filter(str_detect(location, paste(usa_data$Code, collapse = "|")))
# nested for loop
for (i in length(tweets$location)){
  for (state in states){
    if (grepl(state, tweets$location[i])){
      tweets$country[i] = "USA"
      break
    }
  }
}
Desired output (based on the sample input):
tweets$country = c(NA, "USA", NA, "USA")
I'm fairly new to R, so any help is much appreciated.
We can use grepl together with ifelse for a base R solution:
states = c("CA", "OH", "FL", "TX", "MN") # all the states
tweets = data.frame(location = c("my bed", "Minneapolis, MN", "Paris, France", "Los Angeles, CA"))
# Backslashes must be doubled inside an R string; "\b" on its own is a backspace character
regex <- paste0("\\b(?:", paste(states, collapse="|"), ")\\b")
tweets$country <- ifelse(grepl(regex, tweets$location), "USA", NA)
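To illustrate why the pattern is wrapped in word boundaries, here is a small sketch comparing a plain alternation with a word-bounded one. The "Ottawa, CANADA" location is a made-up value that is not in the question's data, the names `loose` and `strict` are only illustrative, and `perl = TRUE` is my own choice to make the boundary handling explicit.

```r
states <- c("CA", "OH", "FL", "TX", "MN")
# "Ottawa, CANADA" is a hypothetical location added for this comparison
location <- c("my bed", "Minneapolis, MN", "Ottawa, CANADA", "Los Angeles, CA")

# Plain alternation: matches a state code anywhere, even inside a longer word
loose <- paste(states, collapse = "|")

# Word-bounded alternation; each backslash is doubled inside the R string
strict <- paste0("\\b(", loose, ")\\b")

loose_hit <- grepl(loose, location)
strict_hit <- grepl(strict, location, perl = TRUE)

# Show both results next to the input
print(data.frame(location, loose_hit, strict_hit))
```

With the plain alternation the "CA" at the start of "CANADA" counts as a match, while the word-bounded pattern only accepts "CA" and "MN" when they stand alone.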
If you prefer a dplyr solution, here is one, although it is very similar to Tim's answer:
library(dplyr)
states <- c("CA", "OH", "FL", "TX", "MN") # all the states
tweets <- tibble(location = c(
  "my bed", "Minneapolis, MN", "Paris, France",
  "Los Angeles, CA"
))
tweets %>%
  mutate(country = if_else(
    stringr::str_detect(
      string = location,
      # "\\b" is a word boundary; a single "\b" would be a backspace character
      pattern = paste0("\\b(?:", paste(states, collapse = "|"), ")\\b")
    ),
    # NA_character_ gives a real missing value rather than the string "NA"
    "United States", NA_character_
  ))
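For completeness, the nested loop from the question can also be made to work. The sketch below is one possible repair, not the asker's final code. It assumes the problem is that `length()` returns a single number, so the loop only ever visits the last row, and that the `country` column has to exist before it is filled row by row. Using `fixed = TRUE` is my own addition.

```r
states <- c("CA", "OH", "FL", "TX", "MN")
tweets <- data.frame(
  location = c("my bed", "Minneapolis, MN", "Paris, France", "Los Angeles, CA")
)

# Create the column up front so every row starts as NA
tweets$country <- NA_character_

# seq_along() yields 1, 2, 3, 4; length() alone would yield only 4
for (i in seq_along(tweets$location)) {
  for (state in states) {
    # fixed = TRUE treats the state code as a literal string, not a regex
    if (grepl(state, tweets$location[i], fixed = TRUE)) {
      tweets$country[i] <- "USA"
      break
    }
  }
}

print(tweets)
```

This is a plain substring test, so a state code that happens to appear inside a longer uppercase word would also count as a match. The vectorised `grepl` and `str_detect` versions are shorter and will usually be faster on large data.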