Using grepl in R to match strings
I have a data frame "testData" as follows:
id content
1 I came from China
2 I came from America
3 I came from Canada
4 I came from Japan
5 I came from Mars
I also have another data frame "addr" as follows:
id addr
1 America
2 Canada
3 China
4 Japan
So how can I use grepl, sapply, or any other useful function in R to produce data like the following:
id content addr
1 I came from China China
2 I came from America America
3 I came from Canada Canada
4 I came from Japan Japan
5 I came from Mars Mars
It looks like you just want to copy the column and remove "I came from ":
testData$addr <- gsub("I came from ", "", testData$content)
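This should work:
vec <- addr$addr
# For each row, test every address against the row's content column.
testData$addr <- apply(testData, 1, function(u) {
  bool <- sapply(vec, function(x) grepl(x, u[["content"]]))
  # Keep the matching address(es); rows with no match get NA.
  if (any(bool)) vec[bool] else NA
})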
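A variation on the same idea, as a minimal sketch rather than a drop-in replacement: it rebuilds the sample data from the question so it runs on its own, matches literally with fixed = TRUE, and keeps only the first hit per row so the result is always a character vector. The helper name first_match is hypothetical.

```r
# Sample data mirroring the question, rebuilt so the snippet is self-contained.
testData <- data.frame(
  id = 1:5,
  content = c("I came from China", "I came from America", "I came from Canada",
              "I came from Japan", "I came from Mars"),
  stringsAsFactors = FALSE
)
addr <- data.frame(
  id = 1:4,
  addr = c("America", "Canada", "China", "Japan"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first candidate found in `text`, or NA.
first_match <- function(text, candidates) {
  # fixed = TRUE treats each candidate as a literal string, not a regex.
  hit <- vapply(candidates, function(p) grepl(p, text, fixed = TRUE), logical(1))
  if (any(hit)) candidates[which(hit)[1]] else NA_character_
}

testData$addr <- vapply(testData$content, first_match, character(1),
                        candidates = addr$addr, USE.NAMES = FALSE)
print(testData)
```

Since "Mars" is not in the addr data frame from the question, row 5 ends up as NA here. Add it to addr if you want it matched.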
Here is a rough solution using some tidyverse functions:
df1 <- read.table(text = "id content
1 'it is China'
2 'She is in America now'
3 'Canada is over there'
4 'He comes from Japan'
5 'I came from Mars'", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "id addr
1 America
2 Canada
3 China
4 Japan
5 Mars", header = TRUE, stringsAsFactors = FALSE)
library(tidyverse)
# crossing() builds a data frame of every possible content/addr combination.
crossing(df1, df2 %>% select(addr)) %>%
  rowwise() %>%
  # str_detect() is like grepl(), but with the arguments reversed.
  # Keep only the rows where addr appears in content.
  filter(str_detect(content, addr))
# A tibble: 5 x 3
id content addr
<int> <chr> <chr>
1 1 it is China China
2 2 She is in America now America
3 3 Canada is over there Canada
4 4 He comes from Japan Japan
5 5 I came from Mars Mars
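If the cross join gets too large (it creates one row per content/addr pair), a base R sketch along these lines may be enough. It assumes the address values contain no regex metacharacters, since they are pasted into one alternation pattern, and it keeps only the first address found in each string. The data frames are rebuilt here so the snippet runs on its own.

```r
# Same sample data as above, built directly so the snippet is self-contained.
df1 <- data.frame(
  id = 1:5,
  content = c("it is China", "She is in America now", "Canada is over there",
              "He comes from Japan", "I came from Mars"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = 1:5,
  addr = c("America", "Canada", "China", "Japan", "Mars"),
  stringsAsFactors = FALSE
)

# Assumption: addr values are plain words, so "|" can join them into one regex.
pattern <- paste(df2$addr, collapse = "|")

# regexpr() gives the position of the first match per string (-1 if none).
m <- regexpr(pattern, df1$content)

# regmatches() returns the matched text only for the strings that matched,
# so fill a vector of NAs at the matching positions.
df1$addr <- NA_character_
df1$addr[m > 0] <- regmatches(df1$content, m)
print(df1)
```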