Grepl and Extract the Match in R
In R I have:
library(tidyverse)
full_names <- tibble(FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE", "TESLA INC.", "ABBOTT LABORATORIES"),
                     TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
                     ID = c(111, 222, 333, 444, 555)) # a dataset with full names of firms, including some IDs
abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Pepsi", "Microsoft", "Tesla") # a vector with abbreviated names of firms
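I want to check whether the abbreviated names appear in the full-names dataset and, if they do, match the corresponding full_names rows to the abbr_names vector, for example: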
     [1]        [2]                    [3]   [4]
[1]  Abbott     ABBOTT LABORATORIES    ABT   555
[2]  Apple      APPLE INC.             AAPL  111
[3]  Microsoft  MICROSOFT CORPORATION  MSFT  222
[4]  Tesla      TESLA INC.             TSLA  444
I tried several str_extract and grepl calls, but I still can't get it to work.
matches <- unlist(sapply(toupper(abbr_names), grep, x = full_names$FIRM, value = TRUE))
This gives you a vector with the abbreviations as names and the firms as values:
names(matches)
# [1] "ABBOTT" "APPLE" "MICROSOFT" "TESLA"
c(matches, use.names = FALSE)
# [1] "ABBOTT LABORATORIES" "APPLE INC." "MICROSOFT CORPORATION" "TESLA INC."
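To see why only the matched firms survive, it can help to look at the intermediate list that sapply() builds before unlist() is applied. The following is a minimal base-R sketch; firms and abbr are plain-vector copies of the question's data, and hits is a name made up for this illustration.

```r
firms <- c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE",
           "TESLA INC.", "ABBOTT LABORATORIES")
abbr <- c("Abbott", "Apple", "Coca-Cola", "Pepsi", "Microsoft", "Tesla")

# One grep() call per upper-cased abbreviation;
# value = TRUE returns the matching firm names instead of their positions
hits <- sapply(toupper(abbr), grep, x = firms, value = TRUE)

# Abbreviations without a match give character(0), so the result
# cannot be simplified and stays a named list
lengths(hits)

# unlist() drops the empty elements and keeps the list names
matches <- unlist(hits)
matches
```

Because the empty elements disappear in unlist(), the unmatched abbreviations (here Coca-Cola and Pepsi) never reach the final vector.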
There are several ways to piece these together...
Following @Oscar's comment, two lines of code in total give the desired output:
matches <- unlist(sapply(toupper(abbr_names), grep, x = full_names$FIRM, value = TRUE))
tibble(ABBR_FIRM = names(matches), FIRM = matches) %>% left_join(., full_names, by = "FIRM")
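For readers who want to see what the join contributes, the same lookup can be sketched in base R with match(), which returns the row of full_names for each matched firm name. This is only an illustration under the assumption that firm names are unique; it uses a plain data.frame instead of a tibble, and rows and result are names invented here.

```r
full_names <- data.frame(
  FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE",
           "TESLA INC.", "ABBOTT LABORATORIES"),
  TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
  ID = c(111, 222, 333, 444, 555),
  stringsAsFactors = FALSE
)
abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Pepsi", "Microsoft", "Tesla")

matches <- unlist(sapply(toupper(abbr_names), grep,
                         x = full_names$FIRM, value = TRUE))

# Position of each matched firm name within full_names$FIRM
rows <- match(matches, full_names$FIRM)

# Put the abbreviation next to the corresponding full_names row
result <- data.frame(ABBR_FIRM = names(matches), full_names[rows, ])
rownames(result) <- NULL
result
```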
How about this?
full_names$row_num <- seq_len(nrow(full_names))
do.call(rbind,
        lapply(abbr_names, function(x) {
          # rows of full_names whose FIRM contains the abbreviation (case-insensitive)
          row <- which(grepl(x, full_names$FIRM, ignore.case = TRUE))
          # use row 0 as a placeholder when nothing matches
          if (length(row) == 0) row <- 0L
          data.frame(name = x, row_num = row)
        })) %>%
  right_join(full_names, by = "row_num")
Another option might be, for example, this:
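map_int(abbr_names, ~ {
  idx <- grep(., full_names$FIRM, ignore.case = TRUE)
  # map_int() expects an integer, so return NA_integer_ when nothing matches
  if (length(idx) == 0) NA_integer_ else idx
}) %>%
  cbind(ABBR = abbr_names, FIRM = full_names$FIRM[.]) %>%
  as_tibble() %>%    # as.tibble() is deprecated in favour of as_tibble()
  drop_na(FIRM) %>%  # drop abbreviations that matched no firm
  left_join(full_names, by = "FIRM") %>%
  complete(FIRM)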
# A tibble: 4 x 5
  FIRM                  .     ABBR      TICKER    ID
  <chr>                 <chr> <chr>     <chr>  <dbl>
1 ABBOTT LABORATORIES   5     Abbott    ABT      555
2 APPLE INC.            1     Apple     AAPL     111
3 MICROSOFT CORPORATION 2     Microsoft MSFT     222
4 TESLA INC.            4     Tesla     TSLA     444
Just wanted to post this as well :)
My suggestion is to convert all the words to the same case, either upper or lower. That makes them easier to compare with grepl.
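As a quick illustration of why the case matters: grepl is case-sensitive by default, so a mixed-case abbreviation only matches the upper-case firm names once both sides share a case, or when ignore.case = TRUE is set. A small base-R sketch (the variable names are made up for this example):

```r
firms <- c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE")

# Case-sensitive by default: "Apple" is not found in "APPLE INC."
as_is <- grepl("Apple", firms)

# Upper-case the pattern so it matches the upper-case data
upper <- grepl(toupper("Apple"), firms)

# Or lower-case both sides
lower <- grepl(tolower("Apple"), tolower(firms))

# Or let grepl ignore case
ignore <- grepl("Apple", firms, ignore.case = TRUE)

rbind(as_is, upper, lower, ignore)
```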
My code:
library(tidyverse)
full_names <- tibble(FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE", "TESLA INC.", "ABBOTT LABORATORIES"),
                     TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
                     ID = c(111, 222, 333, 444, 555)) # a dataset with full names of firms, including some IDs
abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Microsoft", "Tesla") # a vector with abbreviated names of firms
Here I create a new column that will hold the matches found by grepl:
full_names$new_column <- NA
Then I loop over the names we want to look up in the data frame:
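for (i in seq_along(abbr_names)) {
  # compare the first four characters of the abbreviation, in lower case, with the firm names
  search_test <- grepl(tolower(substr(abbr_names[i], 1, 4)), tolower(full_names$FIRM))
  # rows in which the abbreviation was found
  position <- which(search_test)
  full_names$new_column[position] <- abbr_names[i]
}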
The result is the following data frame:
                   FIRM TICKER  ID new_column
1            APPLE INC.   AAPL 111      Apple
2 MICROSOFT CORPORATION   MSFT 222  Microsoft
3                GOOGLE  GOOGL 333         NA
4            TESLA INC.   TSLA 444      Tesla
5   ABBOTT LABORATORIES    ABT 555     Abbott
"GOOG" is not in the abbr_names vector, so the result for that row is NA.
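The same four-character comparison can also be written without an explicit loop. The sketch below is one possible base-R variant, not the answer's own code. It assumes each firm should get the first abbreviation that matches (the loop above keeps the last one), treats the four characters as a literal string via fixed = TRUE, and uses a helper name, match_abbr, invented for this illustration.

```r
firms <- c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE",
           "TESLA INC.", "ABBOTT LABORATORIES")
abbr <- c("Abbott", "Apple", "Coca-Cola", "Microsoft", "Tesla")

# Hypothetical helper: return the first abbreviation whose first four
# characters occur in the firm name (case-insensitive), or NA if none do
match_abbr <- function(firm, abbr) {
  keys <- tolower(substr(abbr, 1, 4))
  hit <- vapply(keys, grepl, logical(1), x = tolower(firm), fixed = TRUE)
  if (any(hit)) abbr[which(hit)[1]] else NA_character_
}

# One lookup per firm; the result lines up with the rows of the data frame
new_column <- vapply(firms, match_abbr, character(1),
                     abbr = abbr, USE.NAMES = FALSE)
new_column
```

Comparing only four characters keeps the check short, but it can produce false matches when two firm names share the same prefix, so the full abbreviation may be the safer key on larger data.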