Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R

Goal: Change a column of NAs in one dataframe based on a "key" in another dataframe (something like VLOOKUP, except in R).

Given df1 here (for simplicity I only show 6 rows; my actual key has 50 rows, one for each of the 50 states):

Index State_Name Abbreviation
1 California CA
2 Maryland MD
3 New York NY
4 Texas TX
5 Virginia VA
6 Washington WA

And given df2 here (this is only a sample; the real dataframe I am working with has many more rows):

Index State Article
1 NA Texas governor, Abbott, signs new abortion bill
2 NA Effort to recall California governor Newsome loses steam
3 NA New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4 NA Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5 NA DC statehood unlikely as Manchin opposes
6 NA Amazon HQ2 causing housing prices to soar in northern Virginia

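Task: Create an R function that loops through and reads the state in each df2$Article row, then cross-references it against df1$State_Name so that the NAs in df2$State are replaced with the corresponding df1$Abbreviation key, based on the state named in df2$Article. I know it's quite a mouthful. I'm stumped on how to even start, let alone finish, this puzzle. Hard-coding is not an option, because my real table has thousands of rows like this and will keep growing as we add more articles to text-scrape.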

The output should look like this:

Index State Article
1 TX Texas governor, Abbott, signs new abortion bill
2 CA Effort to recall California governor Newsome loses steam
3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5 NA DC statehood unlikely as Manchin opposes
6 VA Amazon HQ2 causing housing prices to soar in northern Virginia

Note: the 5th entry, the one about DC, stays NA.

Any links to guides, and/or any advice on how to code this, would be much appreciated. Thanks!

You can create a regex pattern from State_Name and use str_extract to extract it from Article. Then use match against df1 to get the corresponding Abbreviation.

library(stringr)

df2$State <- df1$Abbreviation[match(str_extract(df2$Article, 
               str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA   "VA"

You can also use the built-in state.name and state.abb vectors instead of df1 to get the state names and abbreviations.
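As a rough sketch of the built-in route, the key can be assembled from state.name and state.abb and used the same way as df1. This version sticks to base R (regexpr and regmatches) so it runs without stringr. The names key, articles, found and state are only illustrative, and the second headline is made up to show a two-word state name; it assumes each article mentions at most one state.

```r
# Build the lookup key from R's built-in vectors instead of a hand-made df1.
key <- data.frame(State_Name = state.name, Abbreviation = state.abb,
                  stringsAsFactors = FALSE)

# Example headlines; the second one is hypothetical, added for illustration.
articles <- c(
  "Texas governor, Abbott, signs new abortion bill",
  "West Virginia senator Manchin opposes DC statehood",
  "DC statehood unlikely as Manchin opposes"
)

# One alternation pattern over all state names; \\b keeps matches on whole words.
pattern <- paste0("\\b(", paste(key$State_Name, collapse = "|"), ")\\b")

# regexpr() finds the first match per article and returns -1 when there is none.
m <- regexpr(pattern, articles, perl = TRUE)
found <- rep(NA_character_, length(articles))
found[m > 0] <- regmatches(articles, m)

# Translate the matched names into abbreviations; unmatched rows stay NA.
state <- key$Abbreviation[match(found, key$State_Name)]
state
#[1] "TX" "WV" NA
```

Because the regex engine takes the leftmost match, "West Virginia" is picked up as a whole rather than as "Virginia".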


Here is a way to do it in a for loop:

# For each state, find the articles that mention it and fill in its abbreviation
for(i in seq_len(nrow(df1))) {
  inds <- grep(df1$State_Name[i], df2$Article)
  if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2

#  Index State                                                                      Article
#1     1    TX                              Texas governor, Abbott, signs new abortion bill
#2     2    CA                     Effort to recall California governor Newsome loses steam
#3     3    NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4     4    MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5     5  <NA>                                     DC statehood unlikely as Manchin opposes
#6     6    VA               Amazon HQ2 causing housing prices to soar in northern Virginia
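
If some rows of State are already filled in and only the NAs should change, the loop can be restricted to the missing rows. This is a minimal sketch, not part of the loop above: the two-row df1, the three-row df2 and the "XX" value (a placeholder for a State that was set earlier) are assumptions for illustration. It also passes fixed = TRUE so the state name is treated as a literal string rather than a regex.

```r
# Small stand-ins for df1 and df2; "XX" is a hypothetical pre-existing value.
df1 <- data.frame(
  State_Name = c("Texas", "Virginia"),
  Abbreviation = c("TX", "VA"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  State = c(NA, NA, "XX"),
  Article = c(
    "Texas governor, Abbott, signs new abortion bill",
    "DC statehood unlikely as Manchin opposes",
    "Amazon HQ2 causing housing prices to soar in northern Virginia"
  ),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(df1))) {
  # fixed = TRUE matches the state name literally, not as a regex.
  hit <- grepl(df1$State_Name[i], df2$Article, fixed = TRUE)
  # Only touch rows whose State is still missing.
  fill <- hit & is.na(df2$State)
  df2$State[fill] <- df1$Abbreviation[i]
}
df2$State
#[1] "TX" NA   "XX"
```

The third row mentions Virginia but keeps "XX", since it was not NA to begin with.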

Not as concise as the above, but a Base R approach:

# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
  # Coerce 0 length vectors to na values of the appropriate type: 
  # .zero_to_nas => function()
  .zero_to_nas <- function(x){
    if(identical(x, character(0))){
      NA_character_ 
    }else if(identical(x, integer(0))){
      NA_integer_
    }else if(identical(x, numeric(0))){
      NA_real_
    }else if(identical(x, complex(0))){
      NA_complex_
    }else if(identical(x, logical(0))){
      NA
    }else{
      x
    }
  }
  # Unlist cleaned list: res => vector
  res <- unlist(lapply(lst, .zero_to_nas))
  # Explicitly define return object: vector => GlobalEnv()
  return(res)
}

# Classify each article as belonging to the appropriate state: 
# clean_df => data.frame
clean_df <- transform(
  df2,
  State = df1$Abbreviation[
    match(
      list_2_vec(
        regmatches(
          Article, 
          gregexpr(
            paste0(df1$State_Name, collapse = "|"), Article
          )
        )
      ),
      df1$State_Name
    )
  ]
)
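
One thing to watch for: gregexpr returns every match, so an article that names two states would give more values than there are rows once the list is unlisted. A hedged alternative, assuming the first state mentioned is the one you want, is to take the first match per article with vapply. The names state_names, articles, all_matches and first_match are illustrative, and the third headline is made up to show the two-state case.

```r
state_names <- c("California", "Maryland", "New York",
                 "Texas", "Virginia", "Washington")

# The third headline is hypothetical; it mentions two states.
articles <- c(
  "Texas governor, Abbott, signs new abortion bill",
  "DC statehood unlikely as Manchin opposes",
  "California and Texas governors trade criticism"
)

pattern <- paste(state_names, collapse = "|")

# A list with one character vector per article: all state names found in it.
all_matches <- regmatches(articles, gregexpr(pattern, articles))
lengths(all_matches)
#[1] 1 0 2

# Keep exactly one value per article: the first match, or NA if there is none.
first_match <- vapply(
  all_matches,
  function(x) if (length(x) > 0) x[1] else NA_character_,
  character(1)
)
first_match
#[1] "Texas"      NA           "California"
```

Since first_match always has one element per article, it can be passed to match against df1$State_Name without the lengths getting out of step.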

# Data: 
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland", 
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA", 
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA), 
Article = c("Texas governor, Abbott, signs new abortion bill", 
"Effort to recall California governor Newsome loses steam", 
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data", 
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions", 
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))