Can this iteration be written in a tidy functional way

Question

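data has a column named description of type character() and a column id of type integer() that is set by row_number().

data_map has a column named desc_map of type character() and a column id of type integer() that is set by row_number().

Both data and data_map have other columns that are used for further processing after the join.

The idea of the code below is to use data_map$desc_map as the pattern in str_detect to match against data$description. On each match, it adds a row holding data$id and data_map$id to another tibble. The resulting tibble, matches, makes it possible to join data and data_map together.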

library(tidyverse)

data = tribble(
  ~description,
  "19ABB123456",
  "19BCC123456",
  "19CDD123456",
  "19DEE123456",
  "19EFF456789",
  "19FF0056789",
  "19A0A123456",
) %>% mutate(id = row_number())

data_map = tribble(
  ~desc_map,
  "AA",
  "BB",
  "CC",
  "DD",
  "EE",
  "FF",
  "00",
) %>% mutate(id = row_number())

seq_along_rows <- function(.data) {
  seq_len(nrow(.data))
}

matches <- data %>% (function (tbl) {
  m <- tibble(
    row_id = integer(),
    map_id = integer()
  )

  for (i in seq_along_rows(tbl)) {
    row <- tbl[i, ]
    key <- row[["description"]]
    found <- FALSE

    for (j in seq_along_rows(data_map)) {
      map_row <- data_map[j, ]
      pattern <- map_row[["desc_map"]]

      if (str_detect(key, pattern)) {
        m <- add_row(m, row_id = row[["id"]], map_id = map_row[["id"]])
        found <- TRUE
        # allow for finding more than one match
      }
    }

    if (!found) {
      m <- add_row(m, row_id = row[["id"]], map_id = NA)
    }
  }

  return(m)
})

not_unique <- matches %>% 
  group_by(row_id) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  inner_join(data, by = c("row_id" = "id")) %>%
  inner_join(data_map, by = c("map_id" = "id"))

head(not_unique)
#> # A tibble: 2 x 4
#>   row_id map_id description desc_map
#>    <int>  <int> <chr>       <chr>   
#> 1      6      6 19FF0056789 FF      
#> 2      6      7 19FF0056789 00

matches_not_found <- matches %>%
  filter(is.na(map_id)) %>%
  select(-map_id) %>%
  inner_join(data, by = c("row_id" = "id"))

head(matches_not_found)
#> # A tibble: 1 x 2
#>   row_id description
#>    <int> <chr>      
#> 1      7 19A0A123456

matches_found <- matches %>%
  filter(!is.na(map_id)) %>%
  inner_join(data, by = c("row_id" = "id")) %>%
  inner_join(data_map, by = c("map_id" = "id"))

head(matches_found)
#> # A tibble: 6 x 4
#>   row_id map_id description desc_map
#>    <int>  <int> <chr>       <chr>   
#> 1      1      2 19ABB123456 BB      
#> 2      2      3 19BCC123456 CC      
#> 3      3      4 19CDD123456 DD      
#> 4      4      5 19DEE123456 EE      
#> 5      5      6 19EFF456789 FF      
#> 6      6      6 19FF0056789 FF

My question is: can this code be written in a tidier, more functional way? What would that look like? If it cannot be done, what is the reason?

Answer 1

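Update

Based on your updated question, here is an updated version of my answer.

This time I used your input as is and did not create a named function. Instead, I put everything into one pipe. The found column indicates how many patterns matched each row, so you do not need separate objects such as not_unique, matches_not_found and matches_found.

I took the idea of creating a list-column and unnesting it from GenesRus (in the comments on your question), but instead of taking that approach further with spread/pivot_wider, I chose to loop over the description and desc_map columns with map2.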

library(tidyverse)

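data %>% 
  # attach the whole lookup table to every row and expand it (a cross join);
  # data_map's id is renamed first so the two id columns do not clash
  mutate(pattern = list(rename(data_map, map_id = id))) %>% 
  unnest(pattern) %>% 
  rename(row_id = id) %>% 
  # test every description against every pattern
  mutate(v = map2_lgl(description, desc_map,
                  ~ str_detect(.x, .y))) %>% 
  group_by(row_id) %>% 
  # found = number of patterns that matched this row
  mutate(found = sum(v),
         desc_map = ifelse(found == 0, NA, desc_map),
         map_id = ifelse(found == 0, NA, map_id)) %>% 
  # keep the matches, plus one row for every description without a match
  filter(v | found == 0) %>%
  distinct() %>%
  select(-v)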
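
As a rough sketch of how the found column can stand in for the three separate objects from the question, the example below stores the result in a variable called matched (a name introduced here for illustration only). It runs on a condensed three-row copy of the question's data, and it swaps in tidyr::expand_grid() and the vectorised str_detect() for the list-column and map2_lgl() steps, so it is a variation on the pipe above rather than the same code.

```r
library(dplyr)
library(tidyr)
library(stringr)

# condensed copy of the question's data: one row per case
# (single match, two matches, no match)
data <- tibble(description = c("19ABB123456", "19FF0056789", "19A0A123456")) %>%
  mutate(id = row_number())
data_map <- tibble(desc_map = c("BB", "FF", "00")) %>%
  mutate(id = row_number())

# `matched` is a hypothetical name for the result of the pipe
matched <- data %>%
  rename(row_id = id) %>%
  expand_grid(rename(data_map, map_id = id)) %>%  # every row x every pattern
  mutate(v = str_detect(description, desc_map)) %>%
  group_by(row_id) %>%
  mutate(found = sum(v)) %>%                      # matches per description
  ungroup() %>%
  filter(v | found == 0) %>%
  mutate(desc_map = if_else(found == 0, NA_character_, desc_map),
         map_id   = if_else(found == 0, NA_integer_, map_id)) %>%
  select(-v) %>%
  distinct()

# the three objects from the question become simple filters on `found`
not_unique        <- filter(matched, found > 1)   # more than one pattern matched
matches_not_found <- filter(matched, found == 0)  # no pattern matched
matches_found     <- filter(matched, found >= 1)  # at least one pattern matched
```

Whether keeping one table with a found column is preferable to three objects depends on the later processing; the filters are cheap enough to apply wherever a subset is needed.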

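Old answer

Below is a more tidyverse-based approach that should yield the same result. 'Should', because I can only guess what your input data and expected result look like. Some notes: (1) I chose plain character vectors as input. The row IDs are generated on the fly. (2) I put your approach into a function called match_tbl. (3) I use tidyverse functions together with the pipe operator. This makes the whole approach easy to read and gives it a 'tidyverse-ish' look. However, when you look at the actual functions in the tidyverse packages, you will see that the authors usually avoid the pipe operator inside functions, because it is error-prone. Run the RStudio debugger on a pipe operation and try to step deeper into what is happening, and you will see that it gets very messy. So if you want to turn this into a truly stable function, remove the pipes and use intermediate variables instead.

Data and packages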

library(tidyverse)

# some description data (not a dataframe but a normal char vector)
description <- c("This is a text description",
                "Some words that won't match",
                "Some random text goes here",
                "and some more explanation here")

# patterns that we want to find (not a dataframe but a normal char vector)
pattern <- c("explanation","description", "text")

The function that generates the desired output: a match table

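# a function which replaces your nested for loop
match_tbl <- function(.string, .pattern) {

  # for each pattern (.x) and its position (.y), keep the ids of matching strings
  res <- purrr::imap(.pattern,
               ~ stringr::str_detect(.string, .x) %>% 
                     tibble::enframe(name = "row_id") %>%
                     dplyr::mutate(map_id = .y) %>% 
                     dplyr::filter(value) %>% 
                     dplyr::select(-"value"))

  # one row per input string, so strings without a match are kept
  string_tbl <- .string %>% 
             tibble::enframe(name = "row_id") %>% 
             dplyr::select("row_id")

  # start from string_tbl so the result is ordered by row_id;
  # strings without a match get map_id = NA
  string_tbl %>%
    dplyr::left_join(dplyr::bind_rows(res), by = "row_id")

}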

Function call and output

match_tbl(description, pattern)
#>   row_id map_id
#>    <int>  <int>
#> 1      1      2
#> 2      1      3
#> 3      2     NA
#> 4      3      3
#> 5      4      1
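
Following the advice in note (3) about avoiding pipes inside functions, a pipe-free variant with intermediate variables might look like the sketch below. The name match_tbl_nopipe is hypothetical, and the body is one possible rewrite under the same assumptions about the input (plain, unnamed character vectors), not a drop-in replacement that has been tested on the real data.

```r
library(tibble)
library(dplyr)
library(stringr)
library(purrr)

# hypothetical pipe-free variant of match_tbl(); every step has its own
# variable, which makes it easier to inspect in a debugger
match_tbl_nopipe <- function(.string, .pattern) {
  # one tibble of (row_id, map_id) hits per pattern; j is the pattern's position
  hits <- purrr::imap(.pattern, function(p, j) {
    rows <- which(stringr::str_detect(.string, p))
    tibble::tibble(row_id = rows, map_id = rep(j, length(rows)))
  })
  hits_tbl <- dplyr::bind_rows(hits)

  # every input row, so rows without any hit come back with map_id = NA
  all_rows <- tibble::tibble(row_id = seq_along(.string))
  dplyr::left_join(all_rows, hits_tbl, by = "row_id")
}

description <- c("This is a text description",
                 "Some words that won't match",
                 "Some random text goes here",
                 "and some more explanation here")
pattern <- c("explanation", "description", "text")

result <- match_tbl_nopipe(description, pattern)
result
```

With these inputs the result has the same row_id and map_id pairs as the output shown above.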
