How to extract the longest match?

Question

Consider this simple example:

library(stringr)
library(dplyr)

# tibble() replaces the deprecated data_frame()
dataframe <- tibble(text = c('how is the biggest ??',
                             'really amazing stuff'))

# A tibble: 2 x 1
  text                 
  <chr>                
1 how is the biggest ??
2 really amazing stuff

I need to extract terms that match a regex, but only the longest term in each string.

So far, I have only managed with str_extract.

It extracts the first match, which is not necessarily the longest one:

> dataframe %>% mutate(mymatch = str_extract(text, regex('\\w+')))
# A tibble: 2 x 2
  text                  mymatch
  <chr>                 <chr>  
1 how is the biggest ?? how    
2 really amazing stuff  really

I tried using str_extract_all, but I cannot find a syntax that works. The output should be:

# A tibble: 2 x 2
  text                  mymatch
  <chr>                 <chr>  
1 how is the biggest ?? biggest
2 really amazing stuff  amazing

Any ideas? Thanks!

Answer 1

You can do something like this:

library(stringr)
library(dplyr)

dataframe %>%
  mutate(mymatch = sapply(str_extract_all(text, '\\w+'), 
                          function(x) x[nchar(x) == max(nchar(x))][1]))

With purrr:

library(purrr)

dataframe %>%
  mutate(mymatch = map_chr(str_extract_all(text, '\\w+'), 
                           ~ .[nchar(.) == max(nchar(.))][1]))

Result:

# A tibble: 2 x 2
                   text mymatch
                  <chr>   <chr>
1 how is the biggest ?? biggest
2  really amazing stuff amazing

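Note:

If there is a tie, the first of the tied matches is taken.

Data (the second row contains a tie: 'amazing' and 'biggest' both have 7 characters):

dataframe <- tibble(text = c('how is the biggest ??',
                             'really amazing biggest stuff'))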
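
To make the tie-breaking rule concrete, here is a minimal sketch on the second row of the data above. The variable names `words` and `lens` are illustrative only and not part of the original answer.

```r
library(stringr)

# All words in the string that contains a tie
words <- str_extract_all("really amazing biggest stuff", "\\w+")[[1]]
lens  <- nchar(words)

# Every word that shares the maximum length (both tied words)
words[lens == max(lens)]

# First of the tied words, as in the answer above
words[lens == max(lens)][1]

# which.max() also returns the position of the first maximum
words[which.max(lens)]

# Variation (an assumption, not from the answer): keep the last tied word
tail(words[lens == max(lens)], 1)
```

Swapping `[1]` for `tail(..., 1)` is the only change needed if the last of the tied matches is preferred.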

Answer 2

A simple approach is to split the process into 2 steps: first build a list of all the words in each row, then find and return the longest word in each sub-list:

library(dplyr)
library(stringr)

df <- tibble(text = c('how is the biggest ??',
                      'really amazing stuff'))

# create a list of all words per row
splits <- str_extract_all(df$text, '\\w+', simplify = FALSE)
# find the longest word in each row and return it
sapply(splits, function(x) {x[which.max(nchar(x))]})
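
One edge case this two-step approach does not cover is a string with no match at all: `which.max()` on an empty vector returns `integer(0)`, so that element comes back empty and `sapply()` falls back to returning a list. A hedged sketch of one way to handle it follows; the helper name `longest_match_na` and the choice of `NA` for "no match" are assumptions, not part of the answer.

```r
library(stringr)

# Hypothetical helper: longest match per string, NA when nothing matches
longest_match_na <- function(x, pattern) {
  matches <- str_extract_all(x, pattern)
  vapply(matches, function(m) {
    if (length(m) == 0) {
      NA_character_               # no match in this string
    } else {
      m[which.max(nchar(m))]      # first of the longest matches
    }
  }, character(1))
}

res <- longest_match_na(c("how is the biggest ??", "?? !!"), "\\w+")
res
```

`vapply()` with `character(1)` is used so the result is always a character vector with one element per input string.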

Answer 3

Or, using purrr...

library(dplyr)
library(purrr)
library(stringr)

dataframe %>% mutate(mymatch = map_chr(str_extract_all(text, "\\w+"),
                                       ~ .[which.max(nchar(.))]))

# A tibble: 2 x 2
  text                  mymatch
  <chr>                 <chr>  
1 how is the biggest ?? biggest
2 really amazing stuff  amazing

Answer 4

As a variation on the other answers, I would suggest writing a function that performs the operation:

longest_match <- function(x, pattern) {
    matches <- str_match_all(x, pattern)
    purrr::map_chr(matches, ~ .[which.max(nchar(.))])
}

Then use it:

dataframe %>%
    mutate(mymatch = longest_match(text, "\\w+"))

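By way of commentary, it seems like better practice to separate what the new function longest_match() does from what mutate() does. For instance, the function is easy to test, can be used in other circumstances, and can be modified ('return the last rather than first longest match') independently of the data transformation step. There is no real value in putting everything on one line, so it makes sense to write lines of code that each logically accomplish one thing: find all matches, map from all matches to the longest, and so on. purrr::map_chr() is better than sapply() because it is more robust: it guarantees that the result is a character vector, so that something like

> df1 = dataframe[FALSE,]
> df1 %>% mutate(mymatch = longest_match(text, "\\w+"))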
# A tibble: 0 x 2
# ... with 2 variables: text <chr>, mymatch <chr>

'does the right thing', i.e., mymatch is a character vector (sapply() would return a list in this case).
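
The robustness point can be checked directly on zero-length input. In this minimal sketch, `empty` stands in for the list of matches from a zero-row column, and `pick_longest` is an illustrative name; base R's `vapply()` is shown next to `purrr::map_chr()` as an alternative that gives the same type guarantee.

```r
library(purrr)

# Illustrative helper: first of the longest elements
pick_longest <- function(x) x[which.max(nchar(x))]

# Stand-in for the matches extracted from a zero-row column
empty <- list()

class(sapply(empty, pick_longest))                 # a list, not a character vector
class(map_chr(empty, pick_longest))                # character
class(vapply(empty, pick_longest, character(1)))   # character
```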
