Return up to the first three words

Question

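I'm trying to find a way to return the first three words of a string in R. I tried the word function from stringr, but it only returns the first three words if the sentence has at least three words. For example: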


sentences <- c("Jane saw a cat", "Jane sat down", "Jane sat", "Jane")

word(sentences, 1, 3)

This returns Jane saw a, Jane sat down, NA, NA

I want to return up to the first three words, even if the sentence has only one or two words. So the output I'm looking for is:

Jane saw a, Jane sat down, Jane sat, Jane

Answer 1

We can split each string on spaces and paste back at most the first three words

sapply(strsplit(sentences, " "), \(x) paste(head(x, 3), collapse=" "))

-output

[1] "Jane saw a"    "Jane sat down" "Jane sat"      "Jane"

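Or use a regular expression. The whitespace after each word has to be optional (\\s*), otherwise the last word of a sentence never matches and is dropped

trimws(sub("^((\\w+\\s*){1,3}).*", "\\1", sentences))

-output

[1] "Jane saw a"    "Jane sat down" "Jane sat"      "Jane"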

If we want to use word, we may need coalesce to fall back to fewer words

library(stringr)
library(purrr)
library(dplyr)
map(3:1,  word, string = sentences, start = 1) %>%
    exec(coalesce, !!!.)

-output

[1] "Jane saw a"    "Jane sat down" "Jane sat"      "Jane"
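
To make the coalesce idea a bit more concrete, here is a minimal unrolled sketch of what the map/exec pipeline above does, assuming stringr and dplyr are installed. The names first3, first2, first1 and result are only illustrative.

```r
library(stringr)
library(dplyr)

sentences <- c("Jane saw a cat", "Jane sat down", "Jane sat", "Jane")

# word() gives NA whenever a sentence has fewer words than `end`
first3 <- word(sentences, 1, 3)  # NA for the two short sentences
first2 <- word(sentences, 1, 2)  # NA only for the one-word sentence
first1 <- word(sentences, 1, 1)  # never NA for this input

# coalesce() picks, element by element, the first non-NA value,
# so the longest available prefix wins
result <- coalesce(first3, first2, first1)
result
```

Going from 3 down to 1 matters: coalesce keeps the first non-NA candidate, so the longest prefix has to come first.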

Answer 2

1) stringr: Count the number of words in each component of the input and use that count or 3, whichever is smaller, as the number of words to return.

library(stringr)
word(sentences, end = pmin(str_count(sentences, "\\w+"), 3))
## [1] "Jane saw a"    "Jane sat down" "Jane sat"      "Jane"
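
As a rough illustration of (1), the one-liner can be split into its intermediate steps. This is only a sketch; n_words, end_idx and first_words are names made up for the example.

```r
library(stringr)

sentences <- c("Jane saw a cat", "Jane sat down", "Jane sat", "Jane")

# Number of words in each sentence
n_words <- str_count(sentences, "\\w+")   # 4 3 2 1

# Cap at 3 so that `end` never points past the last word
end_idx <- pmin(n_words, 3)               # 3 3 2 1

# `end` is vectorised, so each sentence gets its own cut-off
first_words <- word(sentences, start = 1, end = end_idx)
first_words
```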

2) stringr, solution 2: Append some dummy words to the end, take the first 3 words, and trim off any dummy words that remain.

sentences %>%
  str_c("@ @ @") %>%
  word(end = 3) %>%
  str_replace(" *@.*", "")
## [1] "Jane saw a"    "Jane sat down" "Jane sat"      "Jane"
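
It may help to look at the intermediate values of the padding approach. The sketch below just unrolls the pipeline; padded, cut and cleaned are illustrative names. Note that str_c joins without a separator, so the first "@" ends up glued to the last real word, which is why the pattern allows zero spaces before it.

```r
library(stringr)

sentences <- c("Jane saw a cat", "Jane sat down", "Jane sat", "Jane")

# Every sentence now has at least three space-separated words
padded <- str_c(sentences, "@ @ @")
# "Jane saw a cat@ @ @" "Jane sat down@ @ @" "Jane sat@ @ @" "Jane@ @ @"

# word() can no longer return NA
cut <- word(padded, end = 3)
# "Jane saw a" "Jane sat down@" "Jane sat@ @" "Jane@ @ @"

# Drop everything from the first "@" on, plus any spaces before it
cleaned <- str_replace(cut, " *@.*", "")
cleaned
```

One caveat with this trick: a sentence that already contains "@" would be cut at that point, so a different dummy character would be needed for such input.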

3a) Base R: The same idea as (1) can be translated to base R like this:

Word <- function(x, end) do.call("paste", read.table(text = x, fill = TRUE)[1:end])

unname(Vectorize(Word)(sentences, end = pmin(lengths(strsplit(sentences, " ")), 3)))
## [1] "Jane saw a"    "Jane sat down" "Jane sat"      "Jane"

3b) Base R: The same idea as (2) can be translated to base R like this. Word is from (3a).

sentences |>
  paste("@ @ @") |>
  Word(end = 3) |>
  sub(pattern = " *@.*", replacement = "")
## [1] "Jane saw a"    "Jane sat down" "Jane sat"      "Jane"

Update

Simplified (1); the old (1) is now (2). (3a) and (3b) are now the base R counterparts.
