R: Extract last N words from character column in data.table

Question

I'd appreciate some help extracting the last N words from a column in a data.table and assigning them to a new column.

 library(data.table)

 test <- data.table(original = c('the green shirt totally brings out your eyes'
                               , 'ford focus hatchback'))

The original data.table looks like this:

original
1: the green shirt totally brings out your eyes
2: ford focus hatchback

I'd like to subset (up to) the last 5 words into a new column, so that the output looks like this:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         ford focus hatchback

I have tried:

  test <- test[, extracted := paste0(tail(strsplit(original, ' ')[[1]], 5)
                                   , collapse = ' ')]

It almost works, except that the first value in the 'extracted' column is repeated throughout the new column:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         totally brings out your eyes

I can't for the life of me figure this out. I tried the 'word' function from 'stringr', which gave me the last word, but I can't seem to count backwards.

Any help is appreciated!

Answer 1

Base R solution:

test[, extracted := sapply(strsplit(original, '\\s+'), function(v) paste(tail(v, 5L), collapse = ' '))]
##                                        original                    extracted
## 1: the green shirt totally brings out your eyes totally brings out your eyes
## 2:                         ford focus hatchback         ford focus hatchback
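For anyone wondering why the attempt in the question repeated the first row: strsplit returns a list with one element per row, and [[1]] keeps only the first of them, so the right-hand side is a single string that := recycles down the column. A minimal sketch in base R, using the two strings from the question (the variable names first_only and per_row are just for illustration):

```r
# Base R only; the two strings are the rows from the question.
original <- c('the green shirt totally brings out your eyes',
              'ford focus hatchback')

# strsplit returns a list with one character vector per input string.
words <- strsplit(original, '\\s+')

# The question's attempt: [[1]] keeps only the first row's words, so the
# result is a single string, which := would recycle over every row.
first_only <- paste0(tail(words[[1]], 5), collapse = ' ')

# Applying tail() to each list element instead gives one result per row.
per_row <- sapply(words, function(v) paste(tail(v, 5L), collapse = ' '))
```

Here first_only has length 1, while per_row has one element for each input row, which is what the new column needs.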

Answer 2

I would probably use

n = 5
patt = sprintf("\\w+( \\w+){0,%d}$", n-1)

library(stringi)
test[, ext := stri_extract(original, regex = patt)]

                                       original                          ext
1: the green shirt totally brings out your eyes totally brings out your eyes
2:                         ford focus hatchback         ford focus hatchback

Comments:

This breaks if you set n=0, but there is probably no good reason to do that.
It is vectorized, in case n differs by row (e.g., n=3:4).
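To illustrate the two comments above, here is a rough sketch with a different n per row. It assumes stringi is installed and calls stri_extract_first_regex directly; the per-row values in n and the stopifnot guard against n < 1 are my own additions, not part of the answer.

```r
library(stringi)

original <- c('the green shirt totally brings out your eyes',
              'ford focus hatchback')

# Hypothetical per-row word counts: 5 for the first row, 2 for the second.
n <- c(5L, 2L)

# n = 0 would produce the invalid quantifier {0,-1}, so guard against it.
stopifnot(all(n >= 1L))

# sprintf is vectorized, so this builds one pattern per row.
patt <- sprintf('\\w+( \\w+){0,%d}$', n - 1L)

# stringi pairs each string with its own pattern.
ext <- stri_extract_first_regex(original, patt)
```

With these inputs, the first row keeps its last five words and the second keeps only its last two.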

@eddi offered a base analogue (for fixed n):

test[, ext := sub('.*?(\\w+( \\w+){4})$', '\\1', original)]
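If n needs to be a parameter rather than hard-coded, the same base approach can be wrapped in a small helper. This is only a sketch: last_n_words is a made-up name, and perl = TRUE is my assumption to keep the lazy quantifier's behaviour predictable. Strings with fewer than n words do not match the pattern, so sub returns them unchanged.

```r
# Hypothetical helper: keep the last n words of each string (n >= 1).
last_n_words <- function(x, n) {
  stopifnot(length(n) == 1L, n >= 1L)
  # One word followed by exactly n - 1 more, anchored at the end.
  patt <- sprintf('.*?(\\w+( \\w+){%d})$', as.integer(n) - 1L)
  # Strings with fewer than n words do not match and come back as-is.
  sub(patt, '\\1', x, perl = TRUE)
}

original <- c('the green shirt totally brings out your eyes',
              'ford focus hatchback')

last5 <- last_n_words(original, 5)
last2 <- last_n_words(original, 2)
```

With a data.table, this could be used as test[, ext := last_n_words(original, 5)].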
