How to choose the longest ngram in a repetitive string in R?

I have a dataset similar to the one below (only with many more rows):

x = c("abov level", "abov level consist", "abov level consist price", 
"abov level consist price stabil", "abov level consist price stabil protract", 
"abov level consist price stabil protract period", "abov level consist price stabil protract period time", 
"abov level consist price stabil sinc", "abov level consist price stabil sinc last", 
"abov level consist price stabil sinc last autumn", "abov level consist price stabil some", 
"abov level consist price stabil some time", "abov over", "abov over come", 
"abov over come month", "abov precis", "abov precis level", "abov precis level depend", 
"abov precis level depend futur", "abov precis level depend futur energi", 
"abov precis level depend futur energi price", "abov precis level depend futur energi price develop"
)

 [1] "abov level"                                          
 [2] "abov level consist"                                  
 [3] "abov level consist price"                            
 [4] "abov level consist price stabil"                     
 [5] "abov level consist price stabil protract"            
 [6] "abov level consist price stabil protract period"     
 [7] "abov level consist price stabil protract period time"
 [8] "abov level consist price stabil sinc"                
 [9] "abov level consist price stabil sinc last"           
[10] "abov level consist price stabil sinc last autumn"    
[11] "abov level consist price stabil some"                
[12] "abov level consist price stabil some time"           
[13] "abov over"                                           
[14] "abov over come"                                      
[15] "abov over come month"                                
[16] "abov precis"                                         
[17] "abov precis level"                                   
[18] "abov precis level depend"                            
[19] "abov precis level depend futur"                      
[20] "abov precis level depend futur energi"               
[21] "abov precis level depend futur energi price"         
[22] "abov precis level depend futur energi price develop"

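As you can see, there is a clear pattern: one word at a time is added to the previous ngram, until the base changes and the process starts over. Take the first "block" as an example: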

 [1] "abov level"                                          
 [2] "abov level consist"                                  
 [3] "abov level consist price"                            
 [4] "abov level consist price stabil"                     
 [5] "abov level consist price stabil protract"            
 [6] "abov level consist price stabil protract period"     
 [7] "abov level consist price stabil protract period time"

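For each "block" like the one above, I want to keep only the longest sentence/ngram. In the case above, I would keep only line 7. Doing this for every block, I would get: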

    
 [1] "abov level consist price stabil protract period time"           
 [2] "abov level consist price stabil sinc last autumn"    
 [3] "abov level consist price stabil some time"                                              
 [4] "abov over come month"                                      
 [5] "abov precis level depend futur energi price develop"

Can someone help me do this?

Thanks!

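You can count the number of characters in each string and select every string that is followed by a shorter one. The last string always ends a block, so its index is appended explicitly.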

inds <- c(which(diff(nchar(x)) < 0), length(x))
x[inds]

#[1] "abov level consist price stabil protract period time"
#[2] "abov level consist price stabil sinc last autumn"    
#[3] "abov level consist price stabil some time"           
#[4] "abov over come month"                                
#[5] "abov precis level depend futur energi price develop" 
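To see why this works, it may help to break the one-liner into its intermediate steps. The sketch below uses a shortened vector (an assumption for illustration, taken from the question's data) and the hypothetical names n and d for the intermediate results:

```r
# Shortened sample: the end of one block and the start of the next
x <- c("abov over", "abov over come", "abov over come month",
       "abov precis", "abov precis level")

n <- nchar(x)   # number of characters per string: 9 14 20 11 17
d <- diff(n)    # change in length to the next string: 5 6 -9 6

# A negative difference means the next string is shorter, so a block ends here.
# The last string has no successor, so its index is added by hand.
inds <- c(which(d < 0), length(x))
result <- x[inds]
print(result)
```

Here the only negative difference is at position 3, so the result contains "abov over come month" and the final string "abov precis level".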

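We can use filter together with lead from dplyr:

library(dplyr)
# Keep a row when the next string is not longer than the current one;
# default = last(x) makes the difference 0 for the final row, so it is kept too
tibble(x) %>%
  filter((nchar(lead(x, default = last(x))) - nchar(x)) <= 0)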
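Both approaches above compare string lengths only, so they assume that each new block starts with a string shorter than the one before it. If that is not guaranteed in the full dataset, a possible alternative is to check directly whether the next string extends the current one. This is only a sketch: keep_longest is a hypothetical helper name, and it assumes words are separated by a single space.

```r
# Hypothetical helper: drop every string that the following string extends
keep_longest <- function(x) {
  n <- length(x)
  if (n < 2) return(x)
  cur <- x[-n]   # each string except the last
  nxt <- x[-1]   # the string that follows it
  # TRUE where the next string is the current one plus at least one more word
  extended <- startsWith(nxt, paste0(cur, " "))
  # Keep strings that are not extended; the last string is always kept
  x[c(!extended, TRUE)]
}

# The second block starts with a string longer than the end of the first block,
# a case where comparing lengths alone would miss the block boundary
sample_x <- c("a b", "a b c", "x y z w v")
res <- keep_longest(sample_x)
print(res)
```

On this sample the helper keeps "a b c" and "x y z w v". startsWith is part of base R (3.3.0 and later), so no extra package is needed.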