List of vectors in R - extract an element of the vectors

Question

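I have a list of texts, so each element of the list is one text, and each text is a vector of words. In other words, I have a list of vectors. I am doing some text mining on it, and I am now trying to extract the words that come right after the word "no". I have already transformed my vectors so that each element is a pair of consecutive words, such as:

list(c("want friend", "friend funny", "funny nice", "nice glad", "glad become", "become no", "no more", "more guys"), c("no comfort", "comfort written", "written conduct","conduct prevent", "prevent manners", "matters no", "no one", "one want", "want be", "be fired"))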

My goal is to get a list of vectors like this: list(c("more"), c("comfort", "one")), so that for a text i I can look up its vector of results with liste[i].

I have an expression that extracts the word after "no" (in the first vector that is "more"), but it does not work when a text contains several "no".

Here is my code:

liste_negation <- vector(length = length(data))
for (i in 1:length(data)){
  for (j in 1:length(data[[i]])){
    if (startsWith((data[[i]])[[j]], 'no') == TRUE){
      liste_neg[i] <- c(liste_neg[i], tail(strsplit((data[[i]])[[j]],split=" ")[[1]],1))
    } else{
      liste_neg[i] <- c(liste_neg[i])
    }
    liste_negation[[i]] <- c(liste_neg[[i]])
  }
}

It works on a single vector when that vector contains only one "no":

data <- list(c("want friend", "friend funny", "funny nice", "nice glad", "glad become", "become no", "no more", "more guys"), c("no comfort", "comfort written", "written conduct","conduct prevent", "prevent manners", "matters no", "no one", "one want", "want be", "be fired"))
data

liste_neg <- c()
liste_negation <- vector(length = length(data))
# data[[1]] has only 8 elements; "no more" is element 7
if (startsWith((data[[1]])[[7]], 'no') == TRUE){
  liste_neg[1] <- c(liste_neg[1], tail(strsplit((data[[1]])[[7]],split=" ")[[1]],1))
}

liste_negation[[1]] <- c(liste_neg[[1]])

But if I wrap it in a loop that goes over every element of the vector, and the text contains more than one "no", it stops working.

Code:

liste_neg <- c()
liste_negation <- vector(length = length(data))
for (j in 1:length(data[[2]])){
  if (startsWith((data[[2]])[[j]], 'no') == TRUE){
    liste_neg[2] <- append(liste_neg[2], tail(strsplit((data[[2]])[[j]],split=" ")[[1]],1))
  }
}
liste_neg
liste_negation[[2]] <- c(liste_neg[[2]])
liste_negation

This produces the following warning and output:

Warning message:
In liste_neg[2] <- append(liste_neg[2], tail(strsplit((data[[2]])[[j]],  :
  number of items to replace is not a multiple of replacement length
> liste_neg
[1] NA        "comfort"
> liste_negation[[2]] <- c(liste_neg[[2]])
> liste_negation
[1] "FALSE"   "comfort"

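As you can see, I only get "comfort"; the second match, "one", is missing.

I have tried many things, including splitting the code up and running it piece by piece, but after spending the whole morning on it I still have not found a solution.

Does anyone have an idea that could help me?

Thank you in advance (and sorry for my English, I am French ^^')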

Answer 1

In base R, we can use sapply to loop over the list and grep to pick out the elements that contain the word "no":

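# word_vec is the list of two-word vectors from the question (called data there)
# the word boundary must be written "\\b" inside an R string
output <- sapply(word_vec, function(x) sub(".*no", "", grep("\\bno\\b", x, value = TRUE)))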

#[[1]]
#[1] ""      " more"

#[[2]]
#[1] " comfort" ""         " one"

If you do not want the empty strings, you can drop them to get

sapply(output, function(x) trimws(x[x!= ""]))  
#[[1]]
#[1] "more"

#[[2]]
#[1] "comfort" "one"
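
The two steps above can probably be folded into one by anchoring the pattern at the start of each string, so that elements ending in "no" never produce an empty string. The following is only a sketch: the helper name words_after_no is made up for illustration, and the data is copied from the question. It uses lapply rather than sapply because sapply simplifies its result to a plain vector whenever every element yields the same number of matches.

```r
data <- list(
  c("want friend", "friend funny", "funny nice", "nice glad",
    "glad become", "become no", "no more", "more guys"),
  c("no comfort", "comfort written", "written conduct", "conduct prevent",
    "prevent manners", "matters no", "no one", "one want", "want be", "be fired")
)

# Hypothetical helper: keep the strings whose first word is exactly "no",
# then strip that leading "no" plus the whitespace after it.
words_after_no <- function(x) {
  hits <- grep("^no\\s", x, value = TRUE)
  sub("^no\\s+", "", hits)
}

# lapply always returns a list, one element per text
result <- lapply(data, words_after_no)
print(result)

# sapply on a single text with one match collapses to a character vector
simplified <- sapply(data[1], words_after_no)
print(is.list(simplified))
```

If a plain list is what the later code expects, lapply is the safer choice of the two.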

Answer 2

lapply(data, function(x) substr(x[startsWith(x, "no")], 4, 1000))


[[1]]
[1] "more"

[[2]]
[1] "comfort" "one"
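
One thing to keep in mind with this approach: startsWith(x, "no") is also TRUE for strings such as "nothing else", and the end position 1000 is an arbitrary upper bound. A slightly more defensive sketch, using a small made-up vector x rather than the data from the question, checks for the prefix "no " (with the trailing space) and derives the positions from nchar():

```r
# Made-up example vector, not taken from the question
x <- c("no one", "nothing else", "notice board", "become no")

# The bare prefix "no" also catches "nothing" and "notice"
print(startsWith(x, "no"))   # TRUE TRUE TRUE FALSE

# Requiring the trailing space keeps only the word "no"
prefix <- "no "
hits <- x[startsWith(x, prefix)]

# Start right after the prefix and run to the end of each string
res <- substr(hits, nchar(prefix) + 1, nchar(hits))
print(res)                   # "one"
```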

Answer 3

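You can use a regular expression with a capture group to get all the substrings that match the desired pattern, and then extract only the captured group, like this: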

# regex for strings that start with "no " and have any text after that
r <- '^no (.*)'
# the backreference must be written '\\1' inside an R string
lapply(data, function(x) gsub(r, '\\1', regmatches(x, regexpr(r, x))))

#output
[[1]]
[1] "more"

[[2]]
[1] "comfort" "one"

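regexpr returns the match object from which regmatches extracts the matching strings, and gsub with the \1 backreference (written '\\1' in an R string) then keeps only the first captured group.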
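
As a possible variation on the same idea, base R's regexec reports the capture groups directly, so the second pass with gsub is not needed. This is a sketch rather than the answer's own method: the helper name extract_group is invented here, and the data is copied from the question.

```r
data <- list(
  c("want friend", "friend funny", "funny nice", "nice glad",
    "glad become", "become no", "no more", "more guys"),
  c("no comfort", "comfort written", "written conduct", "conduct prevent",
    "prevent manners", "matters no", "no one", "one want", "want be", "be fired")
)

r <- "^no (.*)"

# Hypothetical helper: for each string, regexec + regmatches yields either
# character(0) (no match) or c(full match, capture group 1).
extract_group <- function(x) {
  m <- regmatches(x, regexec(r, x))
  m <- m[lengths(m) > 0]                       # drop the non-matching strings
  vapply(m, function(g) g[2], character(1))    # keep capture group 1
}

result <- lapply(data, extract_group)
print(result)
```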

Answer 4

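Steps to extract the word after "no":

First, use grep(i, pattern = "^no", value = T) to get the strings that start with "no".
Then gsub(pattern = "no ", replacement = "") replaces "no " with "".

What is left is the word that follows "no".

lapply() walks over the list and applies these steps to each of its elements.
The %>% pipe operator keeps the code readable by passing the result of grep() into gsub().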

library(magrittr)
lapply(data, function(i) grep(i, pattern = "^no", value = TRUE) %>% gsub(pattern = "no ", replacement = ""))
#[[1]]
#[1] "more"
#    
#[[2]]
#[1] "comfort" "one"
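
The same two steps (find the strings that start with "no", then keep the word after it) can also be written as an explicit loop, which stays closer to the code in the question. The warning there most likely comes from liste_neg being an atomic vector: liste_neg[2] can hold only a single value, so assigning the two-element result c("comfort", "one") to it keeps just the first item. The sketch below assumes the intent was one result vector per text, and stores the results in a list indexed with [[ ]]; the variable names found and bigram are introduced here for illustration.

```r
data <- list(
  c("want friend", "friend funny", "funny nice", "nice glad",
    "glad become", "become no", "no more", "more guys"),
  c("no comfort", "comfort written", "written conduct", "conduct prevent",
    "prevent manners", "matters no", "no one", "one want", "want be", "be fired")
)

# A list (not an atomic vector) so that each slot can hold several words
liste_negation <- vector("list", length(data))

for (i in seq_along(data)) {
  found <- character(0)                 # words after "no" in text i
  for (j in seq_along(data[[i]])) {
    bigram <- data[[i]][[j]]
    if (startsWith(bigram, "no ")) {
      # the last word of the bigram is the word that follows "no"
      found <- c(found, tail(strsplit(bigram, split = " ")[[1]], 1))
    }
  }
  liste_negation[[i]] <- found
}

print(liste_negation)
```

With this layout, liste_negation[[i]] gives the result vector for text i.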
