R - Delete unique rows in "neighborhood"

Question

I have input data in the following format:

 stress word
 0      hello
 1      hello
 1      this
 1      is
 1      a
 1      normal
 0      normal
 1      test
 1      hello

I would like the output to be:

stress  word       stress_pos
 0      hello      2
 1      hello      2
 1      normal     1
 0      normal     1

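The dataset is a large list of words annotated with stress position: if the kth row containing a word has a 1 in the first column, the stress falls on the kth syllable. A word may appear in several places in the list, so I want to remove the rows that have no duplicate within a 3-row window (for each row, look at the previous row and the next row). I only deal with two-syllable words, which is why I only look at immediate neighbors.

I can't use duplicated() or unique() (or I don't know how to), because they operate on the whole table rather than on a small part of it.

The third column gives the position of the stress within the word and can be derived from the first column.

Is there a way to do this without loops? What would be a good approach to this problem?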

Answer 1

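First, consider how to remove every word that is not duplicated by another word within the 3-row window, that is, in the row directly before or directly after it. You can determine whether each word matches the word at offset d: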

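matches <- function(words, d) {
  words <- as.character(words)
  if (d < 0) {
    # Compare each word with the one -d rows earlier; pad the start with "".
    words == c(rep("", -d), head(words, d))
  } else {
    # Compare each word with the one d rows later; pad the end with "".
    words == c(tail(words, -d), rep("", d))
  }
}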
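
As a quick check of the helper, here is a minimal sketch that rebuilds the sample data from the question and calls matches() for both offsets. The data frame construction is an assumption (the question only shows the printed table); the name dat is the one used in the rest of this answer, and prev_match and next_match are illustrative names.

```r
# Rebuild the question's sample data (assumed construction).
dat <- data.frame(
  stress = c(0, 1, 1, 1, 1, 1, 0, 1, 1),
  word   = c("hello", "hello", "this", "is", "a",
             "normal", "normal", "test", "hello"),
  stringsAsFactors = FALSE
)

# Same helper as above, repeated so this snippet runs on its own.
matches <- function(words, d) {
  words <- as.character(words)
  if (d < 0) {
    words == c(rep("", -d), head(words, d))
  } else {
    words == c(tail(words, -d), rep("", d))
  }
}

prev_match <- matches(dat$word, -1)  # TRUE where a row equals the row before it
next_match <- matches(dat$word, 1)   # TRUE where a row equals the row after it

which(prev_match)  # rows 2 and 7
which(next_match)  # rows 1 and 6
```

Because the padding value is the empty string, this sketch assumes no word in the data is itself "".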

You can then get the appropriate rows of the data with:

(out <- dat[rowSums(sapply(c(-1, 1), function(d) matches(dat$word, d))) > 0,])
#   stress   word
# 1      0  hello
# 2      1  hello
# 6      1 normal
# 7      0 normal
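
For the immediate-neighbor case only, the same filter can also be sketched with run-length encoding: a row has an equal neighbor exactly when it sits in a run of identical consecutive words of length 2 or more. This is an alternative technique, not the approach used above, and the names runs, keep and out_rle are illustrative.

```r
# Rebuild the question's sample data (assumed construction).
dat <- data.frame(
  stress = c(0, 1, 1, 1, 1, 1, 0, 1, 1),
  word   = c("hello", "hello", "this", "is", "a",
             "normal", "normal", "test", "hello"),
  stringsAsFactors = FALSE
)

# rle() needs an atomic vector, so convert in case word is a factor.
runs <- rle(as.character(dat$word))

# Expand the per-run flag back to one value per row.
keep <- rep(runs$lengths > 1, times = runs$lengths)

out_rle <- dat[keep, ]
out_rle
```

Unlike matches(), this version does not generalize to offsets other than -1 and 1.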

All that remains is to determine the stressed syllable:

out$word <- as.character(out$word)
out$stress_pos <- ave(out$stress, out$word, FUN=function(x) min(which(x == 1)))
out
#   stress   word stress_pos
# 1      0  hello          2
# 2      1  hello          2
# 6      1 normal          1
# 7      0 normal          1
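
Note that ave() above groups by word across the whole filtered table. If the same word could survive the filter in more than one place and each occurrence should be treated separately, one possible variation is to group by a run identifier computed on the original rows. The sketch below assumes that requirement, which the question does not state; run_id and keep are illustrative names.

```r
# Rebuild the question's sample data (assumed construction).
dat <- data.frame(
  stress = c(0, 1, 1, 1, 1, 1, 0, 1, 1),
  word   = c("hello", "hello", "this", "is", "a",
             "normal", "normal", "test", "hello"),
  stringsAsFactors = FALSE
)

runs   <- rle(as.character(dat$word))
# One id per run of identical consecutive words, computed before filtering
# so that separate occurrences of a word never merge.
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)
keep   <- rep(runs$lengths > 1, times = runs$lengths)

out <- dat[keep, ]
# Within each run, the stress position is the index of the first row with stress == 1.
out$stress_pos <- ave(out$stress, run_id[keep],
                      FUN = function(x) min(which(x == 1)))
out
```

As with the original, a group that contains no 1 makes min() return Inf with a warning, so the data is assumed to mark exactly one stressed syllable per word.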
