Need help updating my simple loop code to faster code that uses apply (R)

Question

I have two datasets:

A 10*1 matrix containing country names:

countries <- structure(
  c("usa", "canada", "france", "england", "brazil",
    "spain", "germany", "italy", "belgium", "switzerland"),
  .Dim = c(10L,1L))

And a 20*2 matrix containing 3-grams and the IDs of those 3-grams:

tri_grams <- structure(
  c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", 
    "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
    "mo", "an", "ce", "ko", "we", "ge", "ma", "fi", "br", "ca",
    "gi", "po", "ro", "ch", "ru", "tz", "il", "sp", "ai", "jo"), 
  .Dim = c(20L,2L),
  .Dimnames = list(NULL, c("id", "triGram")))

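I want to loop over the countries and, for each row, get the tri_grams that occur in that country's name. For example, brazil contains "br" and "il". I want to get the information as (country index (double), tri-gram ID (character)). So for brazil, I want to get (5,"9") and (5,"17").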
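To make the brazil case concrete, here is a minimal sketch that checks a single country name against every n-gram. The names `tri`, `ids` and `found` are hypothetical and only used for this illustration; `fixed = TRUE` is an assumption that the n-grams are plain strings rather than regular expressions.

```r
# The n-grams from tri_grams[, "triGram"], with ids "1" to "20" as in the sample data.
tri <- c("mo", "an", "ce", "ko", "we", "ge", "ma", "fi", "br", "ca",
         "gi", "po", "ro", "ch", "ru", "tz", "il", "sp", "ai", "jo")
ids <- as.character(seq_along(tri))

# Test each n-gram against the single string "brazil" (row 5 of countries).
found <- vapply(tri, grepl, logical(1), x = "brazil", fixed = TRUE)

ids[found]  # "9" "17"
```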

Here is the code with a simple loop:

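res <- matrix(ncol = 2, nrow = nrow(countries) * nrow(tri_grams))
colnames(res) <- c("indexCountry", "idTriGram")
k <- 0

for (i in seq_len(nrow(countries)))
{
  for (j in seq_len(nrow(tri_grams)))
  {
    # grepl() already returns TRUE/FALSE, so no comparison with TRUE is needed
    if (grepl(tri_grams[j, 2], countries[i, 1]))
    {
      k <- k + 1
      res[k, 1] <- i
      res[k, 2] <- tri_grams[j, 1]
    }
  }
}
# keep only the rows that were filled in
res <- res[seq_len(k), , drop = FALSE]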

It works perfectly, and the result is:

     indexCountry idTriGram
 [1,] "2"          "2"      
 [2,] "2"          "10"     
 [3,] "3"          "2"      
 [4,] "3"          "3"      
 [5,] "4"          "2"      
 [6,] "5"          "9"      
 [7,] "5"          "17"     
 [8,] "6"          "18"     
 [9,] "6"          "19"     
[10,] "7"          "2"      
[11,] "7"          "6"      
[12,] "7"          "7"      
[13,] "9"          "11"     
[14,] "10"         "2"      
[15,] "10"         "16"

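I want to get the same result, but using apply. I actually have a huge dataset; this is just a sample of my real data. When I use the simple loop approach on my real dataset, it takes a very long time to run (more than 10 hours). I tried to code it with apply, but without success.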

Answer 1

I don't know how much faster this actually is, but here is at least a concise way to get the same result.

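# outer() tests every (tri-gram, country) pair; countries is a 10x1 matrix, hence the [, , 1]
x <- which(outer(tri_grams[, "triGram"], countries, Vectorize(grepl))[, , 1], arr.ind = TRUE)
# x[, 1] is the row position in tri_grams, which equals the id only because ids here are 1..20
cbind(country = x[, 2], trigram = x[, 1])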

     country trigram
 [1,]       2       2
 [2,]       2      10
 [3,]       3       2
 [4,]       3       3
 [5,]       4       2
 [6,]       5       9
 [7,]       5      17
 [8,]       6      18
 [9,]       6      19
[10,]       7       2
[11,]       7       6
[12,]       7       7
[13,]       9      11
[14,]      10       2
[15,]      10      16
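If the ids in the real data are not simply the row numbers, the row positions have to be mapped back to the "id" column. Below is a minimal sketch of one way to do that while calling grepl() once per n-gram (it is vectorised over the country names) rather than once per pair. The names `country_vec`, `hits`, `idx` and `res2` are hypothetical, `fixed = TRUE` assumes the n-grams are literal strings rather than regular expressions, and I have not timed this on a large dataset.

```r
# Sample data from the question.
countries <- matrix(c("usa", "canada", "france", "england", "brazil",
                      "spain", "germany", "italy", "belgium", "switzerland"),
                    ncol = 1)
tri_grams <- cbind(
  id = as.character(1:20),
  triGram = c("mo", "an", "ce", "ko", "we", "ge", "ma", "fi", "br", "ca",
              "gi", "po", "ro", "ch", "ru", "tz", "il", "sp", "ai", "jo"))

country_vec <- countries[, 1]

# One grepl() call per n-gram tests all countries at once.
# hits is a logical matrix: rows = countries, columns = n-grams.
hits <- vapply(tri_grams[, "triGram"],
               function(g) grepl(g, country_vec, fixed = TRUE),
               logical(length(country_vec)))

# Positions of the TRUE cells, reordered by country and then by n-gram
# so that the rows come out in the same order as in the loop version.
idx <- which(hits, arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

# A data frame keeps the country index numeric and the id as character.
res2 <- data.frame(indexCountry = idx[, "row"],
                   idTriGram = tri_grams[idx[, "col"], "id"],
                   stringsAsFactors = FALSE)
res2
```

A data frame is used here because a matrix would coerce the country index to character, as in the loop output; wrap the index in as.numeric() if it has to be a double rather than an integer.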

r

vectorization

apply