Data frame gets overwritten inside a for loop in R

Question

My dataset contains a million observations, and I am working with the first 10,000 of them. Here is the link to the dataset file: dataset file link

itemRatingData = itemRatingData[1:10000,]
#V2 is user ID, V1 is item ID, V3 is item rating from user

library(plyr)
countUser = count(itemRatingData, vars = "V2")
#counted the total observations per user in dataset

list_of_total_Users = as.list(countUser$V2)
#taking out total number of users as a list

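What I want to do next is extract the observations of users who rated at least 10 items, which I managed to do. Some of these users have rated 50, 100, or even more than 1000 items, but I only need 10 observations for each user who rated at least 10 items. Here is what I came up with to get the expected result: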

for (i in 1:length(list_of_total_Users)) {
    occurencePerID = subset(itemRatingData, 
    itemRatingData$V2%in%list_of_total_Users[[i]])

    countOccurencePerID = count(occurencePerID, vars = "V2")
    if(countOccurencePerID$freq >= 10){
       newItemRatingData = occurencePerID[1:10,]
    }
}

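In this code, I subset all the observations for each user ID and then count them. If the frequency for that user ID is >= 10, I extract the first 10 observations. The problem I am facing is that newItemRatingData gets overwritten on every iteration of the loop.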
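
To make the symptom concrete, here is a minimal sketch on a made-up toy data frame (the names toy, rows and kept are hypothetical and not part of the real dataset). It only assumes the same V1/V2/V3 column layout and shows that a plain assignment inside the loop keeps just the last qualifying user:

```r
# Toy data (hypothetical): user 1 has 12 ratings, user 2 has 3, user 3 has 11.
toy <- data.frame(V1 = 1:26,
                  V2 = rep(c(1, 2, 3), times = c(12, 3, 11)),
                  V3 = rep(5L, 26))

for (id in unique(toy$V2)) {
  rows <- toy[toy$V2 == id, ]
  if (nrow(rows) >= 10) {
    # Plain assignment: replaces whatever kept held on the previous iteration.
    kept <- rows[1:10, ]
  }
}

unique(kept$V2)  # only user 3 is left; user 1's rows were replaced
nrow(kept)       # 10
```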

Answer 1

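I can't reproduce your problem without the data, but it looks like you are replacing the result in newItemRatingData on every iteration. If you use cbind(), you can append your rows to newItemRatingData without replacing the ones that are already there: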

newItemRatingData = data.frame()
for (i in 1:length(list_of_total_Users)) {
    occurencePerID = subset(itemRatingData, 
    itemRatingData$V2%in%list_of_total_Users[[i]])

    countOccurencePerID = count(occurencePerID, vars = "V2")
    if(countOccurencePerID$freq >= 10){
       newItemRatingData = cbind(newItemRatingData,occurencePerID[1:10,])
    }
}

Answer 2

I have solved my problem. The solution is:

newItemRatingData = data.frame("V2" = numeric(0), "V1" = numeric(0), "V3" = integer(0))

for (i in 1:length(list_of_total_Users)) {
  occurencePerID = subset(itemRatingData, itemRatingData$V2%in%list_of_total_Users[[i]])

  countOccurencePerID = count(occurencePerID, vars = "V2")
  if (countOccurencePerID$freq >= 10) {
    newItemRatingData = rbind(newItemRatingData, occurencePerID[1:10,])
  }
}
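
As a possible alternative to growing the data frame inside the loop, the pieces can be collected in a list and bound once at the end. This is only a sketch in base R on made-up toy data (toy, per_user, enough, first_ten and result are hypothetical names), assuming the same V1/V2/V3 columns:

```r
# Toy data (hypothetical): user 1 has 12 ratings, user 2 has 3, user 3 has 11.
toy <- data.frame(V1 = 1:26,
                  V2 = rep(c(1, 2, 3), times = c(12, 3, 11)),
                  V3 = rep(5L, 26))

# Split the rows by user ID, giving one data frame per user.
per_user <- split(toy, toy$V2)

# Keep only users with at least 10 rows, then take each one's first 10 rows.
enough <- Filter(function(d) nrow(d) >= 10, per_user)
first_ten <- lapply(enough, head, n = 10)

# Bind all pieces by row in a single call.
result <- do.call(rbind, first_ten)

table(result$V2)  # users 1 and 3, with 10 rows each
```

Binding once avoids copying the accumulated data frame on every iteration, which may matter with many users.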

As for @fino's answer, it binds the data frames by column. I found a solution that binds the data frames by row.
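
The difference between the two functions can be seen on two small made-up data frames (a and b are hypothetical, with the same column names as the dataset):

```r
a <- data.frame(V1 = 1:2, V2 = c(7, 7), V3 = c(4L, 5L))
b <- data.frame(V1 = 3:4, V2 = c(9, 9), V3 = c(3L, 2L))

by_row <- rbind(a, b)  # stacks the observations: 4 rows, 3 columns
by_col <- cbind(a, b)  # places them side by side: 2 rows, 6 columns

dim(by_row)
dim(by_col)
```

Since each user's observations are additional rows of the same three variables, rbind() is the one that matches the goal here.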

recommendation-engine

r

dataframe

data-science