Repetitive Action Over Ten Matrices in R

Question

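I have ten datasets, each containing "ratings" and "occupation" columns. From each of these ten datasets, I want to find the "average" of the "ratings" for each of three occupation groups (namely artist, technician, and marketing).

The code I wrote is as follows: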

Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
                                    occupation=as.character(), 
                                    average.rating=as.numeric(), 
                                    stringsAsFactors=FALSE) 
##interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)

Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)

e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval, 
c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))


e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval, 
c(1,"technician",mean(e.1.technician$rating)))


e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval, 
c(1,"marketing",mean(e.1.marketing$rating)))

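This is obviously not efficient at all: for ten datasets, I would have to rewrite the same code 9 more times to get the average rating per occupation group for all ten. Is there a better way? I can't think of one! I found that apply/lapply can do this, but I couldn't figure out how they apply to my case.

Two of my datasets (e1 and e2) can be found here. (I have included only 10% of the observations from each.)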

Answer 1

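You can use the tidyverse package to summarize each data frame. First, you need to put them in a list. Then you can map over each data frame in the list and summarize by occupation: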

library(tidyverse)

# Create sample data
set.seed(2353)

# rerun() is deprecated as of purrr 1.0.0; map() over 1:10 does the same job
sample_data <- map(1:10, ~ tibble(
  occupation = sample(c("Artist", "Technician", "Marketing"), 100, replace = TRUE),
  ratings    = sample(1:100, 100, replace = TRUE)
))

# Summarize by occupation
summarized_data <- sample_data %>% 
  map(~ .x %>% group_by(occupation) %>% summarize(avg_rating = mean(ratings)))
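The question asks for a single table with an interval number, the occupation, and the average rating. A minimal sketch of one way to get there from a list of data frames, assuming the list order matches the dataset number; the simulated data, the `wanted` vector, and the `average_rating_per_interval` name are illustrative and not part of the original answer:

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(2353)

# Hypothetical stand-ins for the ten real datasets
sample_data <- map(1:10, ~ tibble(
  occupation = sample(c("artist", "technician", "marketing", "other"), 100, replace = TRUE),
  ratings    = sample(1:5, 100, replace = TRUE)
))

# The three occupation groups named in the question
wanted <- c("artist", "technician", "marketing")

average_rating_per_interval <- sample_data %>%
  # Summarize each data frame separately, keeping only the wanted groups
  map(~ .x %>%
        filter(occupation %in% wanted) %>%
        group_by(occupation) %>%
        summarize(average.rating = mean(ratings), .groups = "drop")) %>%
  # Stack the summaries; for an unnamed list, .id records the list position
  bind_rows(.id = "interval") %>%
  mutate(interval = as.integer(interval))
```

With real data, naming the list elements before calling `bind_rows()` would put those names in the `interval` column instead of positions, in which case the `as.integer()` step should be dropped.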

Answer 2

Another option, using base R. First load the files into a list, then use lapply to calculate the means for each dataset:

# Set the working directory to the folder that contains the files
files <- list.files()

# Load all the data at once into a single list
l <- lapply(files, dget)
names(l) <- substr(files, 1, 2) # gives meaningful names to list elements (datasets)

# Calculate the mean by group for each dataset
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))

# Subset all the group means to just those you're interested in
sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])

                 d1       d2
artist     3.540984 3.612048
technician 3.519512 3.651106
marketing  3.147208 3.342569

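Note that if your data are already loaded, you can put them into a list (instead of loading all the data directly into a list) and then use the lapply function; it should still work.

Edit

I just realized that you only want the means for three groups. I have edited the code above to subset all the means to just those three groups.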
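For data frames that are already in the workspace, one possible way to build the list is `mget()`, which looks objects up by name. This is only a sketch: it assumes the objects follow the `e.1`, `e.2`, ... naming used in the question, and the simulated data below (three data frames rather than ten) are hypothetical stand-ins.

```r
# Hypothetical stand-ins for data frames already loaded as e.1, e.2, e.3
set.seed(1)
for (i in 1:3) {
  assign(paste0("e.", i), data.frame(
    occupation = sample(c("artist", "technician", "marketing", "student"), 50, replace = TRUE),
    rating     = sample(1:5, 50, replace = TRUE)
  ))
}

# Collect the existing objects into a named list by looking them up by name
l <- mget(paste0("e.", 1:3))

# Same steps as above: mean by group for each dataset, then keep three groups
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))
group_means <- sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])
```

The list elements inherit the object names, so the columns of `group_means` are labelled `e.1`, `e.2`, and `e.3`.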

Answer 3

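I recommend the "plyr" package for this kind of operation; it is well worth spending an hour or so learning it. For your example, I loaded your first sample dataset into "d1", and I can summarize it like this: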

library(plyr)

ddply(d1, .(occupation), summarise, mean_rating=mean(rating))

This shows the results for all occupations, whereas you only want three specific ones, so we can filter it down to those:

ddply(subset(d1, occupation %in% c('artist','technician','marketing')), .(occupation), summarise, mean_rating=mean(rating))

Now we just need to generalize this to run over 10 datasets without cutting and pasting. Let's store the data frames in a list:

dataset_list <- list(d1=d1) # you would put all of them here; I just have one

Now we can run the same code using lapply, which returns a list:

filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
    ddply(subset(dataset,occupation %in% filtered_occupations), 
    .(occupation), summarise, mean_rating=mean(rating))} )

Result:

$d1
  occupation mean_rating
1     artist    3.540984
2  marketing    3.147208
3 technician    3.519512
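If a single data frame is preferred over a list, plyr's `ldply()` can stack the per-dataset summaries and record which dataset each row came from. The following is a sketch rather than part of the original answer; the simulated data, the `make_dataset` helper, and the `"dataset"` column name are assumptions made for illustration.

```r
library(plyr)

set.seed(42)

# Hypothetical helper that simulates one dataset
make_dataset <- function() {
  data.frame(
    occupation = sample(c("artist", "technician", "marketing", "student"), 60, replace = TRUE),
    rating     = sample(1:5, 60, replace = TRUE)
  )
}

dataset_list <- list(d1 = make_dataset(), d2 = make_dataset())
filtered_occupations <- c("artist", "technician", "marketing")

# ldply() applies the function to each list element and row-binds the results;
# .id names the column that holds the list element names
combined <- ldply(dataset_list, function(dataset) {
  ddply(subset(dataset, occupation %in% filtered_occupations),
        .(occupation), summarise, mean_rating = mean(rating))
}, .id = "dataset")
```

Each row of `combined` then carries the dataset name, the occupation, and its mean rating, which is close to the table the question was building by hand.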
