Repetitive Action Over Ten Matrices in R

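I have ten datasets, each containing "ratings" and "occupation" columns. I want to find the average rating for each of three occupation groups (artist, technician, marketing) in every one of the ten datasets.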

The code I wrote is as follows:

Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
                                    occupation=as.character(), 
                                    average.rating=as.numeric(), 
                                    stringsAsFactors=FALSE) 
##interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)

Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)

e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval, 
c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))


e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval, 
c(1,"technician",mean(e.1.technician$rating)))


e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval, 
c(1,"marketing",mean(e.1.marketing$rating)))

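This is obviously not efficient at all: with ten datasets, I would have to rewrite the same code nine more times to get the average rating for each occupation group in all of them. Is there a better way? I can't think of one! I found that apply/lapply can do this, but I can't figure out how they apply to my case.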

Two of my datasets (e1 and e2) can be found here. (I only included 10% of the observations from each.)

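You can use the tidyverse packages to summarize each data frame. First, put the data frames in a list. Then iterate over each data frame in the list and summarize by occupation: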

library(tidyverse)

# Create sample data
set.seed(2353)

# map() over 1:10 is used here because rerun() is deprecated as of purrr 1.0.0
sample_data <- map(1:10, ~ tibble(
  occupation = sample(c("Artist", "Technician", "Marketing"), 100, replace = TRUE),
  ratings    = sample(1:100, 100, replace = TRUE)
))

# Summarize by occupation
summarized_data <- sample_data %>% 
  map(~ .x %>% group_by(occupation) %>% summarize(avg_rating = mean(ratings)))
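The result above is a list with one summary per data frame. If a single table with a dataset index is preferred, similar to the Average.Rating.per.Interval table in the question, the summaries can be stacked. The following is only a sketch under assumptions: `toy_list`, `wanted`, `per_dataset`, and `combined` are hypothetical names, the data is made up, and the column is assumed to be called `ratings` as in the sample data above.

```r
library(dplyr)

set.seed(1)  # toy data only; not the asker's datasets

# Two hypothetical data frames standing in for the ten real ones
toy_list <- lapply(1:2, function(i) {
  data.frame(
    occupation = rep(c("artist", "technician", "marketing", "lawyer"), each = 5),
    ratings    = sample(1:5, 20, replace = TRUE),
    stringsAsFactors = FALSE
  )
})

wanted <- c("artist", "technician", "marketing")

# Summarize each data frame, keeping only the occupations of interest
per_dataset <- lapply(toy_list, function(df) {
  df %>%
    filter(occupation %in% wanted) %>%
    group_by(occupation) %>%
    summarize(avg_rating = mean(ratings), .groups = "drop")
})

# Stack the summaries; .id records which list element each row came from
combined <- bind_rows(per_dataset, .id = "interval") %>%
  mutate(interval = as.integer(interval))

print(combined)
```

Filtering before grouping keeps the unwanted occupations out of the result, and the `interval` column plays the role of the dataset number from the question.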

Another option, using base R. First load the files into a list, then use lapply to calculate the means for each dataset:

# Set the working directory to the folder that contains the files
files <- list.files()

# Load all the data at once into a single list
l <- lapply(files, dget)
names(l) <- substr(files, 1, 2) # gives meaningful names to list elements (datasets)

# Calculate the mean by group for each dataset
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))

# Subset all the group means to just those you're interested in
sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])

                 d1       d2
artist     3.540984 3.612048
technician 3.519512 3.651106
marketing  3.147208 3.342569

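Note that if your data is already loaded, you can put the data frames into a list (instead of loading everything directly into a list) and then use lapply in the same way; it should still work.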
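For data frames that are already in the workspace, one way to collect them is `mget()`, which looks objects up by name. This is a minimal sketch, assuming the objects are called `e.1`, `e.2`, and so on as in the question; the values below are made up for illustration, and `dataset_names`, `wanted`, and `group_means` are hypothetical names.

```r
# Toy stand-ins for data frames already in the workspace (hypothetical values)
e.1 <- data.frame(occupation = c("artist", "artist", "technician", "marketing", "lawyer"),
                  rating     = c(4, 2, 5, 3, 1))
e.2 <- data.frame(occupation = c("artist", "technician", "technician", "marketing", "lawyer"),
                  rating     = c(5, 3, 4, 2, 4))

# Collect the objects named e.1, e.2, ... into a named list
dataset_names <- paste0("e.", 1:2)   # use 1:10 for ten datasets
l <- mget(dataset_names, envir = environment())

# Same tapply() approach as above, subset to the three groups of interest
wanted <- c("artist", "technician", "marketing")
group_means <- sapply(l, function(x) {
  tapply(x$rating, x$occupation, mean, na.rm = TRUE)[wanted]
})
print(group_means)
```

Building the names with `paste0()` avoids typing all ten objects by hand, and the list keeps those names, so the columns of the result are labelled by dataset.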

Edit

I just realized you only want the means for three groups. I have edited the code above to subset all the group means to just those three.
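The sapply result above is a matrix with occupations as rows and datasets as columns. If the long layout from the question (interval, occupation, average.rating) is wanted instead, it can be reshaped with base R. This is a sketch rather than part of the original answer: the matrix is rebuilt by hand from the output printed above, and `long` is a hypothetical name.

```r
# Group means copied from the output shown above (filled column by column)
group_means <- matrix(
  c(3.540984, 3.519512, 3.147208,
    3.612048, 3.651106, 3.342569),
  nrow = 3,
  dimnames = list(c("artist", "technician", "marketing"), c("d1", "d2"))
)

# as.table() followed by as.data.frame() gives one row per matrix cell
long <- as.data.frame(as.table(group_means), stringsAsFactors = FALSE)
names(long) <- c("occupation", "interval", "average.rating")

# Turn "d1", "d2" into the numeric interval used in the question
long$interval <- as.integer(sub("^d", "", long$interval))
long <- long[, c("interval", "occupation", "average.rating")]
print(long)
```

Unlike the rbind approach in the question, this keeps average.rating numeric, because the values never pass through a character matrix.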

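I recommend the "plyr" package for this kind of operation; it is well worth spending an hour or so learning it. For your example, I loaded your first sample dataset into "d1", and I can summarize it like this: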

library(plyr)

ddply(d1, .(occupation), summarise, mean_rating=mean(rating))

This shows results for all occupations, whereas you only want three specific ones, so we can filter down to those:

ddply(subset(d1, occupation %in% c('artist','technician','marketing')), .(occupation), summarise, mean_rating=mean(rating))

Now we just need to generalize this to run over all 10 datasets without cutting and pasting. Let's store the data frames in a list:

dataset_list <- list(d1=d1) # you would put all of them here; I just have one

Now we can run the same code with lapply, which returns a list:

filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
    ddply(subset(dataset,occupation %in% filtered_occupations), 
    .(occupation), summarise, mean_rating=mean(rating))} )

Result:

$d1
  occupation mean_rating
1     artist    3.540984
2  marketing    3.147208
3 technician    3.519512
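To get from a list of per-dataset summaries to one table like the Average.Rating.per.Interval data frame in the question, the list elements can be tagged with their position and stacked. The sketch below is dependency-free and makes some substitutions: base R `aggregate()` stands in for `ddply()`, the data values are invented, and `summaries`, `tagged`, and `average_rating_per_interval` are hypothetical names.

```r
# Toy data standing in for the real datasets (hypothetical values)
d1 <- data.frame(occupation = c("artist", "artist", "marketing", "technician", "lawyer"),
                 rating     = c(4, 2, 3, 5, 1))
d2 <- data.frame(occupation = c("artist", "marketing", "technician", "technician", "lawyer"),
                 rating     = c(5, 2, 3, 4, 4))
dataset_list <- list(d1 = d1, d2 = d2)
filtered_occupations <- c("artist", "technician", "marketing")

# aggregate() plays the role of ddply() here: one mean per occupation
summaries <- lapply(dataset_list, function(dataset) {
  kept <- subset(dataset, occupation %in% filtered_occupations)
  aggregate(rating ~ occupation, data = kept, FUN = mean)
})

# Tag each summary with its position in the list, then stack them
tagged <- Map(function(s, i) cbind(interval = i, s), summaries, seq_along(summaries))
average_rating_per_interval <- do.call(rbind, tagged)
rownames(average_rating_per_interval) <- NULL
names(average_rating_per_interval)[3] <- "average.rating"
print(average_rating_per_interval)
```

The same tagging and stacking step should work on the list returned by the lapply/ddply call above, since each element there is also a data frame with one row per occupation.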