Repetitive Action Over Ten Matrices in R
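I have ten datasets, each containing "ratings" and "occupation" columns. From each of these ten datasets, I want to find the average rating for each of three occupation groups (artist, technician, marketing).
The code I wrote is as follows: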
Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
                                          occupation=as.character(),
                                          average.rating=as.numeric(),
                                          stringsAsFactors=FALSE)
## interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)
Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)

e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))

e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"technician",mean(e.1.technician$rating)))

e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"marketing",mean(e.1.marketing$rating)))
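This is clearly not efficient at all: with ten datasets, I would have to rewrite the same code nine more times to get the average rating of each occupation group for all ten datasets. Is there a better way? I can't think of one! I found that apply/lapply can do this, but I couldn't figure out how they apply to my case.
Two of my datasets (e1 and e2) can be found here. (I have included only 10% of the observations from each.)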
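You can use the tidyverse packages to summarize each data frame. First, you need to put them in a list. Then you can map over each data frame in the list and summarize by occupation: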
library(tidyverse)
# Create sample data
set.seed(2353)
# map() over 1:10 builds ten data frames; rerun() is deprecated as of purrr 1.0.0
sample_data <- map(1:10, ~ tibble(
  occupation = sample(c("Artist", "Technician", "Marketing"), 100, replace = TRUE),
  ratings = sample(1:100, 100, replace = TRUE)
))
# Summarize by occupation
summarized_data <- sample_data %>%
  map(~ .x %>% group_by(occupation) %>% summarize(avg_rating = mean(ratings)))
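To get a single table like the Average.Rating.per.Interval in the question, the per-dataset summaries can be restricted to the three occupations and stacked. The following is only a sketch: the two small tibbles are made-up stand-ins for the real data, the column is assumed to be called `rating` as in the question's code, and the names `datasets`, `wanted` and `average_rating_per_interval` are hypothetical.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-ins for two of the ten datasets, named by interval number
datasets <- list(
  `1` = tibble(occupation = c("artist", "artist", "technician", "marketing", "writer"),
               rating = c(4, 2, 5, 3, 1)),
  `2` = tibble(occupation = c("artist", "technician", "technician", "marketing", "writer"),
               rating = c(1, 3, 5, 4, 2))
)
wanted <- c("artist", "technician", "marketing")

average_rating_per_interval <- datasets %>%
  # Keep only the three occupations, then average within each one
  map(~ .x %>%
        filter(occupation %in% wanted) %>%
        group_by(occupation) %>%
        summarize(average_rating = mean(rating))) %>%
  # Stack the summaries; the list names become the "interval" column
  bind_rows(.id = "interval")
```

Naming the list elements explicitly keeps the interval label tied to each dataset rather than to its position in the list.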
Another option, using base R. First load the files into a list, then use lapply to calculate the group means for each dataset:
# Set the working directory to the folder that contains the files
files <- list.files()
# Load all the data at once into a single list
l <- lapply(files, dget)
names(l) <- substr(files, 1, 2) # gives meaningful names to list elements (datasets)
# Calculate the mean by group for each dataset
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))
# Subset all the group means to just those you're interested in
sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])
                 d1       d2
artist     3.540984 3.612048
technician 3.519512 3.651106
marketing  3.147208 3.342569
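Note that if your data are already loaded, you can put them into a list (rather than loading all the data directly into a list) and then use the lapply function; it should still work.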
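For data that are already in the workspace, one possible way to build that list is `mget()`, which collects objects by name. This is a minimal sketch, assuming the objects are called `e.1`, `e.2`, and so on as in the question; the two small data frames here are invented for illustration.

```r
# Hypothetical stand-ins for datasets that are already loaded
e.1 <- data.frame(occupation = c("artist", "artist", "technician", "marketing"),
                  rating = c(4, 2, 5, 3))
e.2 <- data.frame(occupation = c("artist", "technician", "technician", "marketing"),
                  rating = c(1, 3, 5, 4))

# Collect the objects by name into a named list
dataset_names <- paste0("e.", 1:2)   # use 1:10 for ten datasets
l <- mget(dataset_names)

# Same group-mean computation as in the file-based version
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))
group_means <- sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])
```

Here `group_means` is a matrix with one row per occupation and one column per dataset, named after the original objects.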
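Edit
I just realized that you only want the means for three groups. I have edited the code above to subset all the means to just those three groups.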
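I recommend the "plyr" package for this kind of operation; it is well worth the hour or so it takes to learn. For your example, I loaded your first sample dataset into "d1", and I can summarize it like this:
library(plyr)
ddply(d1, .(occupation), summarise, mean_rating=mean(rating))
This shows the results for all occupations, whereas you only want three specific ones, so we can filter down to those:
ddply(subset(d1, occupation %in% c('artist','technician','marketing')), .(occupation), summarise, mean_rating=mean(rating))
Now we just need to generalize this to run over 10 datasets without cutting and pasting. Let's store the data frames in a list: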
dataset_list <- list(d1=d1) # you would put all of them here; I just have one
Now we can run the same code with lapply, which returns a list:
filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
  ddply(subset(dataset, occupation %in% filtered_occupations),
        .(occupation), summarise, mean_rating=mean(rating))
})
Result:
$d1
  occupation mean_rating
1     artist    3.540984
2  marketing    3.147208
3 technician    3.519512
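If a single table with an interval column is wanted, as in the question's Average.Rating.per.Interval, the list of per-dataset results can be stacked with `do.call(rbind, ...)`. The sketch below uses only base R (`aggregate()` in place of `ddply()`), and the two small data frames are hypothetical stand-ins for the real datasets.

```r
# Hypothetical sample data standing in for two of the ten datasets
dataset_list <- list(
  data.frame(occupation = c("artist", "artist", "technician", "marketing", "writer"),
             rating = c(4, 2, 5, 3, 1)),
  data.frame(occupation = c("artist", "technician", "technician", "marketing", "writer"),
             rating = c(1, 3, 5, 4, 2))
)
filtered_occupations <- c("artist", "technician", "marketing")

# One small data frame per dataset, tagged with its position in the list
per_interval <- lapply(seq_along(dataset_list), function(i) {
  kept <- subset(dataset_list[[i]], occupation %in% filtered_occupations)
  means <- aggregate(rating ~ occupation, data = kept, FUN = mean)
  data.frame(interval = i,
             occupation = as.character(means$occupation),
             average.rating = means$rating,
             stringsAsFactors = FALSE)
})

# Stack the pieces into one data frame with numeric ratings
Average.Rating.per.Interval <- do.call(rbind, per_interval)
```

Building a data frame rather than a matrix keeps `average.rating` numeric; binding numbers and strings into a matrix with `c()` coerces everything to character.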