Summarize selected entries in a data set with a query
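I'm still quite new to R and am trying to summarize data in a specific way. For illustration, I use the weather data from the nasaweather package. For example, I want the mean temperature on a particular day, shown for the 3 origins and 12 months contained in this data set.
I figured I could do it with the code below, in which I specify the day I'm interested in, create an empty data frame to fill, and then loop over the months: for each month I calculate the mean temperature for each origin, bind the means to the month, and then bind that row to the data frame. Finally I adjust the column names and print the result: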
library(nasaweather)
library(magrittr)
library(dplyr)
query_day = 15
data_output <- data.frame(month = numeric(),
                          EWR = numeric(),
                          JFK = numeric(),
                          LGA = numeric())
for (i in 1:12) {
  data_subset <- weather %>%
    filter(day == query_day, month == i) %>%
    summarize(
      EWR = mean(temp[origin == "EWR"]),
      JFK = mean(temp[origin == "JFK"]),
      LGA = mean(temp[origin == "LGA"]))
  data_output <- rbind(data_output, cbind(i, data_subset))
  rm(data_subset)
}
names(data_output) <- c("month", "EWR", "JFK", "LGA")
print(data_output)
For me, this produces the following result:
month EWR JFK LGA
1 1 39.3725 39.0875 38.9150
2 2 42.1625 39.3425 42.9050
3 3 37.4150 36.7775 37.3025
4 4 50.1275 48.1550 49.2050
5 5 58.8725 55.7150 59.1575
6 6 70.7825 70.2950 71.5700
7 7 86.9900 85.1225 87.2000
8 8 69.2075 69.0725 69.9425
9 9 60.6350 61.2125 61.7375
10 10 59.8850 58.3850 60.5150
11 11 45.7475 45.1700 49.0700
12 12 32.4950 38.0975 34.0325
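This is exactly what I want. I just feel that my code is too complicated and wanted to ask whether there is a simpler way to do this?
Your code has several problems, but the main one is that you didn't group_by first. Once you include that, this becomes easy. First look at my comments on your code, then at the final code at the bottom: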
library(nasaweather)  ## Wrong package: this weather data set comes from nycflights13
# library(magrittr)   ## Not needed: dplyr re-exports the %>% pipe
library(dplyr)
query_day = 15
# data_output <- data.frame(month = numeric(),  ## We won't need to create this explicitly.
#   ## (You are right to set up the result before a for loop. Go one step
#   ## further by actually telling the data.frame how many rows to expect.
#   ## But not in this case, because we won't use a for loop.)
#                           EWR = numeric(),
#                           JFK = numeric(),
#                           LGA = numeric())
for (i in 1:12) {  ## You don't need a for loop: group_by() followed by summarize() does this.
  data_subset <- weather %>%
    filter(day == query_day, month == i) %>%
    summarize(  ## Before summarize, you need a group_by to say what to summarize by
      EWR = mean(temp[origin == "EWR"]),
      JFK = mean(temp[origin == "JFK"]),
      LGA = mean(temp[origin == "LGA"]))
  data_output <- rbind(data_output, cbind(i, data_subset))  ## With group_by, this step isn't required.
  # rm(data_subset)  ## You don't have to remove temporary data sets:
  ## data_subset is simply overwritten on each iteration (it does stay in the workspace after the loop).
}
names(data_output) <- c("month", "EWR", "JFK", "LGA")
print(data_output)
################### Better code:
library(nycflights13)  ## The package you want is nycflights13, not nasaweather
library(dplyr)
query_day = 15
## Keep only the day of interest, then compute one row of means per month
data_output <- weather %>%
  filter(day == query_day) %>%
  group_by(month) %>%
  summarize(
    EWR = mean(temp[origin == "EWR"]),
    JFK = mean(temp[origin == "JFK"]),
    LGA = mean(temp[origin == "LGA"]))
data_output
Output:
> data_output
# A tibble: 12 × 4
month EWR JFK LGA
<dbl> <dbl> <dbl> <dbl>
1 1 39.3725 39.0875 38.9150
2 2 42.1625 39.3425 42.9050
3 3 37.4150 36.7775 37.3025
4 4 50.1275 48.1550 49.2050
5 5 58.8725 55.7150 59.1575
6 6 70.7825 70.2950 71.5700
7 7 86.9900 85.1225 87.2000
8 8 69.2075 69.0725 69.9425
9 9 60.6350 61.2125 61.7375
10 10 59.8850 58.3850 60.5150
11 11 45.7475 45.1700 49.0700
12 12 32.4950 38.0975 34.0325
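If you would rather not type out one line per origin, a possible variation is to group by both month and origin and then spread the origins into columns. This is only a sketch under some assumptions: it uses a small made-up data frame (toy_weather, with invented temperatures) in place of nycflights13::weather so that it runs on its own, and it relies on tidyr's pivot_wider and on the .groups argument of summarize, which need reasonably recent versions of tidyr and dplyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for nycflights13::weather; the temperatures are invented.
toy_weather <- data.frame(
  origin = rep(c("EWR", "JFK", "LGA"), each = 4),
  month  = rep(c(1, 1, 2, 2), times = 3),
  day    = 15,
  temp   = c(30, 34, 40, 44,   # EWR
             31, 35, 41, 43,   # JFK
             28, 30, 38, 44)   # LGA
)

query_day <- 15

data_output <- toy_weather %>%
  filter(day == query_day) %>%
  # One group per month/origin pair, so no origin has to be named by hand
  group_by(month, origin) %>%
  summarize(mean_temp = mean(temp), .groups = "drop") %>%
  # Turn the origin values into columns: one row per month, one column per origin
  pivot_wider(names_from = origin, values_from = mean_temp)

data_output
```

With the real data, replacing toy_weather by weather (after library(nycflights13)) should give the same month by origin layout as above, and it would pick up any additional origins without further changes.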