R: Calculating new column (mean/median)
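I want to calculate the mean or median of one column, but be able to choose which values go into the calculation based on another column. (See the data frame below.)
Calculating the mean/median of the whole Percentage column works fine, but I am having trouble doing it for a selection, for example the median of Percentage for all entries where Date is "2014".
Any advice on how to do this would be much appreciated! Apologies if this has already been answered elsewhere on SO; I could not find it.
My code is listed below in case you need to reproduce the data.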
#Step 1: Load the required libraries
library(tidyverse)
library(rvest)
library(jsonlite)
library(stringi)
library(dplyr)
library(data.table)
library(ggplot2)
#Step 2: Set the URL where the data is located
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
#Step 3: Read the JSON returned by the URL
data <- jsonlite::fromJSON(url, flatten = TRUE)
#Step 4: Get the total number of items in the API
totalItems <- data$TotalNumberOfItems
#Step 5: Fetch all items from the API into one data frame
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems,'/') %>%
  jsonlite::fromJSON(., flatten = TRUE) %>%
  .[1] %>%
  as.data.frame() %>%
  rename_with(~str_replace(., "ListItems.", ""), everything())
#Step 6: Remove columns that are not needed
allData <- allData[, -c(1,4,8,9,11,12,13,14,15)]
#Step 7: Remove whitespace and convert the SoldAmount and Tax columns to numeric
allData[c("Tax", "SoldAmount")] <- lapply(allData[c("Tax", "SoldAmount")], function(z) as.numeric(gsub(" ", "", z)))
#Step 8: Remove rows where a numeric value is NA
alldata <- allData %>%
  filter(if_all(where(is.numeric), ~ !is.na(.)))
#Step 9: Remove values below 10000 NOK in SoldAmount and Tax
#(as written, a row is kept if at least one numeric column exceeds 10000)
alldata <- alldata %>%
  filter(if_any(where(is.numeric), ~ . > 10000))
#Step 10: Calculate the change from Tax to SoldAmount as a fraction of Tax and store it in a new column
alldata_Percent <- alldata %>% mutate(Percentage = (SoldAmount - Tax) / Tax)
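In case the API is unreachable, Steps 7 to 10 can be tried on a few made-up rows. This is only a sketch: `raw` and its values are hypothetical and do not come from the real data. It also assumes the intent is to require both Tax and SoldAmount to exceed 10000, which is stricter than the condition in Step 9.

```r
library(dplyr)

# Hypothetical raw values (made up), with spaces as thousands separators
raw <- data.frame(
  Date       = c("2014", "2014", "2015", "2016"),
  Tax        = c("1 200 000", "5 000", NA, "350 000"),
  SoldAmount = c("1 500 000", "8 000", "900 000", "350 000")
)

# Strip the spaces and convert both columns to numeric
raw[c("Tax", "SoldAmount")] <- lapply(
  raw[c("Tax", "SoldAmount")],
  function(z) as.numeric(gsub(" ", "", z))
)

cleaned <- raw %>%
  filter(if_all(c(Tax, SoldAmount), ~ !is.na(.x))) %>%  # drop rows with NA
  filter(if_all(c(Tax, SoldAmount), ~ .x > 10000)) %>%  # both must exceed 10000
  mutate(Percentage = (SoldAmount - Tax) / Tax)         # change as a fraction of Tax

print(cleaned)
```

Swapping `if_all` for `if_any` in the second filter would keep a row when only one of the two columns exceeds 10000.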
Are you just looking for group_by and summarize from dplyr?
alldata_Percent %>%
  group_by(Date) %>%
  summarize(median_percent = median(Percentage),
            mean_percent = mean(Percentage))
#> # A tibble: 15 x 3
#>    Date  median_percent mean_percent
#>    <chr>          <dbl>        <dbl>
#>  1 1970          0           1.98
#>  2 2003          0          -0.0345
#>  3 2004          0           0.141
#>  4 2005          0.0723      0.156
#>  5 2006          0.0132      0.204
#>  6 2007          0.024       0.131
#>  7 2008          0          -0.00499
#>  8 2009          0.0247      0.0769
#>  9 2010          0.0340      0.0422
#> 10 2011          0           0.155
#> 11 2012          0           0.0103
#> 12 2013          0           0.0571
#> 13 2014          0           0.0352
#> 14 2015          0           0.0646
#> 15 2016          0          -0.0195
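If only a single year is needed, such as the median of Percentage for 2014 from the question, one option is to filter before summarizing. Below is a minimal sketch on a small made-up data frame. `toy` is a hypothetical stand-in for `alldata_Percent`, and its values are invented rather than taken from the API.

```r
library(dplyr)

# Hypothetical toy data standing in for alldata_Percent (values are made up)
toy <- data.frame(
  Date       = c("2013", "2014", "2014", "2014", "2015"),
  Tax        = c(100000, 200000, 150000, 400000, 300000),
  SoldAmount = c(110000, 200000, 180000, 380000, 300000)
) %>%
  mutate(Percentage = (SoldAmount - Tax) / Tax)

# Median and mean of Percentage for the 2014 entries only
stats_2014 <- toy %>%
  filter(Date == "2014") %>%
  summarize(median_percent = median(Percentage, na.rm = TRUE),
            mean_percent   = mean(Percentage, na.rm = TRUE))

# Base R equivalent for the median, using logical indexing
median_2014 <- median(toy$Percentage[toy$Date == "2014"], na.rm = TRUE)

print(stats_2014)
print(median_2014)
```

Date is compared against the string "2014" because the column is character (`<chr>`) in the output above. With the real data, replacing `toy` with `alldata_Percent` should give the same values as the 2014 row of the grouped table.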