Descriptive statistics in R
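I am trying to get descriptive statistics for my data. I have gone through a lot of suggestions, but I just want to know whether there is any package that can compute descriptive statistics for data in the format shown below.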
head(mydata)
   X  A1  A2  A3  M1  M2  M3  U1  U2  U3
1      A   A   A   M   M   M   U   U   U
2 X1 100 200 250 200 230 400 400 100 200
3 X2 600 300 400 300 550 750 800 900 540
4 X3 500 300 200 200 200 100 500 400 600
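The data has samples in columns and variables in rows. The first row holds the sample names and the second row holds the group (A, M, U). I would like descriptive statistics for each group, for example the mean, SD, etc. for group A (A1, A2, A3). Can anyone tell me how to do this? Most of the answers I have seen on descriptive statistics work on columns.
Please let me know if the question is unclear.
Thanks for your help.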
希格斯粒子
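@Phil's suggestion is exactly right.
One of the key principles you will learn from Hadley's book is the tidy data principle (very basically: variables in columns, individual observations in rows). If you want a quick introduction to tidy data, try this vignette.
There are several ways to fix and analyse the data, but here is an example using 'tidyverse' tools.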
# Load useful 'tidy data' packages
library(tidyverse)
# Make 'mydata'
mydata <- data_frame(X  = c('', 'X1', 'X2', 'X3'),
                     A1 = c('A', 100, 600, 500),
                     A2 = c('A', 200, 300, 300),
                     A3 = c('A', 250, 400, 200),
                     M1 = c('M', 200, 300, 200),
                     M2 = c('M', 230, 550, 200),
                     M3 = c('M', 400, 750, 100),
                     U1 = c('U', 400, 800, 500),
                     U2 = c('U', 100, 900, 400),
                     U3 = c('U', 200, 540, 600))
# View 'mydata'
mydata
#> # A tibble: 4 x 10
#>   X     A1    A2    A3    M1    M2    M3    U1    U2    U3
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ""    A     A     A     M     M     M     U     U     U
#> 2 X1    100   200   250   200   230   400   400   100   200
#> 3 X2    600   300   400   300   550   750   800   900   540
#> 4 X3    500   300   200   200   200   100   500   400   600
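Every column prints as <chr> because each one mixes a group label with numbers, and c() coerces mixed input to a single type. The short check below illustrates this with a hypothetical vector named a1 (not part of the original answer), and shows why a later as.numeric() step is needed.

```r
# c() coerces mixed input to one common type, here character
a1 <- c('A', 100, 600, 500)
class(a1)
#> [1] "character"

# Dropping the group label lets the remaining values convert back to numbers
a1_values <- as.numeric(a1[-1])
a1_values
#> [1] 100 600 500
```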
Convert to a tidy data frame
# Transpose rows and columns and convert resulting matrix back into a dataframe
mydata_new <- as_data_frame(t(mydata))
# View 'mydata_new'
mydata_new
#> # A tibble: 10 x 4
#>    V1    V2    V3    V4
#>    <chr> <chr> <chr> <chr>
#>  1 ""    X1    X2    X3
#>  2 A     100   600   500
#>  3 A     200   300   300
#>  4 A     250   400   200
#>  5 M     200   300   200
#>  6 M     230   550   200
#>  7 M     400   750   100
#>  8 U     400   800   500
#>  9 U     100   900   400
#> 10 U     200   540   600
# Clean 'mydata_new'
## Add column names
colnames(mydata_new) <- c('Group', 'X1', 'X2', 'X3')
## Remove first row
mydata_new <- mydata_new[-1, ]
# View cleaned 'mydata_new'
mydata_new
#> # A tibble: 9 x 4
#>   Group X1    X2    X3
#>   <chr> <chr> <chr> <chr>
#> 1 A     100   600   500
#> 2 A     200   300   300
#> 3 A     250   400   200
#> 4 M     200   300   200
#> 5 M     230   550   200
#> 6 M     400   750   100
#> 7 U     400   800   500
#> 8 U     100   900   400
#> 9 U     200   540   600
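The transpose-and-clean steps above discard the sample names (A1, A2, ...). As an alternative sketch, the same reshaping could be done with pivot_longer() and pivot_wider(), assuming tidyr 1.0.0 or later is installed. The names groups, Sample, Value and mydata_tidy are introduced here for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same raw layout as 'mydata': row 1 holds the group labels
mydata <- tibble(X  = c('', 'X1', 'X2', 'X3'),
                 A1 = c('A', 100, 600, 500),
                 A2 = c('A', 200, 300, 300),
                 A3 = c('A', 250, 400, 200),
                 M1 = c('M', 200, 300, 200),
                 M2 = c('M', 230, 550, 200),
                 M3 = c('M', 400, 750, 100),
                 U1 = c('U', 400, 800, 500),
                 U2 = c('U', 100, 900, 400),
                 U3 = c('U', 200, 540, 600))

# Lookup from sample name to group label, e.g. groups[['A1']] is 'A'
groups <- unlist(mydata[1, -1])

mydata_tidy <- mydata[-1, ] %>%
  # One row per variable/sample pair
  pivot_longer(-X, names_to = 'Sample', values_to = 'Value') %>%
  # Attach the group label and convert the values to numeric
  mutate(Group = unname(groups[Sample]),
         Value = as.numeric(Value)) %>%
  # One row per sample, one column per variable (X1, X2, X3)
  pivot_wider(names_from = X, values_from = Value)

mydata_tidy
```

This keeps a Sample column alongside Group, and the X columns are already numeric, so no separate conversion step would be needed before summarising.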
Now summarise the data.
# Summarise numeric data
mydata_new %>%
  # Convert all data columns from 'character' to 'numeric'
  mutate_at(vars(starts_with('X')),
            as.numeric) %>%
  # Group data by the grouping variable before summarising
  group_by(Group) %>%
  # Calculate MEAN and SD for each data column
  summarise_at(vars(starts_with('X')),
               funs(MEAN = mean, SD = sd))
#> # A tibble: 3 x 7
#>   Group X1_MEAN X2_MEAN X3_MEAN X1_SD X2_SD X3_SD
#>   <chr>   <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
#> 1 A        183.    433.    333.  76.4  153. 153.
#> 2 M        277.    533.    167. 108.   225.  57.7
#> 3 U        233.    747.    500  153.   186. 100
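The summary above uses summarise_at() with funs(); funs() has been deprecated since dplyr 0.8.0. A rough equivalent with across() is sketched below, assuming dplyr 1.0.0 or later. The data frame is rebuilt inline so the block runs on its own, and summary_tbl is a name introduced here for illustration.

```r
library(dplyr)
library(tibble)

# Cleaned data from the previous step, still stored as character
mydata_new <- tibble(
  Group = rep(c('A', 'M', 'U'), each = 3),
  X1 = c('100', '200', '250', '200', '230', '400', '400', '100', '200'),
  X2 = c('600', '300', '400', '300', '550', '750', '800', '900', '540'),
  X3 = c('500', '300', '200', '200', '200', '100', '500', '400', '600'))

summary_tbl <- mydata_new %>%
  # Convert all data columns from 'character' to 'numeric'
  mutate(across(starts_with('X'), as.numeric)) %>%
  # Group data by the grouping variable before summarising
  group_by(Group) %>%
  # A named list of functions yields columns named <column>_<function>
  summarise(across(starts_with('X'), list(MEAN = mean, SD = sd)))

summary_tbl
```

The values should match the table above, but across() orders the result by column (X1_MEAN, X1_SD, X2_MEAN, ...) rather than by statistic.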
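Update, 10 May 2018: following a query about adding the coefficient of variation.
The coefficient of variation is not a base R function, so create a user-defined function.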
# Define function: (cv = sd / mean)
coef_var <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}
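As a quick sanity check, the function can be applied by hand to the group A values of X1 (100, 200, 250). The vector name a_x1 is introduced here for illustration only.

```r
# Coefficient of variation: sd / mean, ignoring missing values
coef_var <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Group A values of X1
a_x1 <- c(100, 200, 250)

# sd is about 76.4 and the mean is about 183.3, so the ratio is about 0.417
cv_a_x1 <- coef_var(a_x1)
round(cv_a_x1, 3)
#> [1] 0.417

# Missing values are dropped from both the sd and the mean
cv_with_na <- coef_var(c(a_x1, NA))
```

The result agrees with the sd and mean reported for group A, X1 in the summary table (76.4 / 183.).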
Re-run the summary with the additional summary functions.
# Execute summary
mydata_new %>%
  # Convert all data columns from 'character' to 'numeric'
  mutate_at(vars(starts_with('X')),
            as.numeric) %>%
  # Group data by the grouping variable before summarising
  group_by(Group) %>%
  # Calculate summaries for each data column
  ## Call the summary functions with a dummy "." argument so that
  ## additional arguments can be added to the called functions
  ## (e.g., adding na.rm = TRUE to cope with missing data)
  ## See ?dplyr::funs for details
  summarise_at(vars(starts_with('X')),
               funs(MEAN = mean(., na.rm = TRUE),                   # Mean
                    SD = sd(., na.rm = TRUE),                       # SD
                    CV = coef_var,                                  # Coefficient of variation
                    # Add other summary stats as needed
                    MEDIAN = median(., na.rm = TRUE),               # Median
                    Q25 = quantile(., probs = 0.25, na.rm = TRUE),  # 25th percentile
                    Q75 = quantile(., probs = 0.75, na.rm = TRUE),  # 75th percentile
                    min = min(., na.rm = TRUE),                     # Minimum
                    max = max(., na.rm = TRUE)))                    # Maximum
#> # A tibble: 3 x 25
#>   Group X1_MEAN X2_MEAN X3_MEAN X1_SD X2_SD X3_SD X1_CV X2_CV X3_CV
#>   <chr>   <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A        183.    433.    333.  76.4  153. 153.  0.417 0.353 0.458
#> 2 M        277.    533.    167. 108.   225.  57.7 0.390 0.423 0.346
#> 3 U        233.    747.    500  153.   186. 100   0.655 0.249 0.2
#> # ... with 15 more variables: X1_MEDIAN <dbl>, X2_MEDIAN <dbl>,
#> #   X3_MEDIAN <dbl>, X1_Q25 <dbl>, X2_Q25 <dbl>, X3_Q25 <dbl>,
#> #   X1_Q75 <dbl>, X2_Q75 <dbl>, X3_Q75 <dbl>, X1_min <dbl>, X2_min <dbl>,
#> #   X3_min <dbl>, X1_max <dbl>, X2_max <dbl>, X3_max <dbl>
Created on 2018-05-10 by the reprex package (v0.2.0).
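With 25 columns, the wide summary is truncated when printed. One possible alternative, sketched here under the assumption that dplyr 1.0.0+ and tidyr 1.0.0+ are available, is to reshape to long form first and return one row per group and variable. The names Variable, Value and stats_long are introduced for illustration and do not come from the original answer.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Cleaned data from the earlier step, still stored as character
mydata_new <- tibble(
  Group = rep(c('A', 'M', 'U'), each = 3),
  X1 = c('100', '200', '250', '200', '230', '400', '400', '100', '200'),
  X2 = c('600', '300', '400', '300', '550', '750', '800', '900', '540'),
  X3 = c('500', '300', '200', '200', '200', '100', '500', '400', '600'))

stats_long <- mydata_new %>%
  mutate(across(starts_with('X'), as.numeric)) %>%
  # Stack X1, X2, X3 into a single Value column
  pivot_longer(starts_with('X'), names_to = 'Variable', values_to = 'Value') %>%
  group_by(Group, Variable) %>%
  summarise(MEAN   = mean(Value, na.rm = TRUE),
            SD     = sd(Value, na.rm = TRUE),
            CV     = SD / MEAN,  # reuses the two columns just computed
            MEDIAN = median(Value, na.rm = TRUE),
            Q25    = unname(quantile(Value, probs = 0.25, na.rm = TRUE)),
            Q75    = unname(quantile(Value, probs = 0.75, na.rm = TRUE)),
            MIN    = min(Value, na.rm = TRUE),
            MAX    = max(Value, na.rm = TRUE),
            .groups = 'drop')

stats_long
```

This gives 9 rows (3 groups by 3 variables) with one column per statistic, which tends to be easier to read and to filter than the wide layout.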