Calculate difference between two subgroups by column with NAs in R
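I am trying to calculate the absolute difference between two subgroups within a column that contains NAs in R. More specifically, I am working on a project in which I want to measure the degree of partisanship of legislative roll call votes, that is, how differently Republicans and Democrats vote on each roll call. The specific equation I am trying to compute with my data is:
Roll Call Partisanship = |Democratic Aye % - GOP Aye %|
My data are structured as follows: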
Legislator Party Vote1 Vote2 Vote3 Vote4 Vote5 Vote6 Vote7
Allen R yes no NA no yes yes no
Barber D NA no no yes no yes no
Cale D no NA yes yes yes no yes
Devin R no no no yes yes yes yes
Egan R yes yes yes NA no no no
Floyd R yes no yes no yes no yes
Here is the R code to create this table:
Legislator=c("Allen", "Barber", "Cale", "Devin", "Egan", "Floyd")
Party=c("R", "D", "D", "R", "R", "R")
vote1=c("yes", "NA", "no", "no", "yes", "yes")
vote2=c("no", "no", "NA", "no", "yes", "no")
vote3=c("NA", "no", "yes", "no", "yes", "yes")
vote4=c("no", "yes", "yes", "yes", "NA", "no")
vote5=c("yes", "no", "yes", "yes", "no", "yes")
vote6=c("yes", "yes", "no", "yes", "no", "no")
vote7=c("no", "no", "yes", "yes", "no", "yes")
rollcall=cbind(Legislator, Party, vote1, vote2, vote3, vote4, vote5, vote6, vote7)
Using the equation above, I would like to create a matrix that looks like this:
RollCall Partisanship
Vote1 0.75
Vote2 0.25
Vote3 0.17
Vote4 0.70
Vote5 0.25
Vote6 0.00
Vote7 0.00
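Does anyone have suggestions for how I could calculate these scores in R? In particular, I am having trouble with the NAs. I want a legislator who did not vote on a given roll call to be left out of that roll call's calculation only. However, if you use na.omit, the legislator is dropped from the calculations for all roll calls.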
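To make the na.omit problem concrete, here is a minimal sketch on a hypothetical four-row toy table (not the full data above; the names `votes`, `complete_only`, and `aye_share_vote1` are made up for illustration). It contrasts dropping whole rows with skipping missing values one column at a time via `na.rm = TRUE`.

```r
# Hypothetical toy data: two roll calls, each with one missing vote
votes <- data.frame(
  Party = c("R", "D", "D", "R"),
  vote1 = c("yes", NA, "no", "no"),
  vote2 = c("no", "no", NA, "yes"),
  stringsAsFactors = FALSE
)

# na.omit() removes every row that has an NA in ANY column,
# so legislators 2 and 3 disappear from both roll calls
complete_only <- na.omit(votes)
nrow(complete_only)   # 2

# Handling NAs per column keeps legislator 3 in the vote1 calculation:
# only the single missing vote1 entry is skipped
aye_share_vote1 <- mean(votes$vote1 == "yes", na.rm = TRUE)
aye_share_vote1       # 1 "yes" out of 3 recorded votes
```

The comparison `votes$vote1 == "yes"` yields NA wherever the vote is missing, and `mean(..., na.rm = TRUE)` then averages over the recorded votes only, which is the per-roll-call behavior asked for.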
Here is a data.table solution:
library(data.table)
# convert your matrix to a data.table
dt <- data.table(rollcall)
# replace "NA"'s by actual NA's
dt[dt == "NA"] <- NA
# get your data in long format and calculate summary statistics
dt_long <- melt(dt, id.vars = "Party", measure.vars = patterns("^vote"))
dt_long <- dt_long[!is.na(value),.(votes = sum(value=="yes") / .N), .(Party,variable)]
# spread the result to arrive at expected format
dcast(dt_long, variable ~ Party, value.var = "votes")[,.(Partisanship = abs(D - R)), "variable"]
# variable Partisanship
#1: vote1 0.7500000
#2: vote2 0.2500000
#3: vote3 0.1666667
#4: vote4 0.6666667
#5: vote5 0.2500000
#6: vote6 0.0000000
#7: vote7 0.0000000
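The `dt[dt == "NA"] <- NA` step matters because the question's setup code stores the missing votes as the character string "NA", which R does not treat as missing. A small base R sketch of the difference, using a hypothetical three-element vector `x`:

```r
# "NA" in quotes is an ordinary string, not a missing value
x <- c("yes", "NA", "no")
any_missing_before <- any(is.na(x))   # FALSE
share_before <- mean(x == "yes")      # the "NA" string counts as a non-yes vote

# Convert the placeholder string to a real NA
x[x == "NA"] <- NA
any_missing_after <- any(is.na(x))    # TRUE
share_after <- mean(x == "yes", na.rm = TRUE)  # missing vote is now skipped
```

Without the conversion, each "NA" string would silently enter the denominator as a "no" vote, so the Aye share drops from 1/2 to 1/3 in this toy case.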
Here is a dplyr solution (uglier than the one already posted, but I spent a lot of time on it, so I am posting it anyway):
# setting up the data
# **note that I've changed "NA" entries to NA **
Legislator <- c("Allen", "Barber", "Cale", "Devin", "Egan", "Floyd")
Party <- c("R", "D", "D", "R", "R", "R")
vote1 <- c("yes", NA, "no", "no", "yes", "yes")
vote2 <- c("no", "no", NA, "no", "yes", "no")
vote3 <- c(NA, "no", "yes", "no", "yes", "yes")
vote4 <- c("no", "yes", "yes", "yes", NA, "no")
vote5 <- c("yes", "no", "yes", "yes", "no", "yes")
vote6 <- c("yes", "yes", "no", "yes", "no", "no")
vote7 <- c("no", "no", "yes", "yes", "no", "yes")
rollcall <- as.data.frame(base::cbind(Legislator, Party, vote1, vote2, vote3, vote4, vote5, vote6, vote7))
# converting to long format
library(tidyr)
#> Warning: package 'tidyr' was built under R version 3.4.2
rollcall_long <- tidyr::gather(rollcall, vote, response, vote1:vote7, factor_key = TRUE)
# compute frequency table
library(dplyr)
vote_frequency <- rollcall_long %>%
  dplyr::filter(!is.na(response)) %>%            # remove NAs
  dplyr::group_by(Party, vote, response) %>%     # compute frequency by these grouping variables
  dplyr::summarize(counts = n()) %>%             # get the count of each response
  dplyr::mutate(perc = counts / sum(counts)) %>% # compute its percentage
  dplyr::arrange(vote, response, Party) %>%      # arrange it properly
  dplyr::filter(response == "yes") %>%           # select only yes responses ("Ayes")
  dplyr::select(-counts, -response)              # remove counts and response variables
# compute Partisanship score
Partisanship_df <- tidyr::spread(vote_frequency, Party, perc)
Partisanship_df[is.na(Partisanship_df)] <- 0 # replacing NA with 0 because NA here represents that not a single "yes" was found
Partisanship_df$Partisanship <- abs(Partisanship_df$D - Partisanship_df$R)
# removing unnecessary columns
Partisanship_df %>% dplyr::select(-c(R, D))
#> # A tibble: 7 x 2
#> # Groups: vote [7]
#> vote Partisanship
#> * <fct> <dbl>
#> 1 vote1 0.750
#> 2 vote2 0.250
#> 3 vote3 0.167
#> 4 vote4 0.667
#> 5 vote5 0.250
#> 6 vote6 0
#> 7 vote7 0
Created on 2018-01-20 by the reprex package (v0.1.1.9000).
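For comparison, the same scores can probably be obtained in base R without reshaping, by applying the formula column by column. This is only a sketch under the assumption that missing votes are stored as real NA values (as in the dplyr setup above); `aye_share` and `partisanship` are hypothetical names introduced here, not part of either answer.

```r
# Data re-entered with real NA values (assumption: not the "NA" strings)
Party <- c("R", "D", "D", "R", "R", "R")
votes <- data.frame(
  vote1 = c("yes", NA, "no", "no", "yes", "yes"),
  vote2 = c("no", "no", NA, "no", "yes", "no"),
  vote3 = c(NA, "no", "yes", "no", "yes", "yes"),
  vote4 = c("no", "yes", "yes", "yes", NA, "no"),
  vote5 = c("yes", "no", "yes", "yes", "no", "yes"),
  vote6 = c("yes", "yes", "no", "yes", "no", "no"),
  vote7 = c("no", "no", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Share of "yes" among recorded votes; NAs are skipped for this column only
aye_share <- function(x) mean(x == "yes", na.rm = TRUE)

# |Democratic Aye % - GOP Aye %| for each roll call
partisanship <- sapply(votes, function(col) {
  abs(aye_share(col[Party == "D"]) - aye_share(col[Party == "R"]))
})

round(partisanship, 3)
```

One caveat: if every member of a party is missing on a roll call, `mean()` returns NaN for that party, so the score for that roll call is NaN rather than a number.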