How to calculate a row-wise count of duplicates based on (element-wise) selected adjacent columns
I have a data frame test:
group userID A_conf A_chall B_conf B_chall
1 220 1 1 1 2
1 222 4 6 4 4
2 223 6 5 3 2
1 224 1 5 4 4
2 228 4 4 4 4
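The data contains the responses of each user (identified by userID), where each user can enter any value between 1 and 6 for two measures:
- Conf
- Chall
They can also choose not to respond, which results in NA entries.
The test data frame contains multiple column groups such as A, B, C, D and so on. The Conf and Chall measures can be reported separately for each of these groups.
I am interested in making the following comparisons:
- A_conf & A_chall
- B_conf & B_chall
For each pair whose two values are equal, the Final counter should be incremented by one (as shown below).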
group userID A_conf A_chall B_conf B_chall Final
1 220 1 1 1 2 1
1 222 4 6 4 4 1
2 223 6 5 3 2 0
1 224 1 5 4 4 1
2 228 4 4 4 4 2
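I am struggling with the Final counter. What script would help me achieve this?
For reference, the dput output of the test data frame is shared below:
dput(test):
structure(list(group = c(1L, 1L, 2L, 1L, 2L),
userID = c(220L, 222L, 223L, 224L, 228L),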
A_conf = c(1L, 4L, 6L, 1L, 4L),
A_chall = c(1L, 6L, 5L, 5L, 4L),
B_conf = c(1L, 4L, 3L, 4L, 4L),
B_chall = c(2L, 4L, 2L, 4L, 4L)),
class = "data.frame", row.names = c(NA, -5L))
I tried code like this:
test$Final = as.integer(0) # add a column to keep counts
count_inc = as.integer(0) # counter variable to increment in steps of 1
for (i in 1:nrow(test)) {
count_inc = 0
if(!is.na(test$A_conf[i] == test$A_chall[i]))
{
count_inc = 1
test$Final[i] = count_inc
}#if
else if(!is.na(test$A_conf[i] != test$A_chall[i]))
{
count_inc = 0
test$Final[i] = count_inc
}#else if
}#for
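The above code only handles the A_conf and A_chall columns. The problem is that it fills the Final column with all 1s, regardless of whether the values entered by the user are equal.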
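A likely reason for the all-1 result: `!is.na(test$A_conf[i] == test$A_chall[i])` checks whether the comparison is non-missing, not whether it is TRUE, so the first branch runs for every row where both values are present. Below is a minimal sketch of a corrected loop, assuming the data frame from the dput output above. The `pairs` list and the use of `isTRUE()` are illustrative choices, not part of the original post.

```r
# Data from the question's dput output
test <- data.frame(
  group   = c(1L, 1L, 2L, 1L, 2L),
  userID  = c(220L, 222L, 223L, 224L, 228L),
  A_conf  = c(1L, 4L, 6L, 1L, 4L),
  A_chall = c(1L, 6L, 5L, 5L, 4L),
  B_conf  = c(1L, 4L, 3L, 4L, 4L),
  B_chall = c(2L, 4L, 2L, 4L, 4L)
)

# Hypothetical helper list: each element names one conf/chall pair to compare
pairs <- list(c("A_conf", "A_chall"), c("B_conf", "B_chall"))

test$Final <- 0L
for (i in seq_len(nrow(test))) {
  for (p in pairs) {
    # isTRUE() is FALSE for both FALSE and NA, so missing responses are not counted
    if (isTRUE(test[[p[1]]][i] == test[[p[2]]][i])) {
      test$Final[i] <- test$Final[i] + 1L
    }
  }
}

test$Final
# 1 1 0 1 2
```

The loop is kept only to stay close to the original attempt; a vectorized comparison avoids it entirely.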
Using tidyverse, you can do:
library(tidyverse)
# df is assumed to be the question's data frame with the expected Final column included
df %>%
  select(-Final) %>%
  rowid_to_column() %>% # Create a unique row ID
  gather(var, val, -c(group, userID, rowid)) %>% # Reshape the data to long format
  arrange(rowid, var) %>% # Arrange by row ID and by variable
  group_by(rowid) %>% # Group by row ID
  mutate(temp = gl(n()/2, 2)) %>% # Create a grouping variable for each "_chall"/"_conf" pair
  group_by(rowid, temp) %>% # Group by row ID and the new grouping variable
  mutate(res = ifelse(val == lag(val), 1, 0)) %>% # Flag pairs where "_chall" and "_conf" have the same value
  group_by(rowid) %>% # Group by row ID
  mutate(res = sum(res, na.rm = TRUE)) %>% # Count the pairs with equal "_chall" and "_conf"
  select(-temp) %>%
  spread(var, val) %>% # Return the data to its original wide form
  ungroup() %>%
  select(-rowid)
group userID res A_chall A_conf B_chall B_conf
<int> <int> <dbl> <int> <int> <int> <int>
1 1 220 1. 1 1 2 1
2 1 222 1. 6 4 4 4
3 2 223 0. 5 6 2 3
4 1 224 1. 5 1 4 4
5 2 228 2. 4 4 4 4
You can also try this tidyverse approach. It takes fewer lines than the other answer ;)
library(tidyverse)
# d is the question's data frame (called test there)
d %>%
  as_tibble() %>%
  gather(k, v, -group, -userID) %>%
  separate(k, into = c("letters", "test")) %>%
  spread(test, v) %>%
  group_by(userID) %>%
  mutate(final = sum(chall == conf)) %>%
  distinct(userID, final) %>%
  ungroup() %>%
  right_join(d)
# A tibble: 5 x 7
userID final group A_conf A_chall B_conf B_chall
<int> <int> <int> <int> <int> <int> <int>
1 220 1 1 1 1 1 2
2 222 1 1 4 6 4 4
3 223 0 2 6 5 3 2
4 224 1 1 1 5 4 4
5 228 2 2 4 4 4 4
A base R solution, assuming you have the same number of "conf" and "chall" columns and that they appear in the same order:
#Find indexes of "conf" column
conf_col <- grep("conf", names(test))
#Find indexes of "chall" column
chall_col <- grep("chall", names(test))
#compare element wise and take row wise sum
test$Final <- rowSums(test[conf_col] == test[chall_col])
test
# group userID A_conf A_chall B_conf B_chall Final
#1 1 220 1 1 1 2 1
#2 1 222 4 6 4 4 1
#3 2 223 6 5 3 2 0
#4 1 224 1 5 4 4 1
#5 2 228 4 4 4 4 2
It can also be done in a single line:
rowSums(test[grep("conf", names(test))] == test[grep("chall", names(test))])
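The question mentions that users may skip a response, which produces NA entries. With the `rowSums()` call above, any row containing an NA comparison would itself become NA. The sketch below is one way to handle that, assuming a missing response should simply not count as a match. It also derives each "_chall" name from its "_conf" name so the pairs line up by prefix rather than by column order. The NA inserted into `B_chall` is made up for illustration and is not in the original data.

```r
# Data from the question's dput output
test <- data.frame(
  group   = c(1L, 1L, 2L, 1L, 2L),
  userID  = c(220L, 222L, 223L, 224L, 228L),
  A_conf  = c(1L, 4L, 6L, 1L, 4L),
  A_chall = c(1L, 6L, 5L, 5L, 4L),
  B_conf  = c(1L, 4L, 3L, 4L, 4L),
  B_chall = c(2L, 4L, 2L, 4L, 4L)
)
test$B_chall[2] <- NA  # assumed missing response, for illustration only

# Names of all "_conf" columns
conf_names <- grep("_conf$", names(test), value = TRUE)
# Build the matching "_chall" name for each "_conf" column
chall_names <- sub("_conf$", "_chall", conf_names)

# na.rm = TRUE drops NA comparisons, so they do not add to the count
test$Final <- rowSums(test[conf_names] == test[chall_names], na.rm = TRUE)
test$Final
```

With the NA in place, the second row counts 0 matches instead of returning NA. Whether a missing response should count as "no match" or leave Final as NA depends on the analysis, so treat `na.rm = TRUE` as a choice rather than a default.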