How to compare strings from the first row to every other row in a data frame and count the number of mismatches in R?
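I have a data frame with thousands of rows and several columns. For each character variable, I need to count the changes between the first row and every other row (row1–row2, row1–row3, row1–row4, ...) and write the total number of changes to a new column.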
library(dplyr)  # re-exports tibble(), which replaces the deprecated data_frame()

df <- tibble(
  a = c("1 2", "1 2", "2 2", "2 2"),
  b = c("2 1", "1 2", "1 2", "1 2"),
  c = c("1 1", "1 2", "2 1", "2 2"),
  d = c("1 1", "1 1", "2 1", "2 1")
)
df
  a     b     c     d
  <chr> <chr> <chr> <chr>
1 1 2   2 1   1 1   1 1
2 1 2   1 2   1 2   1 1
3 2 2   1 2   2 1   2 1
4 2 2   1 2   2 2   2 1
I want to count the character mismatches in each element between row 1 and row 2, row 1 and row 3, and so on. This is the result I am after:
  a     b     c     d     e
1 1 2   2 1   1 1   1 1   NA  # No mismatches to count since this is the first row.
2 1 2   1 2   1 2   1 1   3
3 2 2   1 2   2 1   2 1   5
4 2 2   1 2   2 2   2 1   6
Any ideas on how to accomplish this?
One dplyr and purrr approach could be:
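library(dplyr)
library(purrr)
bind_cols(df, df %>%
  # split each cell into tokens, then count tokens that differ from row 1
  mutate_all(~ strsplit(., " ", fixed = TRUE)) %>%
  mutate_all(~ map2_int(.x = ., .y = .[1], ~ sum(.x != .y))) %>%
  # sum the per-column counts into the new column e
  transmute(e = rowSums(select(., everything()))))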
  a     b     c     d         e
  <chr> <chr> <chr> <chr> <dbl>
1 1 2   2 1   1 1   1 1       0
2 1 2   1 2   1 2   1 1       3
3 2 2   1 2   2 1   2 1       5
4 2 2   1 2   2 2   2 1       6
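For readers who want to see the split-and-compare logic without the tidyverse, here is a minimal base R sketch. The helper name count_token_mismatches is hypothetical, and the sketch assumes every cell in a column holds the same number of space-separated tokens as the first row. It also sets the first row to NA, as the question requested, which the answers here do not do.

```r
# Sample data from the question, as a plain data.frame.
df <- data.frame(
  a = c("1 2", "1 2", "2 2", "2 2"),
  b = c("2 1", "1 2", "1 2", "1 2"),
  c = c("1 1", "1 2", "2 1", "2 2"),
  d = c("1 1", "1 1", "2 1", "2 1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one column, split every cell on spaces and
# count the tokens that differ from the tokens of the first cell.
# Assumes all cells have as many tokens as the first cell.
count_token_mismatches <- function(col) {
  tokens <- strsplit(col, " ", fixed = TRUE)
  reference <- tokens[[1]]
  vapply(tokens, function(x) sum(x != reference), integer(1))
}

# One column of counts per original column (rows of df stay rows),
# then sum across columns to get the total per row.
per_column <- vapply(df, count_token_mismatches, integer(nrow(df)))
df$e <- as.integer(rowSums(per_column))

# The question asks for NA in the reference row instead of 0.
df$e[1] <- NA_integer_

print(df)
```

If a 0 in the first row is acceptable, the last assignment can simply be dropped.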
Or using only dplyr:
bind_cols(df, df %>%
  # adist() (from utils) reports insertion/deletion/substitution counts against the first row
  mutate_all(~ rowSums(drop(attr(adist(., first(.), counts = TRUE), "counts")))) %>%
  transmute(e = rowSums(select(., everything()))))
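One caveat worth illustrating: adist() computes a generalized Levenshtein (edit) distance, which matches a position-by-position mismatch count on the sample data but not for every pair of strings. The sketch below, using a hypothetical helper positional_mismatches and made-up strings "abc" and "bca", shows a case where the two measures differ, and confirms that they agree on column b from the question.

```r
# Edit distance via utils::adist() versus a position-wise comparison.
x <- "abc"
y <- "bca"

# Levenshtein distance: delete the leading "a", append "a" at the end.
edit_distance <- drop(adist(x, y))  # 2

# Hypothetical helper: compare two equal-length strings character by character.
positional_mismatches <- function(s1, s2) {
  sum(strsplit(s1, "")[[1]] != strsplit(s2, "")[[1]])
}
mismatch_count <- positional_mismatches(x, y)  # 3

# On column b from the question, both measures give the same counts.
b <- c("2 1", "1 2", "1 2", "1 2")
b_edit <- drop(adist(b, b[1]))
b_positional <- vapply(b, positional_mismatches, integer(1),
                       s2 = b[1], USE.NAMES = FALSE)

print(rbind(edit = b_edit, positional = b_positional))
```

Which measure is appropriate depends on whether a shifted value should count as one change or several; the question's expected output is consistent with both.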
You can also do this:
library(dplyr)
library(purrr)
df %>%
  mutate(e = pmap(., ~ toString(c(...)) %>% charToRaw),
         e = map_dbl(e, ~ sum(.x != e[[1]])))
# A tibble: 4 x 5
  a     b     c     d         e
  <chr> <chr> <chr> <chr> <dbl>
1 1 2   2 1   1 1   1 1       0
2 1 2   1 2   1 2   1 1       3
3 2 2   1 2   2 1   2 1       5
4 2 2   1 2   2 2   2 1       6
You can combine a base R matrix with the stringdist package for a simpler and more flexible solution (i.e., if your data contains more complex strings):
library(stringdist)
m <- t(df)
df$e <- colSums(matrix(stringdist(m[,1], m), ncol(df)))
Output:
  a     b     c     d         e
  <chr> <chr> <chr> <chr> <dbl>
1 1 2   2 1   1 1   1 1       0
2 1 2   1 2   1 2   1 1       3
3 2 2   1 2   2 1   2 1       5
4 2 2   1 2   2 2   2 1       6