Set Operations on Character Data (string of integers)
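Does anyone know how to devise a fast way to compute the relative overlap of two columns? I want to know how many elements of 'a' are in set 'b'. Ideally this would generate a column 'c' that stores the comparison value for each row. Really stuck on this one.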
b <- c("20", "1, 8, 19, 20, 22, 23, 28, 34, 41",
"3, 8, 10, 11, 18, 20, 26, 37",
"1, 3, 6, 18, 21, 35", "NA", "1, 21, 33", "14, 37",
"4, 14, 18, 23, 33, 37, 40", "14",
"4, 14, 20, 23, 33, 37, 40",
"2, 3, 5, 7, 8, 10, 14, 16, 18, 23, 25, 34, 40",
"6, 8, 10, 14, 19, 29, 33, 35, 36, 39, 41",
"1, 20", "1, 28, 36", "14",
"1, 6, 33, 12, 39", "28",
"1, 6, 11, 13, 18, 19, 21, 28, 33, 35, 36, 39",
"35, 40", "20", "20, 38", "6, 8, 19, 22, 29, 32, 33, 34, 40",
"1, 10, 21, 25, 33, 35, 36, 39, 40", "36")
a <- c("14", "10", "8, 39", "26, 39", "14, 20", "33, 36", "14",
"NA", "8, 39", "33, 36", "8, 39", "1, 36", "10", "28, 33",
"14, 20", "33, 40", "28, 34", "1, 36",
"8, 39", "20", "14, 20", "29, 33", "36", "14")
df <- data.frame(a, b)
df$a <- as.character(df$a)
df$b <- as.character(df$b)
This works for row 18, but it is not easy to extend with sapply or an equivalent function.
length(intersect(as.numeric(unlist(strsplit(df$a[18], ", "))),
as.numeric(unlist(strsplit(df$b[18], ", "))))) /
length(as.numeric(unlist(strsplit(df$b[18], ", "))))
# gives
[1] 0.1666667
length(intersect(as.numeric(unlist(strsplit(df$a[5], ", "))),
as.numeric(unlist(strsplit(df$b[5], ", "))))) /
length(as.numeric(unlist(strsplit(df$b[5], ", "))))
# gives
[1] 0
Warning messages:
1: In intersect(as.numeric(unlist(strsplit(df$a[5], ", "))), as.numeric(unlist(strsplit(df$b[5], :
NAs introduced by coercion
2: NAs introduced by coercion
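I don't see why you need the conversion with as.numeric. That is what gives you the warning. "NA" in your data frame is stored as a character value, and it is a character value that cannot be converted to a number.
Note that a warning is not an error, so your code actually works for row 5 as well (unless you expected NA there).
I would do the following: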
getCounts <- function(x,y){
x <- strsplit(x,", ")[[1]]
y <- strsplit(y,", ")[[1]]
mean(y %in% x)
}
# gives
> getCounts(df$a[5],df$b[5])
[1] 0
This is basically what you did, but written a bit more clearly, and using mean(.. %in% ..) instead of length(intersect(.., ..))/... .
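As a quick check of that claim, the sketch below compares the two formulations on the strings from row 18 of the question. The variable names by_mean and by_intersect are made up for illustration. The two agree here because the elements of y are unique; if y contained repeated values, mean(y %in% x) would count each repeat while intersect() would drop them, so the results could differ.

```r
# Row 18 of the question's data, split into character vectors.
x <- strsplit("1, 36", ", ")[[1]]
y <- strsplit("1, 6, 11, 13, 18, 19, 21, 28, 33, 35, 36, 39", ", ")[[1]]

# Share of the elements of y that also occur in x, computed two ways.
by_mean      <- mean(y %in% x)
by_intersect <- length(intersect(x, y)) / length(y)

print(by_mean)                              # 2 of the 12 elements match
print(isTRUE(all.equal(by_mean, by_intersect)))
```

No as.numeric() is needed, since %in% and intersect() compare character values directly.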
To do this for both vectors a and b, you can use mapply:
out <- mapply(getCounts,df$a, df$b)
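To get the column 'c' the question asks for, the result can be assigned straight back to the data frame. The sketch below uses a small stand-in data frame named toy (hypothetical values, not the question's data) so it runs on its own. Marking rows that contain the string "NA" as missing is an assumption about what the asker wants, not part of the original answer.

```r
# Same helper as above: share of the elements of y that are found in x.
getCounts <- function(x, y) {
  x <- strsplit(x, ", ")[[1]]
  y <- strsplit(y, ", ")[[1]]
  mean(y %in% x)
}

# Hypothetical stand-in for df, with "NA" stored as a string as in the question.
toy <- data.frame(a = c("1, 36", "14, 20", "NA"),
                  b = c("1, 6, 36, 40", "NA", "4, 14"),
                  stringsAsFactors = FALSE)

# USE.NAMES = FALSE drops the names mapply would otherwise take from toy$a.
toy$c <- mapply(getCounts, toy$a, toy$b, USE.NAMES = FALSE)

# Optional (assumption): report NA rather than 0 when either cell is "NA".
toy$c[toy$a == "NA" | toy$b == "NA"] <- NA

print(toy)
```

With the question's data the same two lines would read df$c <- mapply(getCounts, df$a, df$b, USE.NAMES = FALSE), followed by the optional NA step if wanted.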