Euclidean distance calculation from multiple tables with categorical variables
I have two data frames that look like this:
df1 <- data.frame(geneID=c("gene1","gene2","gene3","gene4",
"gene5","gene6","gene7","gene8","gene9","gene10"),
patient_ID=c(700,0,3,387,30724,1,609,4,0,1729))
head(df1)
geneID patient_ID
1 gene1 700
2 gene2 0
3 gene3 3
4 gene4 387
5 gene5 30724
6 gene6 1
df2 <- data.frame(component1=c("gene1","gene2","gene3","gene4","gene5"),
component2=c("gene2","gene4","gene5","gene10","gene9"))
head(df2)
component1 component2
1 gene1 gene2
2 gene2 gene4
3 gene3 gene5
4 gene4 gene10
5 gene5 gene9
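I want to generate a data frame that takes the gene values from df1 and computes the Euclidean distance between component1 and component2 of each row in df2. For example, for the pair gene3 and gene5, the output in df3 should be computed with the following equation:
val = sqrt((gene3)^2+(gene5)^2) = sqrt(3^2+30724^2)
My end goal is a table like this: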
gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
1 gene1 0 0 0 0 0 0 0 0 0 0
2 gene2 val 0 0 0 0 0 0 0 0 0
3 gene3 0 0 0 0 0 0 0 0 0 0
4 gene4 0 val 0 0 0 0 0 0 0 val
5 gene5 0 0 val 0 0 0 0 0 val 0
6 gene6 0 0 0 0 0 0 0 0 0 0
7 gene7 0 0 0 0 0 0 0 0 0 0
8 gene8 0 0 0 0 0 0 0 0 0 0
9 gene9 0 0 0 0 val 0 0 0 0 0
10 gene10 0 0 0 val 0 0 0 0 0 0
Any help and suggestions are greatly appreciated.
Thanks!
Olha
Try this.
library(dplyr)
library(tidyr) # pivot_wider
left_join(df2, select(df1, geneID, x = patient_ID), by = c("component1" = "geneID")) %>%
left_join(select(df1, geneID, y = patient_ID), by = c("component2" = "geneID")) %>%
mutate(val = sqrt(x^2 + y^2)) %>%
complete(component1, component2) %>%
pivot_wider(id_cols = component1, names_from = component2, values_from = val)
# # A tibble: 5 x 6
# component1 gene10 gene2 gene4 gene5 gene9
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 gene1 NA 700 NA NA NA
# 2 gene2 NA NA 387 NA NA
# 3 gene3 NA NA NA 30724. NA
# 4 gene4 1772. NA NA NA NA
# 5 gene5 NA NA NA NA 30724
I didn't try to expand this to cover genes 1-10 on both axes, since your df2 suggests specific pairings. You can perhaps use tidyr::complete or tidyr::expand to get the full expansion.
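If the full 10 x 10 table from the question is what you need, one possible way to get it is sketched below. Note that this is a base R alternative using matrix indexing, not the tidyr::complete / tidyr::expand route mentioned above. It assumes that pairs not listed in df2 should be 0 and that each pair should be filled in both directions (row/column and column/row); the question's target table is not fully consistent on that point, so drop the last assignment if you only want one direction. The names value, val and full are made up for this illustration.

```r
df1 <- data.frame(geneID = paste0("gene", 1:10),
                  patient_ID = c(700, 0, 3, 387, 30724, 1, 609, 4, 0, 1729))
df2 <- data.frame(component1 = c("gene1", "gene2", "gene3", "gene4", "gene5"),
                  component2 = c("gene2", "gene4", "gene5", "gene10", "gene9"))

genes <- as.character(df1$geneID)

# Named lookup vector: gene name -> value from df1
value <- setNames(df1$patient_ID, genes)

# as.character() guards against factor columns in R versions before 4.0
c1 <- as.character(df2$component1)
c2 <- as.character(df2$component2)

# One value per pair listed in df2, same formula as in the question
val <- unname(sqrt(value[c1]^2 + value[c2]^2))

# Start from an all-zero gene x gene matrix
full <- matrix(0, nrow = length(genes), ncol = length(genes),
               dimnames = list(genes, genes))

# A two-column character matrix indexes cells by (row name, column name)
full[cbind(c1, c2)] <- val
# Assumption: mirror each pair so the table is symmetric
full[cbind(c2, c1)] <- val

full
```

If you need a data frame rather than a matrix, as.data.frame(full) keeps the gene names as row and column names.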