RStudio: Is there a way to calculate the cosine and Euclidean distance between two time series, with a single variable and with multiple variables of interest?
Suppose I have time series data for City A, City B, City C and City D, like this:
+------------+--------+--------+--------+--------+
| Dates | City A | City B | City C | City D |
+------------+--------+--------+--------+--------+
| 2020-01-01 | 10 | 20 | 20 | 30 |
+------------+--------+--------+--------+--------+
| 2020-01-02 | 20 | 30 | 30 | 40 |
+------------+--------+--------+--------+--------+
| 2020-01-03 | 30 | 40 | 20 | 20 |
+------------+--------+--------+--------+--------+
| 2020-01-04 | 40 | 20 | 15 | 40 |
+------------+--------+--------+--------+--------+
| 2020-01-05 | 50 | 40 | 18 | 10 |
+------------+--------+--------+--------+--------+
| 2020-01-06 | 60 | 50 | 20 | 15 |
+------------+--------+--------+--------+--------+
| 2020-01-07 | 70 | 60 | 40 | 72 |
+------------+--------+--------+--------+--------+
| 2020-01-08 | 50 | 80 | 60 | 90 |
+------------+--------+--------+--------+--------+
| 2020-01-09 | 30 | 30 | 90 | 17 |
+------------+--------+--------+--------+--------+
| 2020-01-10 | 60 | 50 | 18 | 15 |
+------------+--------+--------+--------+--------+
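I would like to calculate the cosine and Euclidean distances between A and B, A and C, and A and D, aligning the series on the time index.
For example, to calculate the Euclidean distance between City A and City B, I would calculate the Euclidean distance between their 2020-01-01 values, their 2020-01-02 values, their 2020-01-03 values, and so on, then add all of these together to get the final Euclidean distance between City A and City B.
What is an elegant way to write an R function that performs this task?
Then, suppose my data starts to include more variables: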
+------------+-------+------+------+------+
| Dates | City | Var1 | Var2 | Var3 |
+------------+-------+------+------+------+
| 2020-01-01 | A | 20 | 200 | 5 |
+------------+-------+------+------+------+
| 2020-01-02 | A | 30 | 300 | 3 |
+------------+-------+------+------+------+
| 2020-01-03 | A | 40 | 220 | 4 |
+------------+-------+------+------+------+
| 2020-01-04 | A | 20 | 150 | 2 |
+------------+-------+------+------+------+
| 2020-01-05 | A | 40 | 180 | 5 |
+------------+-------+------+------+------+
| 2020-01-01 | B | 50 | 200 | 6 |
+------------+-------+------+------+------+
| 2020-01-02 | B | 60 | 400 | 7 |
+------------+-------+------+------+------+
| 2020-01-03 | B | 80 | 600 | 8 |
+------------+-------+------+------+------+
| 2020-01-04 | B | 30 | 900 | 4 |
+------------+-------+------+------+------+
| 2020-01-05 | B | 50 | 180 | 2 |
+------------+-------+------+------+------+
| 2020-01-01 | C | 20 | 230 | 3 |
+------------+-------+------+------+------+
| 2020-01-02 | C | 30 | 340 | 5 |
+------------+-------+------+------+------+
| 2020-01-03 | C | 40 | 230 | 3 |
+------------+-------+------+------+------+
| 2020-01-04 | C | 20 | 120 | 5 |
+------------+-------+------+------+------+
| 2020-01-05 | C | 40 | 120 | 4 |
+------------+-------+------+------+------+
| 2020-01-01 | D | 20 | 400 | 5 |
+------------+-------+------+------+------+
| 2020-01-02 | D | 30 | 500 | 6 |
+------------+-------+------+------+------+
| 2020-01-03 | D | 10 | 600 | 7 |
+------------+-------+------+------+------+
| 2020-01-04 | D | 50 | 300 | 7 |
+------------+-------+------+------+------+
| 2020-01-05 | D | 20 | 300 | 4 |
+------------+-------+------+------+------+
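Using the same example as above, to calculate the Euclidean distance between City A and City B, I would calculate the Euclidean distance between their 2020-01-01, 2020-01-02, 2020-01-03, ... values for Var1, then repeat the process for Var2 and Var3. Finally, I would add all of these together to get the total Euclidean distance between City A and City B.
I am wondering whether such a distance calculation is technically feasible and, if so, how I can write an R function that does this for both Euclidean and cosine distance, for a single variable of interest and for multiple variables of interest.
Many thanks for your help!
I edited the post to include the cosine distance. First, let's make the first dataset above.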
dat <- tibble::tribble(~Dates, ~`City A`, ~`City B`, ~`City C`, ~`City D`,
"2020-01-01" , 10 , 20 , 20 , 30,
"2020-01-02" , 20 , 30 , 30 , 40,
"2020-01-03" , 30 , 40 , 20 , 20,
"2020-01-04" , 40 , 20 , 15 , 40,
"2020-01-05" , 50 , 40 , 18 , 10,
"2020-01-06" , 60 , 50 , 20 , 15,
"2020-01-07" , 70 , 60 , 40 , 72,
"2020-01-08" , 50 , 80 , 60 , 90,
"2020-01-09" , 30 , 30 , 90 , 17,
"2020-01-10" , 60 , 50 , 18 , 15)
dat$Dates <- lubridate::ymd(dat$Dates)
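Then we can rearrange the data so that each city is a row of a matrix, and define the function that computes the distance. Because we are going to use it with outer(), its two arguments are the indices of two different rows of the X matrix.
library(dplyr)  # provides %>% and select()
# Drop the date column and transpose: one row per city, one column per date.
X <- dat %>% select(-Dates) %>% as.matrix %>% t
# Per-date distances summed; sqrt(d^2) is abs(d), so this is a sum of absolute differences.
edfun <- function(x,y){
  sum(sqrt((X[x, ] - X[y,])^2))
}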
Now we can compute the distances and print them:
o1 <- outer(1:nrow(X), 1:nrow(X), Vectorize(edfun))
rownames(o1) <- colnames(o1) <- rownames(X)
o1
# City A City B City C City D
# City A 0 120 269 235
# City B 120 0 209 195
# City C 269 209 0 196
# City D 235 195 196 0
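With a single variable, the Euclidean distance at each date is just the absolute difference, so the sum above coincides with the Manhattan (L1) distance between the two series. As a rough sketch, base R's dist() should reproduce the same matrix, and it also gives the conventional Euclidean (L2) distance over the whole series for comparison. The matrix X is rebuilt by hand here only so that the block runs on its own; the names d_sum and d_euc are mine, not part of the answer above.

```r
# Rebuild the city-by-date matrix from the first dataset (cities in rows).
X <- rbind(
  "City A" = c(10, 20, 30, 40, 50, 60, 70, 50, 30, 60),
  "City B" = c(20, 30, 40, 20, 40, 50, 60, 80, 30, 50),
  "City C" = c(20, 30, 20, 15, 18, 20, 40, 60, 90, 18),
  "City D" = c(30, 40, 20, 40, 10, 15, 72, 90, 17, 15)
)

# Sum of per-date distances: the same thing as the Manhattan (L1) distance.
d_sum <- as.matrix(dist(X, method = "manhattan"))

# Conventional Euclidean (L2) distance: sqrt of the summed squared differences.
d_euc <- as.matrix(dist(X, method = "euclidean"))

d_sum
round(d_euc, 2)
```

Which of the two you want depends on whether "add the per-date distances together" or "one distance over the whole series" is the intended definition.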
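Now we can write a cosine function and compute it for every pair of cities. Note that the function below returns the cosine similarity (1 for identical directions); the cosine distance is 1 minus this value.
cdfun <- function(x,y){
  num <- sum(X[x,]*X[y, ])     # dot product of the two rows
  d1 <- sqrt(sum(X[x, ]^2))    # norm of row x
  d2 <- sqrt(sum(X[y, ]^2))    # norm of row y
  num/(d1*d2)
}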
o1a <- outer(1:nrow(X), 1:nrow(X), Vectorize(cdfun))
rownames(o1a) <- colnames(o1a) <- rownames(X)
o1a
# City A City B City C City D
# City A 1.0000000 0.9521640 0.7400186 0.7913705
# City B 0.9521640 1.0000000 0.8109673 0.8805258
# City C 0.7400186 0.8109673 1.0000000 0.7674460
# City D 0.7913705 0.8805258 0.7674460 1.0000000
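The values above have 1 on the diagonal, so they are cosine similarities. If a distance is what you need, one common convention is 1 minus the similarity. Here is a minimal sketch that computes all pairs at once with matrix algebra instead of outer() and Vectorize(); the names cos_sim and cos_dist are mine, and X is rebuilt by hand so the block is self-contained.

```r
# City-by-date matrix from the first dataset (cities in rows).
X <- rbind(
  "City A" = c(10, 20, 30, 40, 50, 60, 70, 50, 30, 60),
  "City B" = c(20, 30, 40, 20, 40, 50, 60, 80, 30, 50),
  "City C" = c(20, 30, 20, 15, 18, 20, 40, 60, 90, 18),
  "City D" = c(30, 40, 20, 40, 10, 15, 72, 90, 17, 15)
)

# Euclidean norm of each row.
norms <- sqrt(rowSums(X^2))

# tcrossprod(X) is X %*% t(X): every pairwise dot product in one call.
cos_sim <- tcrossprod(X) / outer(norms, norms)

# Cosine distance under the "1 - similarity" convention.
cos_dist <- 1 - cos_sim

round(cos_sim, 7)
round(cos_dist, 7)
```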
We can do the same thing with the longer data:
dat2 <- tibble::tribble( ~Dates , ~City , ~Var1, ~Var2, ~Var3,
"2020-01-01" , "A" , 20 , 200 , 5 ,
"2020-01-02" , "A" , 30 , 300 , 3 ,
"2020-01-03" , "A" , 40 , 220 , 4 ,
"2020-01-04" , "A" , 20 , 150 , 2 ,
"2020-01-05" , "A" , 40 , 180 , 5 ,
"2020-01-01" , "B" , 50 , 200 , 6 ,
"2020-01-02" , "B" , 60 , 400 , 7 ,
"2020-01-03" , "B" , 80 , 600 , 8 ,
"2020-01-04" , "B" , 30 , 900 , 4 ,
"2020-01-05" , "B" , 50 , 180 , 2 ,
"2020-01-01" , "C" , 20 , 230 , 3 ,
"2020-01-02" , "C" , 30 , 340 , 5 ,
"2020-01-03" , "C" , 40 , 230 , 3 ,
"2020-01-04" , "C" , 20 , 120 , 5 ,
"2020-01-05" , "C" , 40 , 120 , 4 ,
"2020-01-01" , "D" , 20 , 400 , 5 ,
"2020-01-02" , "D" , 30 , 500 , 6 ,
"2020-01-03" , "D" , 10 , 600 , 7 ,
"2020-01-04" , "D" , 50 , 300 , 7 ,
"2020-01-05" , "D" , 20 , 300 , 4 )
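library(tidyr)  # provides pivot_wider(); dplyr is also needed for %>% and select()
# One row per city; one column per variable-date combination (e.g. Var1_2020-01-01).
d2w <- dat2 %>%
  pivot_wider(names_from="Dates",
              values_from=c("Var1", "Var2", "Var3"))
X2 <- d2w %>% select(-City) %>% as.matrix
rownames(X2) <- paste0("City ", d2w$City)
# Same sum of absolute differences as edfun, now over all variables and dates.
edfun2 <- function(x,y){
  sum(sqrt((X2[x, ] - X2[y,])^2))
}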
o2 <- outer(1:nrow(X2), 1:nrow(X2), Vectorize(edfun2))
rownames(o2) <- colnames(o2) <- rownames(X2)
o2
# City A City B City C City D
# City A 0 1364 179 1142
# City B 1364 0 1433 1208
# City C 179 1433 0 1149
# City D 1142 1208 1149 0
# Cosine similarity over all variables and dates (cosine distance is 1 minus this).
cdfun2 <- function(x,y){
  num <- sum(X2[x,]*X2[y, ])
  d1 <- sqrt(sum(X2[x, ]^2))
  d2 <- sqrt(sum(X2[y, ]^2))
  num/(d1*d2)
}
o2a <- outer(1:nrow(X2), 1:nrow(X2), Vectorize(cdfun2))
rownames(o2a) <- colnames(o2a) <- rownames(X2)
o2a
# City A City B City C City D
# City A 1.0000000 0.8051685 0.9861522 0.9742238
# City B 0.8051685 1.0000000 0.7617596 0.8338688
# City C 0.9861522 0.7617596 1.0000000 0.9637144
# City D 0.9742238 0.8338688 0.9637144 1.0000000
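If you would rather have a single function than four separate ones, something along these lines might work. It is only a sketch in base R: series_dist() and its arguments are hypothetical names of mine, it assumes the long layout above (columns named Dates and City) and a balanced panel where every city has the same dates, and it returns the cosine distance as 1 minus the similarity.

```r
# Long-format data from the second example, rebuilt as a plain data.frame.
dat2 <- data.frame(
  Dates = rep(sprintf("2020-01-%02d", 1:5), times = 4),
  City  = rep(c("A", "B", "C", "D"), each = 5),
  Var1  = c(20, 30, 40, 20, 40,  50, 60, 80, 30, 50,
            20, 30, 40, 20, 40,  20, 30, 10, 50, 20),
  Var2  = c(200, 300, 220, 150, 180,  200, 400, 600, 900, 180,
            230, 340, 230, 120, 120,  400, 500, 600, 300, 300),
  Var3  = c(5, 3, 4, 2, 5,  6, 7, 8, 4, 2,
            3, 5, 3, 5, 4,  5, 6, 7, 7, 4)
)

# Hypothetical helper, not from the answer above.
# vars: names of the variables to include; method: which measure to return.
series_dist <- function(df, vars, method = c("sum_abs", "cosine")) {
  method <- match.arg(method)
  # Sort so that every city's values are in the same date order.
  df <- df[order(df$City, df$Dates), ]
  # One row per city: all dates of the first variable, then the second, ...
  X <- t(sapply(split(df[vars], df$City),
                function(d) unlist(d, use.names = FALSE)))
  if (method == "sum_abs") {
    # Sum of absolute differences, as computed by edfun/edfun2.
    as.matrix(dist(X, method = "manhattan"))
  } else {
    norms <- sqrt(rowSums(X^2))
    1 - tcrossprod(X) / outer(norms, norms)  # cosine distance
  }
}

d_all  <- series_dist(dat2, c("Var1", "Var2", "Var3"))            # all variables
d_var1 <- series_dist(dat2, "Var1")                               # single variable
c_all  <- series_dist(dat2, c("Var1", "Var2", "Var3"), "cosine")  # cosine distance
d_all
```

Because the variables are combined on their raw scales, Var2 dominates the totals; whether to standardise each variable first is a modelling choice that this sketch leaves open.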