RStudio: Is there a way to calculate the cosine and Euclidean distance between two time series, for a single variable and for multiple variables of interest?

Suppose I have time series data for City A, City B, City C, and City D, as shown below:

+------------+--------+--------+--------+--------+
| Dates      | City A | City B | City C | City D |
+------------+--------+--------+--------+--------+
| 2020-01-01 | 10     | 20     | 20     | 30     |
+------------+--------+--------+--------+--------+
| 2020-01-02 | 20     | 30     | 30     | 40     |
+------------+--------+--------+--------+--------+
| 2020-01-03 | 30     | 40     | 20     | 20     |
+------------+--------+--------+--------+--------+
| 2020-01-04 | 40     | 20     | 15     | 40     |
+------------+--------+--------+--------+--------+
| 2020-01-05 | 50     | 40     | 18     | 10     |
+------------+--------+--------+--------+--------+
| 2020-01-06 | 60     | 50     | 20     | 15     |
+------------+--------+--------+--------+--------+
| 2020-01-07 | 70     | 60     | 40     | 72     |
+------------+--------+--------+--------+--------+
| 2020-01-08 | 50     | 80     | 60     | 90     |
+------------+--------+--------+--------+--------+
| 2020-01-09 | 30     | 30     | 90     | 17     |
+------------+--------+--------+--------+--------+
| 2020-01-10 | 60     | 50     | 18     | 15     |
+------------+--------+--------+--------+--------+

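I want to calculate the cosine and Euclidean distances between A & B, A & C, and A & D separately, aligning the series on the time index.

For example, to calculate the Euclidean distance between City A and City B, I would compute the Euclidean distance between their 2020-01-01 values, their 2020-01-02 values, their 2020-01-03 values, and so on, then add all of these together to get the final Euclidean distance between City A and City B.

What is an elegant way to write an R function that performs this task?

Then, suppose my data starts to include more variables: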

+------------+-------+------+------+------+
| Dates      | City  | Var1 | Var2 | Var3 |
+------------+-------+------+------+------+
| 2020-01-01 | A     | 20   | 200  | 5    |
+------------+-------+------+------+------+
| 2020-01-02 | A     | 30   | 300  | 3    |
+------------+-------+------+------+------+
| 2020-01-03 | A     | 40   | 220  | 4    |
+------------+-------+------+------+------+
| 2020-01-04 | A     | 20   | 150  | 2    |
+------------+-------+------+------+------+
| 2020-01-05 | A     | 40   | 180  | 5    |
+------------+-------+------+------+------+
| 2020-01-01 | B     | 50   | 200  | 6    |
+------------+-------+------+------+------+
| 2020-01-02 | B     | 60   | 400  | 7    |
+------------+-------+------+------+------+
| 2020-01-03 | B     | 80   | 600  | 8    |
+------------+-------+------+------+------+
| 2020-01-04 | B     | 30   | 900  | 4    |
+------------+-------+------+------+------+
| 2020-01-05 | B     | 50   | 180  | 2    |
+------------+-------+------+------+------+
| 2020-01-01 | C     | 20   | 230  | 3    |
+------------+-------+------+------+------+
| 2020-01-02 | C     | 30   | 340  | 5    |
+------------+-------+------+------+------+
| 2020-01-03 | C     | 40   | 230  | 3    |
+------------+-------+------+------+------+
| 2020-01-04 | C     | 20   | 120  | 5    |
+------------+-------+------+------+------+
| 2020-01-05 | C     | 40   | 120  | 4    |
+------------+-------+------+------+------+
| 2020-01-01 | D     | 20   | 400  | 5    |
+------------+-------+------+------+------+
| 2020-01-02 | D     | 30   | 500  | 6    |
+------------+-------+------+------+------+
| 2020-01-03 | D     | 10   | 600  | 7    |
+------------+-------+------+------+------+
| 2020-01-04 | D     | 50   | 300  | 7    |
+------------+-------+------+------+------+
| 2020-01-05 | D     | 20   | 300  | 4    |
+------------+-------+------+------+------+

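Using the same example as above, to calculate the Euclidean distance between City A and City B, I would compute the Euclidean distance between their 2020-01-01 values, 2020-01-02 values, 2020-01-03 values, and so on for Var1, then repeat the process for Var2 and Var3. Finally, I would add all of these together to get the total Euclidean distance between City A and City B.

I would like to know whether this kind of distance calculation is technically feasible and, if so, how I can write an R function that performs these tasks for both Euclidean and cosine distance, for a single variable of interest and for multiple variables of interest respectively.

Thank you very much for your help!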

I edited the post to include cosine distance. First, let's build the first dataset shown above.

dat <- tibble::tribble(~Dates, ~`City A`, ~`City B`,  ~`City C`, ~`City D`,
                       "2020-01-01" ,  10     ,  20     , 20     , 30, 
                       "2020-01-02" ,  20     ,  30     , 30     , 40, 
                       "2020-01-03" ,  30     ,  40     , 20     , 20, 
                       "2020-01-04" ,  40     ,  20     , 15     , 40, 
                       "2020-01-05" ,  50     ,  40     , 18     , 10, 
                       "2020-01-06" ,  60     ,  50     , 20     , 15, 
                       "2020-01-07" ,  70     ,  60     , 40     , 72, 
                       "2020-01-08" ,  50     ,  80     , 60     , 90, 
                       "2020-01-09" ,  30     ,  30     , 90     , 17, 
                       "2020-01-10" ,  60     ,  50     , 18     , 15) 

dat$Dates <- lubridate::ymd(dat$Dates)

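Then we can rearrange the data so that, after transposing, each city is a row and each date is a column, and define the function that computes the distance. Because we are going to use it with outer(), its two arguments are the indices of two different rows of the matrix X. Note that with a single value per date, the per-date Euclidean distance is just the absolute difference, so the function sums absolute differences across dates.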

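library(dplyr)  # provides %>% and select()

# Transpose so that each row is a city and each column is a date
X <- dat %>% select(-Dates) %>% as.matrix() %>% t()

# x and y are row indices into X; sqrt(d^2) equals abs(d) element-wise
edfun <- function(x, y) {
  sum(sqrt((X[x, ] - X[y, ])^2))
}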

Now we can calculate the distances and print them:

o1 <- outer(1:nrow(X), 1:nrow(X), Vectorize(edfun))
rownames(o1) <- colnames(o1) <- rownames(X)
o1
#        City A City B City C City D
# City A      0    120    269    235
# City B    120      0    209    195
# City C    269    209      0    196
# City D    235    195    196      0
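
As a cross-check, base R's dist() can reproduce these numbers. The sketch below rebuilds X by hand from the first table (an assumption made only so the block runs on its own) and compares method = "manhattan", which matches the sum of per-date absolute differences above, with method = "euclidean", the conventional square root of the summed squared differences. Which of the two fits the question better is a judgment call.

```r
# Assumption: X is retyped from the first table so this block is self-contained.
X <- rbind(
  "City A" = c(10, 20, 30, 40, 50, 60, 70, 50, 30, 60),
  "City B" = c(20, 30, 40, 20, 40, 50, 60, 80, 30, 50),
  "City C" = c(20, 30, 20, 15, 18, 20, 40, 60, 90, 18),
  "City D" = c(30, 40, 20, 40, 10, 15, 72, 90, 17, 15)
)

# Sum of per-date absolute differences (same quantity as edfun)
d_manhattan <- as.matrix(dist(X, method = "manhattan"))

# Conventional Euclidean distance: sqrt(sum((x - y)^2))
d_euclidean <- as.matrix(dist(X, method = "euclidean"))

d_manhattan["City A", "City B"]  # 120
d_euclidean["City A", "City B"]  # sqrt(2000), about 44.72
```

Both calls return a full symmetric matrix once wrapped in as.matrix(), so no outer() or Vectorize() step is needed for these two measures.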

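Next, we can write a cosine function and estimate those values. Note that cdfun returns the cosine similarity (1 on the diagonal, where a series is compared with itself); subtract it from 1 if a cosine distance is needed.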

cdfun <- function(x,y){
  num <- sum(X[x,]*X[y, ])
  d1 <- sqrt(sum(X[x, ]^2))
  d2 <- sqrt(sum(X[y, ]^2))
  num/(d1*d2)
}
o1a <- outer(1:nrow(X), 1:nrow(X), Vectorize(cdfun))
rownames(o1a) <- colnames(o1a) <- rownames(X)
o1a
#           City A    City B    City C    City D
# City A 1.0000000 0.9521640 0.7400186 0.7913705
# City B 0.9521640 1.0000000 0.8109673 0.8805258
# City C 0.7400186 0.8109673 1.0000000 0.7674460
# City D 0.7913705 0.8805258 0.7674460 1.0000000
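
The values above are similarities. A minimal sketch of turning them into distances, assuming the common convention distance = 1 - similarity, and computing all pairs at once with matrix algebra instead of outer(). X is retyped from the first table so the block runs on its own; the names cos_sim and cos_dist are illustrative.

```r
# Assumption: X is retyped from the first table so this block is self-contained.
X <- rbind(
  "City A" = c(10, 20, 30, 40, 50, 60, 70, 50, 30, 60),
  "City B" = c(20, 30, 40, 20, 40, 50, 60, 80, 30, 50),
  "City C" = c(20, 30, 20, 15, 18, 20, 40, 60, 90, 18),
  "City D" = c(30, 40, 20, 40, 10, 15, 72, 90, 17, 15)
)

# Euclidean norm of each row (one per city)
norms <- sqrt(rowSums(X^2))

# tcrossprod(X) is X %*% t(X): all pairwise dot products between rows
cos_sim <- tcrossprod(X) / outer(norms, norms)

# Assumed convention: cosine distance = 1 - cosine similarity
cos_dist <- 1 - cos_sim

round(cos_dist, 4)
```

With this convention a city has distance 0 to itself, and larger values mean the two series point in more different directions.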

We can do the same thing with the longer data:


dat2 <- tibble::tribble( ~Dates    ,  ~City ,  ~Var1,  ~Var2,  ~Var3, 
                         "2020-01-01" , "A"     , 20   , 200  , 5    ,
                         "2020-01-02" , "A"     , 30   , 300  , 3    ,
                         "2020-01-03" , "A"     , 40   , 220  , 4    ,
                         "2020-01-04" , "A"     , 20   , 150  , 2    ,
                         "2020-01-05" , "A"     , 40   , 180  , 5    ,
                         "2020-01-01" , "B"     , 50   , 200  , 6    ,
                         "2020-01-02" , "B"     , 60   , 400  , 7    ,
                         "2020-01-03" , "B"     , 80   , 600  , 8    ,
                         "2020-01-04" , "B"     , 30   , 900  , 4    ,
                         "2020-01-05" , "B"     , 50   , 180  , 2    ,
                         "2020-01-01" , "C"     , 20   , 230  , 3    ,
                         "2020-01-02" , "C"     , 30   , 340  , 5    ,
                         "2020-01-03" , "C"     , 40   , 230  , 3    ,
                         "2020-01-04" , "C"     , 20   , 120  , 5    ,
                         "2020-01-05" , "C"     , 40   , 120  , 4    ,
                         "2020-01-01" , "D"     , 20   , 400  , 5    ,
                         "2020-01-02" , "D"     , 30   , 500  , 6    ,
                         "2020-01-03" , "D"     , 10   , 600  , 7    ,
                         "2020-01-04" , "D"     , 50   , 300  , 7    ,
                         "2020-01-05" , "D"     , 20   , 300  , 4  )


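library(dplyr)  # provides %>% and select()
library(tidyr)  # provides pivot_wider()

# One row per city, one column per variable-date pair (e.g. Var1_2020-01-01)
d2w <- dat2 %>% 
  pivot_wider(names_from = "Dates", 
              values_from = c("Var1", "Var2", "Var3"))

X2 <- d2w %>% select(-City) %>% as.matrix()
rownames(X2) <- paste0("City ", d2w$City)

# Sum of absolute differences over every variable and every date
edfun2 <- function(x, y) {
  sum(sqrt((X2[x, ] - X2[y, ])^2))
}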

o2 <- outer(1:nrow(X2), 1:nrow(X2), Vectorize(edfun2))
rownames(o2) <- colnames(o2) <- rownames(X2)
o2
#        City A City B City C City D
# City A      0   1364    179   1142
# City B   1364      0   1433   1208
# City C    179   1433      0   1149
# City D   1142   1208   1149      0
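
One thing to watch in these totals is scale: Var2 is measured in the hundreds, so it dominates the sum (of the 1364 between A and B, Var2 contributes 1230). A possible remedy, not part of the original approach, is to standardize each variable before computing distances. The sketch below uses base R only; the helper to_wide() and the z-score choice are assumptions for illustration, and the data are retyped from the second table.

```r
# Assumption: data retyped from the second table so this block is self-contained.
dat2 <- data.frame(
  Dates = rep(as.Date("2020-01-01") + 0:4, times = 4),
  City  = rep(c("A", "B", "C", "D"), each = 5),
  Var1  = c(20, 30, 40, 20, 40,  50, 60, 80, 30, 50,
            20, 30, 40, 20, 40,  20, 30, 10, 50, 20),
  Var2  = c(200, 300, 220, 150, 180,  200, 400, 600, 900, 180,
            230, 340, 230, 120, 120,  400, 500, 600, 300, 300),
  Var3  = c(5, 3, 4, 2, 5,  6, 7, 8, 4, 2,
            3, 5, 3, 5, 4,  5, 6, 7, 7, 4)
)
vars <- c("Var1", "Var2", "Var3")

# Hypothetical helper: one row per city, all variable-date values side by side
to_wide <- function(df, vars) {
  df <- df[order(df$City, df$Dates), ]
  t(sapply(split(df[vars], df$City), unlist))
}

# Unscaled totals, matching the table above
raw_d <- as.matrix(dist(to_wide(dat2, vars), method = "manhattan"))

# z-score each variable across all cities and dates, then recompute
scaled <- dat2
scaled[vars] <- lapply(scaled[vars], function(v) as.numeric(scale(v)))
scaled_d <- as.matrix(dist(to_wide(scaled, vars), method = "manhattan"))

raw_d["A", "B"]  # 1364
round(scaled_d, 2)
```

Whether to standardize depends on whether the variables are meant to contribute equally; if Var2 really should carry more weight, the unscaled totals may be what you want.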



cdfun2 <- function(x,y){
  num <- sum(X2[x,]*X2[y, ])
  d1 <- sqrt(sum(X2[x, ]^2))
  d2 <- sqrt(sum(X2[y, ]^2))
  num/(d1*d2)
}

o2a <- outer(1:nrow(X2), 1:nrow(X2), Vectorize(cdfun2))
rownames(o2a) <- colnames(o2a) <- rownames(X2)
o2a
#           City A    City B    City C    City D
# City A 1.0000000 0.8051685 0.9861522 0.9742238
# City B 0.8051685 1.0000000 0.7617596 0.8338688
# City C 0.9861522 0.7617596 1.0000000 0.9637144
# City D 0.9742238 0.8338688 0.9637144 1.0000000
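
The four functions above all read X or X2 from the global environment. One way to tidy this up, sketched here under the assumption that each row of the matrix is one series, is a single helper that takes the matrix and the measure as arguments. The names pairwise(), abs_sum() and cos_sim() are illustrative, not from any package.

```r
# Hypothetical helper: apply f to every pair of rows of M
pairwise <- function(M, f) {
  idx <- seq_len(nrow(M))
  out <- outer(idx, idx, Vectorize(function(i, j) f(M[i, ], M[j, ])))
  dimnames(out) <- list(rownames(M), rownames(M))
  out
}

# Measures take two numeric vectors of equal length
abs_sum <- function(x, y) sum(abs(x - y))                        # as in edfun
cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2)) # as in cdfun

# Assumption: X is retyped from the first table so this block is self-contained.
X <- rbind(
  "City A" = c(10, 20, 30, 40, 50, 60, 70, 50, 30, 60),
  "City B" = c(20, 30, 40, 20, 40, 50, 60, 80, 30, 50),
  "City C" = c(20, 30, 20, 15, 18, 20, 40, 60, 90, 18),
  "City D" = c(30, 40, 20, 40, 10, 15, 72, 90, 17, 15)
)

pairwise(X, abs_sum)
round(pairwise(X, cos_sim), 4)
```

The same two calls would work on the wide multi-variable matrix, since the helper only assumes one series per row.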