Pairwise similarity with consecutive points

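I created a large document similarity matrix using paragraph2vec_similarity from the doc2vec package. I converted it to a data frame and added a TITLE column at the beginning so that I can sort or group it later.

Current dummy output: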

Title Header DocName_1900.txt_1 DocName_1900.txt_2 DocName_1900.txt_3 DocName_1901.txt_1 DocName_1901.txt_2
Doc1 DocName_1900.txt_1 1.000000 0.7369358 0.6418045 0.6268959 0.6823404
Doc1 DocName_1900.txt_2 0.7369358 1.000000 0.6544884 0.7418507 0.5174367
Doc1 DocName_1900.txt_3 0.6418045 0.6544884 1.000000 0.6180578 0.5274650
Doc2 DocName_1901.txt_1 0.6268959 0.7418507 0.6180578 1.000000 0.5755243
Doc2 DocName_1901.txt_2 0.6823404 0.5174367 0.5274650 0.5755243 1.000000

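What I want is a data frame that gives, for each document, the similarity between each part and the part that follows it, in consecutive order: the score for Doc1.1 and Doc1.2, then the score for Doc1.2 and Doc1.3, and so on. I am only interested in similarity scores within each individual document, that is, the values that run alongside the diagonal of the matrix above.

Expected output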

Title  Similarity for 1-2  Similarity for 2-3  Similarity for 3-4
Doc1   0.7369358           0.6544884           NA
Doc2   0.5755243           NA                  NA
Doc3   0.6049844           0.5250659           0.5113757

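I was able to generate a data frame that gives the similarity score of each document against all the others with x <- data.frame(col=colnames(m)[col(m)], row=rownames(m)[row(m)], similarity=c(m)). That is the closest I could get. Is there a better way? I am working with more than 500 titles of varying lengths. Using diag is still an option, but it puts everything at the end of the matrix and I lose the document grouping.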

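If I understand your question correctly, one possible solution within the tidyverse is to make the data long, remove the leading letters from Header and from the column names, split both at the dot, and filter by comparing the resulting pieces. The last step before making the data wide again generates a new column that serves as the column names: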

library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title  Header  Doc1.1  Doc1.2  Doc1.3  Doc2.1  Doc2.2
Doc1    Doc1.1  1.000000    0.7369358   0.6418045   0.6268959   0.6823404
Doc1    Doc1.2  0.7369358   1.000000    0.6544884   0.7418507   0.5174367
Doc1    Doc1.3  0.6418045   0.6544884   1.000000    0.6180578   0.5274650
Doc2    Doc2.1  0.6268959   0.7418507   0.6180578   1.000000    0.5755243
Doc2    Doc2.2  0.6823404   0.5174367   0.5274650   0.5755243   1.000000")

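# reshape to long form, keep only consecutive parts of the same document, then widen again
df %>%
    tidyr::pivot_longer(-c(Title, Header)) %>% 
    # drop the leading letters: "Doc1.2" becomes "1.2"
    dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+"))) %>%
    # split "1.2" into document number and part number (the dot is escaped as "\\." in R)
    tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
    tidyr::separate(name, sep = "\\.", into = c("s1","s2")) %>% 
    # same document, and the row part directly follows the column part
    dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>% 
    dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>% 
    tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)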


# A tibble: 2 x 3
  Title `Similarity for 1 - 2` `Similarity for 2 - 3`
  <chr>                  <dbl>                  <dbl>
1 Doc1                   0.737                  0.654
2 Doc2                   0.576                 NA    

Edit for the new column names (this needs a bit more string manipulation):

library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title  Header  DocName_1900.txt_1  DocName_1900.txt_2  DocName_1900.txt_3  DocName_1901.txt_1  DocName_1901.txt_2
Doc1    Doc1.1  1.000000    0.7369358   0.6418045   0.6268959   0.6823404
Doc1    Doc1.2  0.7369358   1.000000    0.6544884   0.7418507   0.5174367
Doc1    Doc1.3  0.6418045   0.6544884   1.000000    0.6180578   0.5274650
Doc2    Doc2.1  0.6268959   0.7418507   0.6180578   1.000000    0.5755243
Doc2    Doc2.2  0.6823404   0.5174367   0.5274650   0.5755243   1.000000")

df %>%
    tidyr::pivot_longer(-c(Title, Header)) %>% 
    # drop leading letters and underscores: "Doc1.2" becomes "1.2", "DocName_1900.txt_1" becomes "1900.txt_1"
    dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+_*"))) %>%
    tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
    tidyr::separate(name, sep = "\\.txt_", into = c("s1","s2")) %>% 
    # map the year in the file name to the document number (1900 -> 1, 1901 -> 2)
    dplyr::mutate(s1 = as.numeric(s1)-1899) %>%
    dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>% 
    dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>% 
    tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)

# A tibble: 2 x 3
  Title `Similarity for 1 - 2` `Similarity for 2 - 3`
  <chr>                  <dbl>                  <dbl>
1 Doc1                   0.737                  0.654
2 Doc2                   0.576                 NA    
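
The `- 1899` offset above is tied to the dummy file names, so it may not carry over to 500 real titles. A minimal sketch of a variant that compares the file-name prefix directly is given below. It assumes that Header holds the same identifiers as the column names (as in the question's output) and that each identifier ends in `_<part number>`; the names `df2`, `doc_a`, `doc_b`, `part_a`, `part_b` and `pair` are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Dummy data shaped like the question's output: Header values equal the column names.
ids <- c("DocName_1900.txt_1", "DocName_1900.txt_2", "DocName_1900.txt_3",
         "DocName_1901.txt_1", "DocName_1901.txt_2")
m <- matrix(c(1.0000000, 0.7369358, 0.6418045, 0.6268959, 0.6823404,
              0.7369358, 1.0000000, 0.6544884, 0.7418507, 0.5174367,
              0.6418045, 0.6544884, 1.0000000, 0.6180578, 0.5274650,
              0.6268959, 0.7418507, 0.6180578, 1.0000000, 0.5755243,
              0.6823404, 0.5174367, 0.5274650, 0.5755243, 1.0000000),
            nrow = 5, byrow = TRUE, dimnames = list(ids, ids))
df2 <- data.frame(Title = c("Doc1", "Doc1", "Doc1", "Doc2", "Doc2"),
                  Header = ids, m, check.names = FALSE)

res <- df2 %>%
  pivot_longer(-c(Title, Header), names_to = "other", values_to = "similarity") %>%
  # assumption: everything before the final "_<number>" identifies the document
  mutate(doc_a  = sub("_[0-9]+$", "", Header),
         doc_b  = sub("_[0-9]+$", "", other),
         part_a = as.integer(sub("^.*_", "", Header)),
         part_b = as.integer(sub("^.*_", "", other))) %>%
  # keep pairs from the same document whose part numbers are consecutive
  filter(doc_a == doc_b, part_b - part_a == 1) %>%
  mutate(pair = paste0("Similarity for ", part_a, "-", part_b)) %>%
  select(Title, pair, similarity) %>%
  pivot_wider(names_from = pair, values_from = similarity)

res
```

Because the comparison is done on the prefix itself, no per-dataset offset is needed; documents with more parts simply produce more `Similarity for n-(n+1)` columns, with NA where a document has fewer parts.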

Another solution:

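df %>%
  group_by(Title) %>%
  # embed() pairs each Header with the one before it: column 1 = later part, column 2 = earlier part
  summarize(name = embed(Header, 2), .groups = 'drop') %>%
  # use each pair as (row name, column name) to look up the similarity value
  mutate(value = transform(df, row.names = Header)[name],
         # build labels such as "1_2" by stripping everything up to the dot
         name = str_remove_all(paste(name[,2],name[,1], sep = '_'), '[^_]+[.]'))%>%
  pivot_wider()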

# A tibble: 2 x 3
  Title `1_2`     `2_3`    
  <chr> <chr>     <chr>    
1 Doc1  0.7369358 0.6544884
2 Doc2  0.5755243 NA
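
The diag idea from the question can also be made to work without losing the grouping. The following is only a base R sketch: it assumes the rows of the matrix are already ordered so that consecutive parts of a document sit next to each other, and that a title vector parallel to the rows is available. The names `title`, `pairs` and `wide` are made up for this illustration.

```r
# Dummy similarity matrix from the question.
ids <- c("DocName_1900.txt_1", "DocName_1900.txt_2", "DocName_1900.txt_3",
         "DocName_1901.txt_1", "DocName_1901.txt_2")
m <- matrix(c(1.0000000, 0.7369358, 0.6418045, 0.6268959, 0.6823404,
              0.7369358, 1.0000000, 0.6544884, 0.7418507, 0.5174367,
              0.6418045, 0.6544884, 1.0000000, 0.6180578, 0.5274650,
              0.6268959, 0.7418507, 0.6180578, 1.0000000, 0.5755243,
              0.6823404, 0.5174367, 0.5274650, 0.5755243, 1.0000000),
            nrow = 5, byrow = TRUE, dimnames = list(ids, ids))
title <- c("Doc1", "Doc1", "Doc1", "Doc2", "Doc2")  # one title per matrix row

# Entries just above the main diagonal: (1,2), (2,3), ..., (n-1,n).
i <- seq_len(nrow(m) - 1)
super <- m[cbind(i, i + 1)]

# Drop pairs that straddle two different documents.
same_doc <- title[i] == title[i + 1]
pairs <- data.frame(Title = title[i][same_doc], similarity = super[same_doc])

# Number the pairs within each title (1 = parts 1-2, 2 = parts 2-3, ...).
pairs$pair <- ave(seq_along(pairs$Title), pairs$Title, FUN = seq_along)

# One row per title; titles with fewer parts get NA in the later columns.
wide <- reshape(pairs, idvar = "Title", timevar = "pair", direction = "wide")
wide
```

Indexing with a two-column matrix picks single cells, so only n - 1 values are read instead of reshaping the full n x n table, which may matter with several hundred titles.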