Pairwise similarity with consecutive points
I created a large document similarity matrix using `paragraph2vec_similarity` from the `doc2vec` package. I converted it to a data frame and added a TITLE column at the start so I can sort or group it later.
Current dummy output:

Title | Header | DocName_1900.txt_1 | DocName_1900.txt_2 | DocName_1900.txt_3 | DocName_1901.txt_1 | DocName_1901.txt_2 |
---|---|---|---|---|---|---|
Doc1 | DocName_1900.txt_1 | 1.000000 | **0.7369358** | 0.6418045 | 0.6268959 | 0.6823404 |
Doc1 | DocName_1900.txt_2 | 0.7369358 | 1.000000 | **0.6544884** | 0.7418507 | 0.5174367 |
Doc1 | DocName_1900.txt_3 | 0.6418045 | 0.6544884 | 1.000000 | 0.6180578 | 0.5274650 |
Doc2 | DocName_1901.txt_1 | 0.6268959 | 0.7418507 | 0.6180578 | 1.000000 | **0.5755243** |
Doc2 | DocName_1901.txt_2 | 0.6823404 | 0.5174367 | 0.5274650 | 0.5755243 | 1.000000 |
What I would like is a data frame that gives the similarity of each subsequent document part in consecutive order: i.e. the score for Doc1.1 and Doc1.2, then for Doc1.2 and Doc1.3. I am only interested in the similarity scores within each individual document, in diagonal order, as shown in bold above.
Expected output:

Title | Similarity for 1-2 | Similarity for 2-3 | Similarity for 3-4 |
---|---|---|---|
Doc1 | 0.7369358 | 0.6544884 | NA |
Doc2 | 0.5755243 | NA | NA |
Doc3 | 0.6049844 | 0.5250659 | 0.5113757 |
I was able to generate a data frame that gives one document's similarity scores against all the others with `x <- data.frame(col = colnames(m)[col(m)], row = rownames(m)[row(m)], similarity = c(m))`. That is the closest I could get. Is there a better way? I am dealing with 500+ titles of varying lengths. There is still the option of using `diag`, but it pushes everything to the end of the matrix and I lose the document grouping.
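For what it's worth, the `diag` idea does not have to lose the grouping if it is applied per Title block. A minimal base-R sketch; the matrix `m` and the `titles` vector here are made-up stand-ins for the real data:

```r
# Toy similarity matrix for one title, Doc1, with three chunks
m <- matrix(c(1.0000000, 0.7369358, 0.6418045,
              0.7369358, 1.0000000, 0.6544884,
              0.6418045, 0.6544884, 1.0000000),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Doc1.", 1:3), paste0("Doc1.", 1:3)))
titles <- c("Doc1", "Doc1", "Doc1")  # one title per row of m

# For each title, subset its block of the matrix and read off the first
# superdiagonal, i.e. the similarity of chunk i to chunk i + 1
pairs_by_title <- lapply(split(seq_len(nrow(m)), titles), function(idx) {
  if (length(idx) < 2) return(numeric(0))
  block <- m[idx, idx, drop = FALSE]
  block[cbind(seq_len(length(idx) - 1), 2:length(idx))]
})
pairs_by_title$Doc1
# [1] 0.7369358 0.6544884
```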
If I understand your problem correctly, one possible solution in the tidyverse is to pivot the data into long format, strip the leading letters from Header and from the column names, split both at the dot, and filter by comparing the resulting indices. The indices also supply the new column names before the data is pivoted wide again:
```r
library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title Header Doc1.1 Doc1.2 Doc1.3 Doc2.1 Doc2.2
Doc1 Doc1.1 1.000000 0.7369358 0.6418045 0.6268959 0.6823404
Doc1 Doc1.2 0.7369358 1.000000 0.6544884 0.7418507 0.5174367
Doc1 Doc1.3 0.6418045 0.6544884 1.000000 0.6180578 0.5274650
Doc2 Doc2.1 0.6268959 0.7418507 0.6180578 1.000000 0.5755243
Doc2 Doc2.2 0.6823404 0.5174367 0.5274650 0.5755243 1.000000")

df %>%
  tidyr::pivot_longer(-c(Title, Header)) %>%
  dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+"))) %>%
  tidyr::separate(Header, sep = "\\.", into = c("f1", "f2")) %>%
  tidyr::separate(name, sep = "\\.", into = c("s1", "s2")) %>%
  dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>%
  dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>%
  tidyr::pivot_wider(-c(f1, f2, s1, s2), names_from = "cols", values_from = value)
# A tibble: 2 x 3
#   Title `Similarity for 1 - 2` `Similarity for 2 - 3`
#   <chr>                  <dbl>                  <dbl>
# 1 Doc1                   0.737                  0.654
# 2 Doc2                   0.576                  NA
```
Edit for the new column names (some more string manipulation is needed):
```r
library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title Header DocName_1900.txt_1 DocName_1900.txt_2 DocName_1900.txt_3 DocName_1901.txt_1 DocName_1901.txt_2
Doc1 Doc1.1 1.000000 0.7369358 0.6418045 0.6268959 0.6823404
Doc1 Doc1.2 0.7369358 1.000000 0.6544884 0.7418507 0.5174367
Doc1 Doc1.3 0.6418045 0.6544884 1.000000 0.6180578 0.5274650
Doc2 Doc2.1 0.6268959 0.7418507 0.6180578 1.000000 0.5755243
Doc2 Doc2.2 0.6823404 0.5174367 0.5274650 0.5755243 1.000000")

df %>%
  tidyr::pivot_longer(-c(Title, Header)) %>%
  dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+_*"))) %>%
  tidyr::separate(Header, sep = "\\.", into = c("f1", "f2")) %>%
  tidyr::separate(name, sep = "\\.txt_", into = c("s1", "s2")) %>%
  dplyr::mutate(s1 = as.numeric(s1) - 1899) %>%
  dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>%
  dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>%
  tidyr::pivot_wider(-c(f1, f2, s1, s2), names_from = "cols", values_from = value)
# A tibble: 2 x 3
#   Title `Similarity for 1 - 2` `Similarity for 2 - 3`
#   <chr>                  <dbl>                  <dbl>
# 1 Doc1                   0.737                  0.654
# 2 Doc2                   0.576                  NA
```
Another solution:
```r
df %>%
  group_by(Title) %>%
  # embed(Header, 2) pairs every Header with its predecessor within a Title
  summarize(name = embed(Header, 2), .groups = 'drop') %>%
  # look up each pair's score in df (indexed by Header), then reduce the
  # pair labels to names like "1_2"
  mutate(value = transform(df, row.names = Header)[name],
         name = str_remove_all(paste(name[, 2], name[, 1], sep = '_'), '[^_]+[.]')) %>%
  pivot_wider()
# A tibble: 2 x 3
#   Title `1_2`     `2_3`
#   <chr> <chr>     <chr>
# 1 Doc1  0.7369358 0.6544884
# 2 Doc2  0.5755243 NA
```
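With hundreds of titles of varying length, one remaining step is reproducing the NA-padded "Similarity for i-(i+1)" layout of the expected output from per-title score vectors. A hedged base-R sketch; the `pairs` list is a made-up stand-in for real per-title results:

```r
# Named list: one vector of consecutive-pair similarities per title
pairs <- list(Doc1 = c(0.7369358, 0.6544884),
              Doc2 = c(0.5755243))

k <- max(lengths(pairs))  # the widest title decides the number of columns

# Pad every vector with NA up to length k, then bind rows into a data frame
wide <- data.frame(
  Title = names(pairs),
  t(vapply(pairs, function(p) c(p, rep(NA_real_, k - length(p))),
           numeric(k))),
  row.names = NULL,
  check.names = FALSE
)
names(wide)[-1] <- paste0("Similarity for ", seq_len(k), "-", seq_len(k) + 1)
```

Shorter titles simply end in NA, so titles of any length line up in one frame.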