Pairwise similarity with consecutive points
I created a large document similarity matrix using `paragraph2vec_similarity` from the `doc2vec` package. I converted it to a data frame and added a TITLE column at the start so I can sort or group it later.
Current dummy output:

Title | Header | DocName_1900.txt_1 | DocName_1900.txt_2 | DocName_1900.txt_3 | DocName_1901.txt_1 | DocName_1901.txt_2 |
---|---|---|---|---|---|---|
Doc1 | DocName_1900.txt_1 | 1.000000 | **0.7369358** | 0.6418045 | 0.6268959 | 0.6823404 |
Doc1 | DocName_1900.txt_2 | 0.7369358 | 1.000000 | **0.6544884** | 0.7418507 | 0.5174367 |
Doc1 | DocName_1900.txt_3 | 0.6418045 | 0.6544884 | 1.000000 | 0.6180578 | 0.5274650 |
Doc2 | DocName_1901.txt_1 | 0.6268959 | 0.7418507 | 0.6180578 | 1.000000 | **0.5755243** |
Doc2 | DocName_1901.txt_2 | 0.6823404 | 0.5174367 | 0.5274650 | 0.5755243 | 1.000000 |
What I would like is a data frame that gives the similarity of each subsequent document part in consecutive order: i.e. the score for Doc1.1 and Doc1.2, then for Doc1.2 and Doc1.3. I am only interested in the similarity scores within each individual document, in diagonal order, as shown in bold above.
Expected output:

Title | Similarity for 1-2 | Similarity for 2-3 | Similarity for 3-4 |
---|---|---|---|
Doc1 | 0.7369358 | 0.6544884 | NA |
Doc2 | 0.5755243 | NA | NA |
Doc3 | 0.6049844 | 0.5250659 | 0.5113757 |
I was able to generate a data frame that gives one document's similarity scores against all the others with `x <- data.frame(col = colnames(m)[col(m)], row = rownames(m)[row(m)], similarity = c(m))`. That is the closest I could get. Is there a better way? I am dealing with 500+ titles of varying lengths. There is still the option of using `diag`, but it pushes everything to the end of the matrix and I lose the document grouping.
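For what it's worth, the `diag` idea does not have to lose the grouping if it is applied per Title block. A minimal base-R sketch; the matrix `m` and the `titles` vector here are made-up stand-ins for the real data:

```r
# Toy similarity matrix for one title, Doc1, with three chunks
m <- matrix(c(1.0000000, 0.7369358, 0.6418045,
              0.7369358, 1.0000000, 0.6544884,
              0.6418045, 0.6544884, 1.0000000),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Doc1.", 1:3), paste0("Doc1.", 1:3)))
titles <- c("Doc1", "Doc1", "Doc1")  # one title per row of m

# For each title, subset its block of the matrix and read off the first
# superdiagonal, i.e. the similarity of chunk i to chunk i + 1
pairs_by_title <- lapply(split(seq_len(nrow(m)), titles), function(idx) {
  if (length(idx) < 2) return(numeric(0))
  block <- m[idx, idx, drop = FALSE]
  block[cbind(seq_len(length(idx) - 1), 2:length(idx))]
})
pairs_by_title$Doc1
# [1] 0.7369358 0.6544884
```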
If I understand your problem correctly, one possible solution in the tidyverse is to pivot the data into long format, strip the leading letters from Header and from the column names, split both at the dot, and filter by comparing the resulting indices. The indices also supply the new column names before the data is pivoted wide again:
```r
library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title Header Doc1.1 Doc1.2 Doc1.3 Doc2.1 Doc2.2
Doc1 Doc1.1 1.000000 0.7369358 0.6418045 0.6268959 0.6823404
Doc1 Doc1.2 0.7369358 1.000000 0.6544884 0.7418507 0.5174367
Doc1 Doc1.3 0.6418045 0.6544884 1.000000 0.6180578 0.5274650
Doc2 Doc2.1 0.6268959 0.7418507 0.6180578 1.000000 0.5755243
Doc2 Doc2.2 0.6823404 0.5174367 0.5274650 0.5755243 1.000000")

df %>%
  tidyr::pivot_longer(-c(Title, Header)) %>%
  dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+"))) %>%
  tidyr::separate(Header, sep = "\\.", into = c("f1", "f2")) %>%
  tidyr::separate(name, sep = "\\.", into = c("s1", "s2")) %>%
  dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>%
  dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>%
  tidyr::pivot_wider(-c(f1, f2, s1, s2), names_from = "cols", values_from = value)
# A tibble: 2 x 3
#   Title `Similarity for 1 - 2` `Similarity for 2 - 3`
#   <chr>                  <dbl>                  <dbl>
# 1 Doc1                   0.737                  0.654
# 2 Doc2                   0.576                  NA
```
Edit for the new column names (some more string manipulation is needed):
```r
library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title Header DocName_1900.txt_1 DocName_1900.txt_2 DocName_1900.txt_3 DocName_1901.txt_1 DocName_1901.txt_2
Doc1 Doc1.1 1.000000 0.7369358 0.6418045 0.6268959 0.6823404
Doc1 Doc1.2 0.7369358 1.000000 0.6544884 0.7418507 0.5174367
Doc1 Doc1.3 0.6418045 0.6544884 1.000000 0.6180578 0.5274650
Doc2 Doc2.1 0.6268959 0.7418507 0.6180578 1.000000 0.5755243
Doc2 Doc2.2 0.6823404 0.5174367 0.5274650 0.5755243 1.000000")

df %>%
  tidyr::pivot_longer(-c(Title, Header)) %>%
  dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+_*"))) %>%
  tidyr::separate(Header, sep = "\\.", into = c("f1", "f2")) %>%
  tidyr::separate(name, sep = "\\.txt_", into = c("s1", "s2")) %>%
  dplyr::mutate(s1 = as.numeric(s1) - 1899) %>%
  dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>%
  dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>%
  tidyr::pivot_wider(-c(f1, f2, s1, s2), names_from = "cols", values_from = value)
# A tibble: 2 x 3
#   Title `Similarity for 1 - 2` `Similarity for 2 - 3`
#   <chr>                  <dbl>                  <dbl>
# 1 Doc1                   0.737                  0.654
# 2 Doc2                   0.576                  NA
```
Another solution:
```r
df %>%
  group_by(Title) %>%
  # embed(Header, 2) pairs every Header with its predecessor within a Title
  summarize(name = embed(Header, 2), .groups = 'drop') %>%
  # look up each pair's score in df (indexed by Header), then reduce the
  # pair labels to names like "1_2"
  mutate(value = transform(df, row.names = Header)[name],
         name = str_remove_all(paste(name[, 2], name[, 1], sep = '_'), '[^_]+[.]')) %>%
  pivot_wider()
# A tibble: 2 x 3
#   Title `1_2`     `2_3`
#   <chr> <chr>     <chr>
# 1 Doc1  0.7369358 0.6544884
# 2 Doc2  0.5755243 NA
```
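With hundreds of titles of varying length, one remaining step is reproducing the NA-padded "Similarity for i-(i+1)" layout of the expected output from per-title score vectors. A hedged base-R sketch; the `pairs` list is a made-up stand-in for real per-title results:

```r
# Named list: one vector of consecutive-pair similarities per title
pairs <- list(Doc1 = c(0.7369358, 0.6544884),
              Doc2 = c(0.5755243))

k <- max(lengths(pairs))  # the widest title decides the number of columns

# Pad every vector with NA up to length k, then bind rows into a data frame
wide <- data.frame(
  Title = names(pairs),
  t(vapply(pairs, function(p) c(p, rep(NA_real_, k - length(p))),
           numeric(k))),
  row.names = NULL,
  check.names = FALSE
)
names(wide)[-1] <- paste0("Similarity for ", seq_len(k), "-", seq_len(k) + 1)
```

Shorter titles simply end in NA, so titles of any length line up in one frame.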