Deduplicate a table based on values in 2 columns + fuzzy matching
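I have a CSV file exported from Zotero that contains the metadata of my library items. I know it contains a lot of duplicates, but getting rid of them is not straightforward:
Not all items with similar titles are actually duplicates, e.g.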
| Year | Author | Title |
+------+-------------------------------+--------------+
| 2016 | Jones, Erik | Book Reviews |
| 2016 | Hassner, Pierre; Jones, Erik | Book Reviews |
| 2010 | Adams, Laura L.; Gagnon, Chip | Book Reviews |
Not all items that are in fact the same have 100% identical metadata strings, e.g.
| Author | Title |
+---------------+-----------------------------------------------+
| Tichý, Lukáš; | Can Iran Reduce EU Dependence on Russian Gas? |
| Tichy, L.; | "can iran reduce eu dependence onrussian gas" |
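This is an extreme example (the differences are usually not that big), but as you can see, pre-cleaning does not fully solve the problem; so the idea is to eliminate rows that contain similar values in two or more columns, e.g. "Author" and "Title".
So far I have tried/looked at:
- OpenRefine - I am barely familiar with it, so I could not come up with or find any workable approach.
- Excel fuzzy lookup extension - did not work the way I needed.
- Python - again, I am not good with the language; I could not find any relevant solutions/guides.
- R: tried a few ideas:
First, use agrep in a for loop over the "Author" column to get the indices of rows with duplicates; then do the same for the "Title" column; then compare the vectors and deduplicate the rows where the values coincide. Needless to say, I could not get past step 1: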
titles <- unlist(corpus$"Title")
for (i in 1:length(titles)){
Title_dupe_temp <- agrep(titles[i], titles[i+1:length(titles)],
max.distance = 1, ignore.case = TRUE, fixed = FALSE)
Title_dupes[i] <- paste(i, Title_dupe_temp, sep = " ")
}
The result is (almost) complete gibberish; in addition I get the warning message:
In Title_dupes[i] <- paste(i, Title_dupe_temp, sep = " ") :
number of items to replace is not a multiple of replacement length
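I also read through the fuzzywuzzyR documentation, but did not find any useful functions.
Finally, I tried the RecordLinkage package. Still, I could not go past the basics. The documentation is rather heavy and not explicit on all things; guides are scarce, and the ones I've found (e.g. this) use example datasets that come with ready-made identity vectors, so I don't know how to replicate that on my data.
So at this point I don't care whether it is done in OpenRefine/R/Py/SQL/whatever, as long as it gets done.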
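For what it's worth, the warning in the loop above seems to come from two things. First, `i+1:length(titles)` is parsed as `i + (1:length(titles))`, because `:` binds tighter than `+`, so the index runs past the end of the vector. Second, `paste()` returns one string per match, and a vector of several strings cannot be stored in the single slot `Title_dupes[i]`. Below is a minimal sketch of a corrected loop. The `titles` vector is a made-up sample rather than the real export, and `fixed = TRUE` is my assumption so that the `?` in a title is not read as a regex quantifier.

```r
# Made-up sample titles standing in for corpus$Title
titles <- c("Book Reviews", "Book Reviews", "Book Reviews",
            "Can Iran Reduce EU Dependence on Russian Gas?",
            "can iran reduce eu dependence onrussian gas")

n <- length(titles)
# A list can hold zero, one or many matching indices per title
title_dupes <- vector("list", n)

for (i in seq_len(n - 1)) {
  later <- (i + 1):n   # parentheses matter: i + 1:n would mean i + (1:n)
  hits <- agrep(titles[i], titles[later], max.distance = 1,
                ignore.case = TRUE, fixed = TRUE)
  # agrep() returns positions within titles[later]; map them back
  title_dupes[[i]] <- later[hits]
}

title_dupes[[1]]   # indices of later titles that approximately match title 1
```

Each element of `title_dupes` then holds the row indices of later titles that approximately match title `i`, which can be intersected with the same result computed on the "Author" column.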
Solution I:
Using a loop and the library stringdist:
library(stringdist)
zotero<-data.frame(
Year=c(2016,2016,2010,2010,2010,2010),
Author=c("Jones, Erik","Hassner, Pierre;","Adams, Laura L.;","Tichý, Lukáš;","Tichý, Lukáš;","Tichy, L.;"),
Title=c("Book Reviews","Book Reviews","Book Reviews","Can Iran Reduce EU Dependence on Russian Gas?","Can Iran Reduce EU Dependence on Russian Gas?","can iran reduce eu dependence onrussian gas")
)
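# Collapse each record into a single string for comparison
zotero$onestring<-paste0(zotero$Year,zotero$Author,zotero$Title)
# Sort by year and author so that likely duplicates end up next to each other
zotero<-zotero[order(zotero[,1],zotero[,2]),]
# Distance between each row and the previous one, normalised by their combined length
atot<-NULL
for (i in 2:dim(zotero)[1]){
a<-stringdist(zotero$onestring[i-1],zotero$onestring[i])/(nchar(zotero$onestring[i-1])+nchar(zotero$onestring[i]))
atot<-rbind(atot,a)
}
# The first row has no predecessor, so give it the maximum value 1
zotero<-cbind(zotero,threshold=c(1,atot))
# Keep only rows that differ enough from the row before them
zotero[zotero$threshold>0.15,]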
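Solution II: using a matrix computation may be faster than using a loop: first I create the data frame from your data sample, second I remove non-UTF characters, third I use the library stringdist to compute a distance matrix. You can easily convert these to similarity percentages.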
zotero<-data.frame(
Year=c(2016,2016,2010,2010,2010,2010),
Author=c("Jones, Erik","Hassner, Pierre;","Adams, Laura L.;","Tichý, Lukáš;","Tichý, Lukáš;","Tichy, L.;"),
Title=c("Book Reviews","Book Reviews","Book Reviews","Can Iran Reduce EU Dependence on Russian Gas?","Can Iran Reduce EU Dependence on Russian Gas?","can iran reduce eu dependence onrussian gas")
)
zotero$onestring<-paste0(zotero$Year,zotero$Author,zotero$Title)
Encoding(zotero$onestring) <- "UTF-8"
zotero$onestring<-iconv(zotero$onestring, "UTF-8", "UTF-8",sub='')
library(stringdist)
stringdistmatrix(zotero$onestring)
Result:
> stringdistmatrix(zotero$onestring)
   1  2  3  4  5
2 11
3 13 14
4 47 45 44
5 47 45 44  0
6 47 45 42 13 13
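The conversion to similarity percentages mentioned above could look roughly like this. It is only a sketch under a few assumptions of mine: it uses base R's `adist()` (plain Levenshtein) instead of `stringdistmatrix()` so that it runs without extra packages, the three strings are a shortened sample, and dividing by the length of the longer string is just one possible normalisation.

```r
# Shortened sample of the combined Year + Author + Title strings
strings <- c("2016Jones, ErikBook Reviews",
             "2016Hassner, Pierre;Book Reviews",
             "2010Tichy, L.;can iran reduce eu dependence onrussian gas")

# Full pairwise Levenshtein distance matrix (base R, utils::adist)
d <- adist(strings)

# Length of the longer string in each pair, used as the normaliser
len <- nchar(strings)
max_len <- outer(len, len, pmax)

# 100 = identical strings, 0 = nothing in common
sim <- round(100 * (1 - d / max_len), 1)
sim
```

Pairs whose similarity is above some cut-off (which has to be chosen by looking at the data) are then the candidates for duplicates.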
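I have a similar approach to @Nakx, and I like the matrix solution. However, you could also try doing more cleaning with gsub and iconv, and use sapply for the matching (taking the best match that is not the value itself, i.e. not the 0). Like this: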
> library(RecordLinkage)
>
> zotero<-data.frame(
+ Year=c(2016,2016,2010,2010,2010,2010),
+ Author=c("Jones, Erik","Hassner, Pierre;","Adams, Laura L.;","Tichý, Lukáš;","Tichý, Lukáš;","Tichy, L.;"),
+ Title=c("Book Reviews","Book Reviews","Book Reviews","Can Iran Reduce EU Dependence on Russian Gas?","Can Iran Reduce EU Dependence on Russian Gas?","can iran reduce eu dependence onrussian gas")
+ )
>
> # Converting the special characters
> zotero$Author_new <- iconv(zotero$Author, from = '', to = "ASCII//TRANSLIT")
> zotero$Author_new <- tolower(zotero$Author_new)
> zotero$Author_new <- gsub("[[:punct:]]", "", zotero$Author_new)
>
> # Removing punctuation and making it lowercase
> zotero$Title_new <- gsub("[[:punct:]]", "", zotero$Title)
> zotero$Title_new <- tolower(zotero$Title_new)
>
> # Removing exact duplicates
> dups <- duplicated(zotero[,c("Title_new", "Author_new", "Year")])
> zotero <- zotero[!dups,]
> zotero
Year Author Title Author_new
1 2016 Jones, Erik Book Reviews jones erik
2 2016 Hassner, Pierre; Book Reviews hassner pierre
3 2010 Adams, Laura L.; Book Reviews adams laura l
4 2010 Tichý, Lukáš; Can Iran Reduce EU Dependence on Russian Gas? tichy lukas
6 2010 Tichy, L.; can iran reduce eu dependence onrussian gas tichy l
Title_new Title_dist Author_dist
1 book reviews 0 9
2 book reviews 0 9
3 book reviews 0 9
4 can iran reduce eu dependence on russian gas 0 0
6 can iran reduce eu dependence onrussian gas 1 4
>
> # Creating a distance measure for your title and author
> zotero$Title_dist <- sapply(zotero$Title_new, function(x) sort(levenshteinDist(x, zotero$Title_new))[2])
> zotero$Author_dist <- sapply(zotero$Author_new, function(x) sort(levenshteinDist(x, zotero$Author_new))[2])
>
> # Filter here
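From there you can use the distance variables to create conditions and filters. For example, if an article has an author distance of 2 and a title distance of 5, you may feel comfortable removing it.
Edited to clarify the filtering example. You will want to adjust this after looking at your data. It is always good to start conservative.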
> library(dplyr)
> zotero <- zotero %>%
+ group_by(Year) %>%
+ filter(!between(Title_dist, 1, 5) |
+ !between(Author_dist, 1, 5))
> zotero
# A tibble: 4 x 7
# Groups: Year [2]
Year Author Title Author_new Title_new Title_dist Author_dist
<dbl> <fct> <fct> <chr> <chr> <int> <int>
1 2016 Jones, Erik Book Reviews jones erik book reviews 0 9
2 2016 Hassner, Pi~ Book Reviews hassner pie~ book reviews 0 9
3 2010 Adams, Laur~ Book Reviews adams laura~ book reviews 0 9
4 2010 Tichý, Luká~ Can Iran Reduce EU Depen~ tichy lukas can iran reduce eu depende~ 0 0
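A nearest-neighbour distance is small for both members of a close pair, so if you want to be sure that exactly one copy of each pair survives, one option is to compare rows pairwise and drop only the later row. The following is a rough sketch of that idea, not the method used above: it works on a made-up, already cleaned sample, uses base R's `adist()` instead of `levenshteinDist()`, and the thresholds `max_title_dist` and `max_author_dist` are hypothetical values that need tuning.

```r
# Toy data after the cleaning step (lowercase, no punctuation, ASCII only)
refs <- data.frame(
  year   = c(2016, 2016, 2010, 2010, 2010),
  author = c("jones erik", "hassner pierre", "adams laura l",
             "tichy lukas", "tichy l"),
  title  = c("book reviews", "book reviews", "book reviews",
             "can iran reduce eu dependence on russian gas",
             "can iran reduce eu dependence onrussian gas"),
  stringsAsFactors = FALSE
)

# Pairwise Levenshtein distances for each column (base R, utils::adist)
title_d  <- adist(refs$title)
author_d <- adist(refs$author)

# Hypothetical thresholds: tune them after inspecting your own data
max_title_dist  <- 5
max_author_dist <- 5

# A pair (i, j) is a candidate duplicate only if BOTH columns are close
# and the year matches; upper.tri() keeps each pair once (i < j)
same_year <- outer(refs$year, refs$year, "==")
is_pair <- title_d <= max_title_dist & author_d <= max_author_dist &
  same_year & upper.tri(title_d)
pairs <- which(is_pair, arr.ind = TRUE)

# Drop the later row of each pair, keep the earlier one
drop_rows <- unique(pairs[, "col"])
deduped <- if (length(drop_rows)) refs[-drop_rows, ] else refs
deduped
```

On this sample the three "book reviews" rows are kept because their authors are far apart, while the two Tichy rows are treated as one pair and only the first of them is kept.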