Is there a way to use dplyr distinct() to consider similar values as equal?
I have to run an analysis of scientific papers published in a list of more than 20,000 journals. My list has over 450,000 records, but with a number of duplicates (e.g. a paper with more than one author from different institutions appears more than once).
I need to count the number of distinct papers per journal, but the problem is that different authors do not always provide the information in the same way, so I can end up with something like the table below:
JOURNAL PAPER
0001-1231 A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING.
0001-1231 DECYCLING WITH A MATCHING
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH.
0001-1232 DECISION TREE CLASSIFICATION WITH BOUNDED NUMBER OF ERRORS
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 COOPERATIVE CAPACITATED FACILITY LOCATION GAMES
0001-1232 OPTIMAL SUFFIX SORTING AND LCP ARRAY CONSTRUCTION FOR ALPHABETS
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)
0001-1232 ON THE GEODETIC NUMBER OF COMPLEMENTARY PRISMS
0001-1232 DESIGNING MICROTISSUE BIOASSEMBLIES FOR SKELETAL REGENERATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION
My goal is to use something like:
data %>%
  distinct(JOURNAL, PAPER) %>%
  group_by(JOURNAL) %>%
  mutate(papers_in_journal = n())
So I would get something like this:
JOURNAL papers_in_journal
0001-1231 6
0001-1232 7
The problem, as you can see, is that the paper names contain some errors. Some have a period at the end; some have extra spaces or substituted symbols; some have other minor variations, such as W[1]-HARD vs. W-HARD. So if I run the code as is, what I get is:
JOURNAL papers_in_journal
0001-1231 10
0001-1232 10
My question: is there a way to allow for a margin of similarity when using distinct() or some similar command, so that I could write something like distinct(JOURNAL, PAPER %within% 0.95)?
In that sense, I want the command to treat the following as equal:
A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
=
A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
=
THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
DECYCLING WITH A MATCHING
=
DECYCLING WITH A MATCHING.
etc.
I suppose there is no simple solution using distinct(), and I could not find any alternative command that does this. So if it is not possible, I would also appreciate any suggestion of a disambiguation algorithm I might use.
Thank you.
One option is to use agrep together with lapply to find the indices of journal articles that are ≤ 10% dissimilar (the default for agrep, which you can change via the max.distance argument), then take the first index of each match, vectorising with sapply, take the unique indices and the length of that vector, and wrap the whole thing in tapply to select the number of "dissimilar" articles in each journal.
tapply(data$PAPER, data$JOURNAL, FUN = function(x) {
  length(unique(sapply(lapply(x, function(y) agrep(y, x)), "[", 1)))
})
# 0001-1231 0001-1232
#         6         8
For a dplyr version, which returns the result in a nicer format, I put the above code into a function and then used group_by() followed by summarise().
dissimilar <- function(x, distance = 0.1) {
  length(unique(sapply(lapply(x, function(y)
    agrep(y, x, max.distance = distance)), "[", 1)))
}
Here "dissimilar" is defined according to the agrep documentation.
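To get a feel for what that tolerance means, here is a quick check on two titles from the question (a sketch; with the default max.distance = 0.1, agrep allows an edit distance of up to 10% of the pattern length):

# The trailing-period variant is within the 10% tolerance; an unrelated
# title is not. agrep() returns the indices of the matching elements.
agrep("DECYCLING WITH A MATCHING",
      c("DECYCLING WITH A MATCHING.",
        "AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS"),
      max.distance = 0.1)
# [1] 1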
library(dplyr)

data %>%
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER))
# A tibble: 2 x 2
  JOURNAL       n
  <chr>     <int>
1 0001-1231     6
2 0001-1232     8
However, with a larger dataset, such as one containing thousands of journals and 450,000+ articles, the above is rather slow (around 10-15 minutes on my 2.50GHz Intel). I realised that the dissimilar function was unnecessarily comparing every row with every other row, which makes little sense. Ideally, each row should be compared only with itself and all remaining rows. For example, the first journal contains 5 very similar articles on rows 8-12; running agrep on row 8 returns all 5 indices, so there is no need to compare rows 9-12 with any other row. I therefore replaced the lapply with a for loop, and now a 450,000-row dataset takes only 2-3 minutes to process.
dissimilar <- function(x, distance = 0.1) {
  lst <- list()               # initialise the list of match indices
  k <- 1:length(x)            # k = indices of PAPERs still to compare with
  for (i in k) {              # i = each PAPER, k = itself and all remaining
    lst[[i]] <- agrep(x[i], x[k], max.distance = distance) + i - 1
    # + i - 1 ensures that the original index in x is maintained
    k <- k[!k %in% lst[[i]]]  # remove elements which are similar
  }
  lst <- sapply(lst, "[", 1)  # take only the first of each item in the list
  length(na.omit(lst))        # count the number of elements
}
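As a quick sanity check that the loop version behaves as intended, here is a small hand-made subset of the question's titles (the expected answer is 2 distinct papers):

titles <- c("DECYCLING WITH A MATCHING",
            "DECYCLING WITH A MATCHING",
            "DECYCLING WITH A MATCHING.",
            "AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS")
dissimilar(titles)
# [1] 2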
Now expand the original example dataset so that it contains 450,000 records across roughly 18,000 journals, each with roughly 25 articles.
n <- 450000
data2 <- do.call("rbind", replicate(round(n/26), data, simplify=FALSE))[1:n,]
data2$JOURNAL[27:n] <- rep(paste0("0002-", seq(1, n/25)), each=25)[1:(n-26)]
data2 %>%
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER))
# A tibble: 18,001 x 2
   JOURNAL        n
   <chr>      <int>
 1 0001-1231      6   # <-- same as before
 2 0001-1232      8
 3 0002-1        14
 4 0002-10       14
 5 0002-100      14
 6 0002-1000     13
 7 0002-10000    14
 8 0002-10001    14
 9 0002-10002    14
10 0002-10003    14
# ... with 17,991 more rows
The challenge is to find a way to speed up the process even further.
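One possibility (a sketch only, not benchmarked here): many of the duplicates differ only in punctuation or whitespace, so a cheap exact-match pre-pass on normalised titles shrinks the groups before the quadratic agrep pass runs. The normalise helper below is illustrative, not part of the solution above.

# Collapse punctuation/whitespace variants exactly first, then run the
# fuzzy pass on what remains. `normalise` is a hypothetical helper.
normalise <- function(x) {
  x <- gsub("[[:punct:]]", " ", x)   # periods, colons, brackets, dashes -> space
  gsub("\\s+", " ", trimws(x))       # squeeze runs of whitespace
}
data2 %>%
  mutate(PAPER = normalise(PAPER)) %>%
  distinct(JOURNAL, PAPER) %>%
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER))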
You will want to use a package designed for natural language processing. Try the quanteda package.
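For example, one could represent each title as a bag of words and collapse titles whose cosine similarity is close to 1. A rough sketch (assuming quanteda plus its companion package quanteda.textstats for textstat_simil()):

library(quanteda)
library(quanteda.textstats)

titles <- c("DECYCLING WITH A MATCHING",
            "DECYCLING WITH A MATCHING.",
            "AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS")
m   <- dfm(tokens(titles, remove_punct = TRUE))        # document-feature matrix
sim <- as.matrix(textstat_simil(m, method = "cosine"))
sim["text1", "text2"]  # 1: the two DECYCLING variants have identical word bags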