Is there a way to use dplyr distinct() to consider similar values as equal?

I have to analyse scientific papers published in a list of more than 20,000 journals. My list has more than 450,000 records, but with several duplicates (e.g.: a paper with more than one author from different institutions appears more than once).

Well, I need to count the number of distinct papers per journal, but the problem is that different authors do not always provide the information in the same way, so I can get something like the table below:

JOURNAL          PAPER
0001-1231        A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231        A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231        THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231        THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231        COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231        COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231        AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS
0001-1231        DECYCLING WITH A MATCHING
0001-1231        DECYCLING WITH A MATCHING
0001-1231        DECYCLING WITH A MATCHING
0001-1231        DECYCLING WITH A MATCHING.
0001-1231        DECYCLING WITH A MATCHING
0001-1231        ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH
0001-1231        ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH.
0001-1232        DECISION TREE CLASSIFICATION WITH BOUNDED NUMBER OF ERRORS
0001-1232        AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232        AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH 
0001-1232        COOPERATIVE CAPACITATED FACILITY LOCATION GAMES
0001-1232        OPTIMAL SUFFIX SORTING AND LCP ARRAY CONSTRUCTION FOR ALPHABETS
0001-1232        FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )
0001-1232        FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)
0001-1232        ON THE GEODETIC NUMBER OF COMPLEMENTARY PRISMS
0001-1232        DESIGNING MICROTISSUE BIOASSEMBLIES FOR SKELETAL REGENERATION
0001-1232        GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232        GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232        GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION

My goal is to use something like:

data %>%
  distinct(JOURNAL, PAPER) %>%
  group_by(JOURNAL) %>%
  mutate(papers_in_journal = n())

So I would get something like:

JOURNAL      papers_in_journal
0001-1231    6
0001-1232    7

The problem is that, as you can see, there are some errors in the names of the published papers. Some have a period at the end; some have extra spaces or substituted symbols; some have other minor variations, such as W[1]-HARD versus W-HARD. So, if I run the code as is, what I get is:

JOURNAL      papers_in_journal
0001-1231    10
0001-1232    10

My question: is there a way to allow for a similarity margin when using distinct() or a similar command, so that I could write something like distinct(JOURNAL, PAPER %within% 0.95)?

In that sense, I want the command to consider:

A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
=
A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS

THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
=
THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH

DECYCLING WITH A MATCHING
=
DECYCLING WITH A MATCHING.

etc.

I suppose there is no simple solution using distinct(), and I could not find any alternative command to do this. So, if it is not possible, any disambiguation algorithm you could suggest would also be appreciated.

Thank you.

One option would be to use agrep with lapply to find the indices of journal articles that differ by ≤10% (the default for agrep, which you can change with the max.distance argument), then take the first index of each match set, vectorise it with sapply, take the unique indices and the length of the resulting vector, and wrap it all in tapply to get the number of "dissimilar" articles in each journal.

  tapply(data$PAPER, data$JOURNAL, FUN=function(x) {
      length(unique(sapply(lapply(x, function(y) agrep(y, x)), "[", 1)))
     })

# 0001-1231 0001-1232 
#         6         8 

For a dplyr version, which returns the results in a nicer format, I put the above code into a function and then used group_by() followed by summarise().

# For each string, agrep() returns the indices of its approximate matches
# in x; keeping only the first index of each match set and counting the
# unique values gives the number of "dissimilar" papers.
dissimilar <- function(x, distance=0.1) {
  length(unique(sapply(lapply(x, function(y) 
     agrep(y, x, max.distance = distance)), "[", 1)))
}

Here, "dissimilar" is defined according to the agrep documentation.
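As a quick illustration with titles from the sample data: agrep() returns the indices of the approximate matches, and with max.distance = 0.1 up to 10% of the pattern may differ, so the missing hyphen is tolerated while the unrelated title is not matched.

agrep("A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
      c("A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
        "DECYCLING WITH A MATCHING"),
      max.distance = 0.1)
# [1] 1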

library(dplyr)

data2 %>%
  group_by(JOURNAL) %>%
  summarise(n=dissimilar(PAPER))

# A tibble: 2 x 2
  JOURNAL       n
  <chr>     <int>
1 0001-1231     6
2 0001-1232     8

However, with a larger dataset, such as one containing thousands of journals and 450,000+ articles, the above is quite slow (around 10-15 minutes on my 2.50GHz Intel). I realised that the dissimilar function unnecessarily compares every row with every other row, which makes little sense. Ideally, each row should only be compared with itself and all remaining rows. For example, the first journal contains 5 very similar articles in rows 8-12. Using agrep on row 8 returns all 5 indices, so there is no need to compare rows 9-12 with any other rows. So I replaced the lapply with a for loop, and now it takes only 2-3 minutes to process a dataset of 450,000 rows.

dissimilar <- function(x, distance=0.1) {
  lst <- list()               # initialise the list of match indices
  k <- 1:length(x)            # k = indices of PAPERS not yet matched
  for(i in 1:length(x)){      # i = each PAPER, compared against x[k] only
    if(!i %in% k){            # already matched by an earlier paper,
      lst[[i]] <- integer(0)  # so skip the comparison entirely
      next
    }
    lst[[i]] <- k[agrep(x[i], x[k], max.distance = distance)]
                              # k[...] maps agrep's positions within x[k]
                              # back to the original indices in x
    k <- k[!k %in% lst[[i]]]  # remove elements which are similar
  }
  lst <- sapply(lst, "[", 1)  # take only the first of each item in the list
  length(na.omit(lst))        # count number of elements
}
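A quick sanity check on a few of the sample titles shows the near-duplicates being collapsed:

dissimilar(c("DECYCLING WITH A MATCHING",
             "DECYCLING WITH A MATCHING.",
             "AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS"))
# [1] 2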

Now expand the original example dataset so that it contains 450,000 records with about 18,000 journals, each containing about 25 articles.

n <- 450000
# replicate the 26-row sample data up to n rows, then overwrite the journal
# IDs from row 27 onwards to create ~18,000 journals of 25 articles each
data2 <- do.call("rbind", replicate(round(n/26), data, simplify=FALSE))[1:n,]
data2$JOURNAL[27:n] <- rep(paste0("0002-", seq(1, n/25)), each=25)[1:(n-26)]

data2 %>%
  group_by(JOURNAL) %>%
  summarise(n=dissimilar(PAPER))

# A tibble: 18,001 x 2
   JOURNAL        n
   <chr>      <int>
 1 0001-1231      6 # <-- Same
 2 0001-1232      8
 3 0002-1        14
 4 0002-10       14
 5 0002-100      14
 6 0002-1000     13
 7 0002-10000    14
 8 0002-10001    14
 9 0002-10002    14
10 0002-10003    14

# ... with 17,991 more rows

The challenge now is to find a way to speed up this process even further.
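One avenue worth exploring (a sketch only, not benchmarked) is to parallelise across journals, since each group is processed independently. For example, with base R's parallel package; note that mclapply relies on forking, so on Windows you would need parLapply instead, and mc.cores = 4 is an arbitrary choice:

library(parallel)

# apply dissimilar() to each journal's titles on separate cores
res <- mclapply(split(data2$PAPER, data2$JOURNAL),
                dissimilar, mc.cores = 4)
data.frame(JOURNAL = names(res), n = unlist(res), row.names = NULL)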

You will want to use a package made for natural language processing. Try the quanteda package.
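As a minimal sketch of the idea (assuming the quanteda and quanteda.textstats packages are installed; the 0.9 cosine cutoff is an arbitrary choice): build a document-feature matrix from the titles and flag pairs whose similarity exceeds the threshold.

library(quanteda)
library(quanteda.textstats)

titles <- c("DECYCLING WITH A MATCHING",
            "DECYCLING WITH A MATCHING.",
            "COOPERATIVE CAPACITATED FACILITY LOCATION GAMES")

# tokenise without punctuation so the trailing period is ignored
dfmat <- dfm(tokens(titles, remove_punct = TRUE))
sim <- textstat_simil(dfmat, method = "cosine", margin = "documents")
as.matrix(sim) > 0.9   # TRUE for the near-duplicate pair (and self-matches)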